SLP cannot replace undefs with poison without extra checks, plus need to
preserve the original order of scalars, if decided not to shuffle the
final vectorbuild sequence.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | Doesn't an undef here mean that lane 4 can potentially be poison? Wouldn't that be incorrect? |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | A mask of undef means the result is undef, not poison, even if the input contains poison elements. See https://llvm.org/docs/LangRef.html#shufflevector-instruction (which is less clear than I would have liked, it used to explicitly say undef, it now says "undefined" to be more readable, but it's not obvious that it is still meant to refer to undef rather than poison) |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | Right, thanks! |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | D93818 linked for reference. Okay, are you saying for the purpose of SLPVectorizer, you would prefer that we already treat undef masks as selecting poison? That should always result in code that is already correct today, but it may not always be optimal under the current rules. |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | I'll try to update this patch to follow this new rule |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | I think my D127073 actually already achieved that. The concerns there of regressions there are cases where under the current rules, we can use undef masks, but under those proposed new rules of undef masks, we cannot as we need to ensure the result is not poison. However, this diff is intended to also perform some additional optimisations so it may still be worth updating this one. |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | Yep, my update would be pretty similar to your original patch, just some situation will be handled a bit better. Just need to clarify the question, if shuffle <ld, ld, ld, undef> can be safely replaced by shuffle <ld, ld, ld, ld> |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | If we ensure that the vectorizer only ever generates a mask of <0, 0, 0, undef> when we element 3 is meant to be poison, it will be valid to merge that with <0, 0, 0, 0> even without waiting for D93818, I think: we do not have to support arbitrary existing shufflevectors here that may rely on undef masks giving undef results, we only have to worry about the new shufflevectors we generate. |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) |
It's correct to do that in just 2 cases:
If instead of undef, you've poison, then it can safely be replaced with ld. |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | This explanation is correct for arbitrary existing shuffles, but I wrote earlier that I do not think that is the case we are concerned with. We are concerned with new shuffle instructions that we generate. If we know that the undef mask was generated by SLPVectorizer (because it is a mask that is not even actually the operand to a shufflevector instruction yet, because it is a mask that we have built up in order to potentially generate a shufflevector instruction), and we decide that in SLPVectorizer, we only generate an undef mask if we originally wanted the result to be poison, that is also a case where we can treat an undef mask as selecting poison. |
llvm/test/Transforms/SLPVectorizer/X86/diamond_broadcast_extra_shuffle.ll | ||
---|---|---|
37 ↗ | (On Diff #434535) | Let me slightly make a progress to that patch and its friends. I think I will have some time for this. |
Comparing the changes to the tests in this diff to those in D127073, I am seeing a number of tests where we have more shufflevectors, and none where we have fewer. Are there improvements that are not as obvious to see?
Yes, there are. These extra shuffles caused by changes in performExtractsShuffleAction() and in IsIdenticalOrLessDefined lambda, these changes (they treat UndefMaskElem as possible poison) increase number of shuffles. Without them, there are less shuffles, these extra changes are required for correct handling of UndefMaskElem as posion.
That suggests that D127073 still results in incorrect code in some cases. I was under the impression that it was already correct, just not optimal. Can you point to specific tests where you believe D127073 results in wrong code?
Taking a random example
--- a/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll @@ -19,9 +19,11 @@ define { <2 x float>, <2 x float> } @foo(%struct.sw* %v) { ; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], undef ; CHECK-NEXT: [[TMP9:%.*]] = fadd <4 x float> [[TMP8]], undef ; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 0, i32 1> -; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 2, i32 3> -; CHECK-NEXT: [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[TMP10]], 0 -; CHECK-NEXT: [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[TMP11]], 1 +; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x float> undef, <2 x float> [[TMP10]], <2 x i32> <i32 2, i32 3> +; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <4 x float> [[TMP9]], <4 x float> poison, <2 x i32> <i32 2, i32 3> +; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <2 x float> undef, <2 x float> [[TMP12]], <2 x i32> <i32 2, i32 3> +; CHECK-NEXT: [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[TMP11]], 0 +; CHECK-NEXT: [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[TMP13]], 1 ; CHECK-NEXT: ret { <2 x float>, <2 x float> } [[INS2]] ; entry:
The extra shufflevectors here, TMP11 and TMP13, are identity shuffles that should not have been generated, they do not change the handling of undef/poison.
Yes, I see some extra shuffles, that can be removed safely. Working on the improvements.
Thanks, I am seeing improvements with this new version. I'll try to go over the changes in more detail later, some initial superficial comments now.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6994 | This extra overload does not look like it adds anything: it takes a parameter U if and only if U is Value *. The previous overload takes Value * if and only if U is Value *. That is the same thing; this new overload does not appear to add anything and the file compiles successfully for me if I remove this. | |
6999 | This extra overload goes against the documentation above ValueSelect. The documentation for ValueSelect says it takes a Value *, and if a Value * is wanted, returns that value, otherwise returns a default value. This extra overload is needed because ValueSelect::get now gets called with a const TreeEntry * in the instantiation of performExtractsShuffleAction<const TreeEntry>, but that is contrary to its documentation and it is not clear what it is supposed to mean. | |
7030 | a && b || !a && c is simpler expressed as a ? b : c. | |
7039 | (a ? b : c) == b is simpler expressed as a || c == b. | |
7045 | This comment seems like it does not match how IsCompleteIdentity is set: we can get IsCompleteIdentity to be false even when we select elements only from a single vector. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9146 | These ->getAggregateElement(...) and ->getOperand(...) expressions repeat several times in the code and make this code quite verbose. Is there a way to avoid this repetition? |
llvm/test/Transforms/SLPVectorizer/X86/buildvector-same-lane-insert.ll | ||
---|---|---|
50 | TMP9 seems to be redundant here, it looks like it is a copy of TMP3: I guess this was an issue even before this patch: TMP8 was a copy of TMP3, so the TMP8 shufflevector was redundant. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9165–9168 | Actually... That just looks very wrong, = rather than ==. |
This extra overload does not look like it adds anything: it takes a parameter U if and only if U is Value *. The previous overload takes Value * if and only if U is Value *. That is the same thing; this new overload does not appear to add anything and the file compiles successfully for me if I remove this.