This is an archive of the discontinued LLVM Phabricator instance.

Differential D115811

[SLP]Early exit out of the reordering if shuffled/perfect diamond match found.
ClosedPublic

Authored by ABataev on Dec 15 2021, 11:09 AM.

Details

Reviewers

RKSimon
vporpo
anton-afanasyev
dtemirbulatov

Commits

rG65fc99257990: [SLP]Early exit out of the reordering if shuffled/perfect diamond match found.

Summary

We need to exit the reordering process early if a perfect or shuffled diamond match is found in the operands. Such a pattern results in unprofitable reordering because of (false positive) external uses of scalars.
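To make the summary concrete, here is a minimal standalone sketch of the kind of check the patch adds. It is a model, not the LLVM code: `OperandTable`, `isPowerOfTwo` and `skipReordering` are hypothetical names, plain `int` ids stand in for `Value *`, and `std::set` stands in for `SmallPtrSet`.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for the operand table: ops[OpIdx][Lane] holds an
// opaque value id (the real code stores llvm::Value pointers).
using OperandTable = std::vector<std::vector<int>>;

// True for 1, 2, 4, 8, ... (written out to avoid depending on LLVM headers).
inline bool isPowerOfTwo(std::size_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Returns true when every operand vector draws only from the values of the
// first one (a perfect or shuffled diamond match), except for the two cases
// the patch excludes for now: exactly two unique values (a possible
// broadcast) and a non-power-of-2 number of unique values.
inline bool skipReordering(const OperandTable &ops) {
  assert(!ops.empty() && "need at least one operand vector");
  std::set<int> unique(ops.front().begin(), ops.front().end());
  for (std::size_t i = 1; i < ops.size(); ++i)
    for (int v : ops[i])
      if (unique.count(v) == 0)
        return false; // a value outside the first operand: no diamond match
  return unique.size() != 2 && isPowerOfTwo(unique.size());
}
```

With operands `{1, 2, 3, 4}` and `{2, 1, 4, 3}` the sketch reports a shuffled match and reordering would be skipped; replacing any value in the second operand with one not present in the first makes it return false.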

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision. Dec 15 2021, 11:09 AM

Herald added a subscriber: hiraditya. Dec 15 2021, 11:09 AM

ABataev requested review of this revision. Dec 15 2021, 11:09 AM

Herald added a project: Restricted Project. Dec 15 2021, 11:09 AM

Harbormaster completed remote builds in B139473: Diff 394611. Dec 15 2021, 11:46 AM

vporpo added inline comments. Dec 15 2021, 12:06 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	Isn't `VLOperands` a better place for this logic? Perhaps a method like: `isDiamondMatch()` ? This will also help separate the temporary check `UniqueValues.size() == 2 || !isPowerOf2_32(UniqueValues.size())`. What do you think?
1666	nit: Perhaps a TODO here?

ABataev added inline comments. Dec 15 2021, 12:14 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	I just thought that we may have this situation after the very first iteration of the reordering, not only initially.
1666	Will do
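For readers comparing the expression quoted in the review comment with the one in the landed hunk: by De Morgan's law, `size == 2 || !isPowerOf2_32(size)` is the exact negation of `size != 2 && isPowerOf2_32(size)`. The sketch below checks this over a range of sizes. The helper names are hypothetical, and `isPowerOfTwo32` is a local stand-in for `llvm::isPowerOf2_32`; how the earlier diff used the quoted expression is not shown on this page.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for llvm::isPowerOf2_32: true for powers of two > 0.
inline bool isPowerOfTwo32(std::uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Expression quoted in the review comment.
inline bool quotedCheck(std::uint32_t n) { return n == 2 || !isPowerOfTwo32(n); }

// Expression returned by the lambda in the landed hunk.
inline bool landedCheck(std::uint32_t n) { return n != 2 && isPowerOfTwo32(n); }

// Returns true if the two expressions disagree for every n in [0, limit],
// i.e. each is the logical negation of the other on that range.
inline bool areNegations(std::uint32_t limit) {
  assert(limit < UINT32_MAX && "keep the loop counter from wrapping");
  for (std::uint32_t n = 0; n <= limit; ++n)
    if (quotedCheck(n) == landedCheck(n))
      return false;
  return true;
}
```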

Added a comment

vporpo added inline comments. Dec 15 2021, 1:31 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	I am not sure I follow why moving this logic to a member method in VLOperands won't work in this case. Could you elaborate a bit on this?

ABataev added inline comments. Dec 15 2021, 1:33 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	I have the same problem because this function `reorder()` is a member of `VLOperands` class :)

ABataev added inline comments. Dec 15 2021, 2:46 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	Or do you suggest turning this lambda into a member function? If so, I think keeping the lambda is better because it does not increase the number of interfaces of the class. If (or when) we have several users of this functionality, it can be outlined into a private member function.

Harbormaster completed remote builds in B139509: Diff 394651. Dec 15 2021, 2:46 PM

vporpo added inline comments. Dec 15 2021, 3:03 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	Yes, I was suggesting moving the loops of this lambda to a function. I understand that this is the only use so it is not really needed, but if we need this same functionality in the future it will be hard to remember that this code already exists in this lambda. So we will probably end up re-implementing it. Anyway, regarding your earlier comment, sorry, I still don't understand what issue you are referring to about the first iteration of reordering. Are you referring to the `Pass`es in the for loop? I am confused.

ABataev added inline comments. Dec 15 2021, 3:05 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	Yes, to passes in the loop. If the pass failed, we still do some reordering and may end up with the diamond match situation.

vporpo added inline comments. Dec 15 2021, 3:27 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	OK, but why do you want to avoid it if it happens in a later pass? Is it going to produce a worse reordering? Do you think the diamond case could be handled by adding a new `ReorderingMode::Diamond` that disables reordering for those operand indexes (or perhaps disables reordering completely)? This could be set in the loop under `// Initialize the modes.`, line 1627. This would also fit with the existing design and won't look like a workaround. What do you think?

ABataev added inline comments. Dec 15 2021, 3:35 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	> OK, but why do you want to avoid it if it happens in a later pass? Is it going to produce a worse reordering?
I think it may.
> Do you think the diamond case could be handled by adding a new ReorderingMode::Diamond ...
Not sure we can do it. We perform the analysis lane by lane, but here we need to perform the analysis in the orthogonal order: operand by operand. We can probably implement some deeper analysis, but it requires extra time to understand how to implement it. Plus, this is not analysis but a kind of corner-case check. The analysis process is not aware of the diamond match, and currently it is pretty hard to teach it about it; we would need to change the design completely.
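The two traversal orders mentioned in this reply can be illustrated with a small standalone sketch. It assumes an operand table indexed as `ops[OpIdx][Lane]`; the type and function names are hypothetical, and strings are used only to make the visiting order visible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical operand table: ops[OpIdx][Lane].
using OperandTable = std::vector<std::vector<std::string>>;

// Lane by lane: for each lane, look across all operand indexes. This is the
// order the reply attributes to the existing reordering analysis.
inline std::vector<std::string> visitLaneByLane(const OperandTable &ops) {
  assert(!ops.empty() && "need at least one operand vector");
  std::vector<std::string> order;
  for (std::size_t lane = 0; lane < ops.front().size(); ++lane)
    for (std::size_t op = 0; op < ops.size(); ++op) {
      assert(ops[op].size() == ops.front().size() && "table must be rectangular");
      order.push_back(ops[op][lane]);
    }
  return order;
}

// Operand by operand: for each operand index, look across all lanes. This is
// the orthogonal order that a diamond-match check needs.
inline std::vector<std::string> visitOperandByOperand(const OperandTable &ops) {
  std::vector<std::string> order;
  for (const std::vector<std::string> &operand : ops)
    for (const std::string &value : operand)
      order.push_back(value);
  return order;
}
```

For a table with operands `{a0, a1}` and `{b0, b1}`, the first function visits `a0, b0, a1, b1` and the second visits `a0, a1, b0, b1`.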

vporpo added inline comments. Dec 15 2021, 4:33 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	Are you planning to move the bail-out check before the pass loop then? Also I would rename the lambda to something like `SkipReordering` because it is actually looking for corner cases when it should bail-out, it is not looking for cases when it should apply reordering. And please make sure that it is clear from the comment that this is the place where any code related to reordering bail-outs should be placed.

ABataev added inline comments. Dec 15 2021, 4:39 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1656–1665	I intentionally moved it into the loop: even if we failed to reorder all the operands/lanes, some of them may still have been reordered, so after a failed reordering attempt we may still have a diamond match. I’ll rename it.
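The effect of placing the bail-out inside the pass loop can be sketched as follows. This is a simplified model under stated assumptions, not the LLVM loop: `runReorderingPasses` is a hypothetical name, and the two callbacks stand in for the `SkipReordering` lambda and for the body of one reordering pass (which returns true when its strategy failed).

```cpp
#include <cassert>
#include <functional>

// Runs at most two passes. The bail-out is evaluated at the top of every
// pass, so a diamond match created by a partially successful first pass is
// still caught before the second pass starts.
// Returns the number of passes that actually ran.
inline int runReorderingPasses(const std::function<bool()> &skipReordering,
                               const std::function<bool()> &runPass) {
  assert(skipReordering && runPass && "both callbacks must be set");
  int executed = 0;
  for (int pass = 0; pass != 2; ++pass) {
    if (skipReordering())
      break; // operands already form a perfect/shuffled diamond match
    bool strategyFailed = runPass(); // may still have reordered some lanes
    ++executed;
    if (!strategyFailed)
      break; // skip the second pass if the first pass did not fail
  }
  return executed;
}
```

If the check ran only once before the loop, a first pass that fails but leaves the operands in a diamond-match state would be followed by a second pass; with the check inside the loop, that second pass is skipped.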

Renamed lambda.

Harbormaster completed remote builds in B139641: Diff 394834. Dec 16 2021, 5:37 AM

LGTM

This revision is now accepted and ready to land. Dec 16 2021, 9:43 AM

This revision was landed with ongoing or failed builds. Dec 16 2021, 11:10 AM

Closed by commit rG65fc99257990: [SLP]Early exit out of the reordering if shuffled/perfect diamond match found. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rG65fc99257990: [SLP]Early exit out of the reordering if shuffled/perfect diamond match found.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	26 lines
llvm/test/Transforms/SLPVectorizer/X86/reorder_diamond_match.ll	69 lines

Diff 394929

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

(1,642 lines above not shown; the hunk is inside `void reorder() {`. Lines marked `+` are added by this diff.)

         else if (isa<Argument>(OpLane0))
           // Our best hope is a Splat. It may save some cost in some cases.
           ReorderingModes[OpIdx] = ReorderingMode::Splat;
         else
           // NOTE: This should be unreachable.
           ReorderingModes[OpIdx] = ReorderingMode::Failed;
       }

+      // Check that we don't have same operands. No need to reorder if operands
+      // are just perfect diamond or shuffled diamond match. Do not do it only
+      // for possible broadcasts or non-power of 2 number of scalars (just for
+      // now).
+      auto &&SkipReordering = [this]() {
+        SmallPtrSet<Value *, 4> UniqueValues;
+        ArrayRef<OperandData> Op0 = OpsVec.front();
+        for (const OperandData &Data : Op0)
+          UniqueValues.insert(Data.V);
+        for (ArrayRef<OperandData> Op : drop_begin(OpsVec, 1)) {
+          if (any_of(Op, [&UniqueValues](const OperandData &Data) {
+                return !UniqueValues.contains(Data.V);
+              }))
+            return false;
+        }
+        // TODO: Check if we can remove a check for non-power-2 number of
+        // scalars after full support of non-power-2 vectorization.
+        return UniqueValues.size() != 2 && isPowerOf2_32(UniqueValues.size());
+      };
+
       // If the initial strategy fails for any of the operand indexes, then we
       // perform reordering again in a second pass. This helps avoid assigning
       // high priority to the failed strategy, and should improve reordering for
       // the non-failed operand indexes.
       for (int Pass = 0; Pass != 2; ++Pass) {
+        // Check if no need to reorder operands since they're are perfect or
+        // shuffled diamond match.
+        // Need to to do it to avoid extra external use cost counting for
+        // shuffled matches, which may cause regressions.
+        if (SkipReordering())
+          break;
         // Skip the second pass if the first pass did not fail.
         bool StrategyFailed = false;
         // Mark all operand data as free to use.
         clearUsed();
         // We keep the original operand order for the FirstLane, so reorder the
         // rest of the lanes. We are visiting the nodes in a circular fashion,
         // using FirstLane as the center point and increasing the radius
         // distance.

(8,531 lines below not shown)

llvm/test/Transforms/SLPVectorizer/X86/reorder_diamond_match.ll

(Lines marked `-` are from the left side of the diff, lines marked `+` from the right side; unmarked lines are unchanged.)

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake-avx512 | FileCheck %s

 define void @test() {
 ; CHECK-LABEL: @test(
 ; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8* undef, i64 4
-; CHECK-NEXT: [[TMP2:%.*]] = load i8, i8* [[TMP1]], align 1
-; CHECK-NEXT: [[TMP3:%.*]] = zext i8 [[TMP2]] to i32
-; CHECK-NEXT: [[TMP4:%.*]] = sub nsw i32 0, [[TMP3]]
-; CHECK-NEXT: [[TMP5:%.*]] = shl nsw i32 [[TMP4]], 0
-; CHECK-NEXT: [[TMP6:%.*]] = add nsw i32 [[TMP5]], 0
-; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, i8* undef, i64 5
-; CHECK-NEXT: [[TMP8:%.*]] = load i8, i8* [[TMP7]], align 1
-; CHECK-NEXT: [[TMP9:%.*]] = zext i8 [[TMP8]] to i32
-; CHECK-NEXT: [[TMP10:%.*]] = sub nsw i32 0, [[TMP9]]
-; CHECK-NEXT: [[TMP11:%.*]] = shl nsw i32 [[TMP10]], 0
-; CHECK-NEXT: [[TMP12:%.*]] = add nsw i32 [[TMP11]], 0
-; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i8, i8* undef, i64 6
-; CHECK-NEXT: [[TMP14:%.*]] = load i8, i8* [[TMP13]], align 1
-; CHECK-NEXT: [[TMP15:%.*]] = zext i8 [[TMP14]] to i32
-; CHECK-NEXT: [[TMP16:%.*]] = sub nsw i32 0, [[TMP15]]
-; CHECK-NEXT: [[TMP17:%.*]] = shl nsw i32 [[TMP16]], 0
-; CHECK-NEXT: [[TMP18:%.*]] = add nsw i32 [[TMP17]], 0
-; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, i8* undef, i64 7
-; CHECK-NEXT: [[TMP20:%.*]] = load i8, i8* [[TMP19]], align 1
-; CHECK-NEXT: [[TMP21:%.*]] = zext i8 [[TMP20]] to i32
-; CHECK-NEXT: [[TMP22:%.*]] = sub nsw i32 0, [[TMP21]]
-; CHECK-NEXT: [[TMP23:%.*]] = shl nsw i32 [[TMP22]], 0
-; CHECK-NEXT: [[TMP24:%.*]] = add nsw i32 [[TMP23]], 0
-; CHECK-NEXT: [[TMP25:%.*]] = add nsw i32 [[TMP12]], [[TMP6]]
-; CHECK-NEXT: [[TMP26:%.*]] = sub nsw i32 [[TMP6]], [[TMP12]]
-; CHECK-NEXT: [[TMP27:%.*]] = add nsw i32 [[TMP24]], [[TMP18]]
-; CHECK-NEXT: [[TMP28:%.*]] = sub nsw i32 [[TMP18]], [[TMP24]]
-; CHECK-NEXT: [[TMP29:%.*]] = add nsw i32 0, [[TMP25]]
-; CHECK-NEXT: [[TMP30:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 0
-; CHECK-NEXT: store i32 [[TMP29]], i32* [[TMP30]], align 16
-; CHECK-NEXT: [[TMP31:%.*]] = sub nsw i32 0, [[TMP27]]
-; CHECK-NEXT: [[TMP32:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 2
-; CHECK-NEXT: store i32 [[TMP31]], i32* [[TMP32]], align 8
-; CHECK-NEXT: [[TMP33:%.*]] = add nsw i32 0, [[TMP26]]
-; CHECK-NEXT: [[TMP34:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 1
-; CHECK-NEXT: store i32 [[TMP33]], i32* [[TMP34]], align 4
-; CHECK-NEXT: [[TMP35:%.*]] = sub nsw i32 0, [[TMP28]]
-; CHECK-NEXT: [[TMP36:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 3
-; CHECK-NEXT: store i32 [[TMP35]], i32* [[TMP36]], align 4
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* undef, i64 5
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* undef, i64 6
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, i8* undef, i64 7
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i8* [[TMP1]] to <4 x i8>*
+; CHECK-NEXT: [[TMP6:%.*]] = load <4 x i8>, <4 x i8>* [[TMP5]], align 1
+; CHECK-NEXT: [[TMP7:%.*]] = zext <4 x i8> [[TMP6]] to <4 x i32>
+; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP7]], <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
+; CHECK-NEXT: [[TMP8:%.*]] = sub nsw <4 x i32> zeroinitializer, [[SHUFFLE]]
+; CHECK-NEXT: [[TMP9:%.*]] = shl nsw <4 x i32> [[TMP8]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = add nsw <4 x i32> [[TMP9]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x i32> [[TMP10]], i32 1
+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> poison, i32 [[TMP11]], i32 0
+; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP10]], i32 0
+; CHECK-NEXT: [[TMP14:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP13]], i32 1
+; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[TMP10]], i32 3
+; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x i32> [[TMP14]], i32 [[TMP15]], i32 2
+; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i32> [[TMP10]], i32 2
+; CHECK-NEXT: [[TMP18:%.*]] = insertelement <4 x i32> [[TMP16]], i32 [[TMP17]], i32 3
+; CHECK-NEXT: [[TMP19:%.*]] = add nsw <4 x i32> [[TMP10]], [[TMP18]]
+; CHECK-NEXT: [[TMP20:%.*]] = sub nsw <4 x i32> [[TMP10]], [[TMP18]]
+; CHECK-NEXT: [[TMP21:%.*]] = shufflevector <4 x i32> [[TMP19]], <4 x i32> [[TMP20]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
+; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 0
+; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 2
+; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 1
+; CHECK-NEXT: [[TMP25:%.*]] = add nsw <4 x i32> zeroinitializer, [[TMP21]]
+; CHECK-NEXT: [[TMP26:%.*]] = sub nsw <4 x i32> zeroinitializer, [[TMP21]]
+; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <4 x i32> [[TMP25]], <4 x i32> [[TMP26]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>
+; CHECK-NEXT: [[TMP28:%.*]] = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 1, i64 3
+; CHECK-NEXT: [[TMP29:%.*]] = bitcast i32* [[TMP22]] to <4 x i32>*
+; CHECK-NEXT: store <4 x i32> [[TMP27]], <4 x i32>* [[TMP29]], align 16
 ; CHECK-NEXT: ret void
 ;
   %1 = getelementptr inbounds i8, i8* undef, i64 4
   %2 = load i8, i8* %1, align 1
   %3 = zext i8 %2 to i32
   %4 = sub nsw i32 0, %3
   %5 = shl nsw i32 %4, 0
   %6 = add nsw i32 %5, 0

(remaining 36 lines not shown)