This is an archive of the discontinued LLVM Phabricator instance.

Differential D82444

[SLP] Make sure instructions are ordered when computing spill cost.
ClosedPublic

Authored by fhahn on Jun 24 2020, 2:37 AM.

Details

Reviewers

craig.topper
RKSimon
xbolva00
ABataev
spatel

Commits

rG0b774acf1189: [SLP] Make sure instructions are ordered when computing spill cost.
rGeb46137daa92: [SLP] Make sure instructions are ordered when computing spill cost.

Summary

The entries in VectorizableTree are not necessarily ordered by their
position in basic blocks. Collect them and order them by dominance so
later instructions are guaranteed to be visited first. For instructions
in different basic blocks, we only scan to the beginning of the block,
so their order does not matter, as long as all instructions in a basic
block are grouped together. Using dominance ensures a deterministic order.
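As a rough illustration of the ordering described above, the following standalone sketch sorts a few toy instructions so that later ones come first. It does not use LLVM: `ToyInst`, the toy `dominates()` and `orderLaterFirst()` are hypothetical names, and the CFG is assumed to be a straight-line chain of blocks so that dominance is easy to compute. The predicate is written as `dominates(B, A)`, the form suggested later in this review.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an LLVM instruction: a block index plus a
// position inside that block.
struct ToyInst {
  std::string Name;
  int Block; // blocks are assumed to form a chain: 0 -> 1 -> 2 ...
  int Pos;   // position within the block
};

// In a straight-line chain, A dominates B if A's block comes first, or if
// both share a block and A appears earlier. An instruction does not
// dominate itself in this model.
bool dominates(const ToyInst &A, const ToyInst &B) {
  if (A.Block != B.Block)
    return A.Block < B.Block;
  return A.Pos < B.Pos;
}

// Returns the instruction names ordered so that later instructions come
// first and instructions of one block stay grouped together.
std::vector<std::string> orderLaterFirst(std::vector<ToyInst> Insts) {
  std::stable_sort(Insts.begin(), Insts.end(),
                   [](const ToyInst &A, const ToyInst &B) {
                     return dominates(B, A);
                   });
  std::vector<std::string> Names;
  for (const ToyInst &I : Insts)
    Names.push_back(I.Name);
  return Names;
}
```

Calling `orderLaterFirst` on `load` (block 0, pos 1), `store` (block 1, pos 3), `sub` (block 1, pos 2) and `zext` (block 0, pos 5) yields `store, sub, zext, load`.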

The modified test case contains an example where we compute a wrong
spill cost (2) without this patch, even though there is no call between
any instruction in the bundle.
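To see why the visiting order matters, here is a toy model of a backward scan that counts calls between two instructions of one block. It is not LLVM's getSpillCost(): `makeToyBlock()` and `countCallsScanned()` are hypothetical names, the block is just a vector of "is this a call?" flags, and the wrap-around to the end of the block when the scan runs off the front is an assumption of this model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy block, one flag per instruction (true = call):
//   0: call @get_ptr   1: load   2: load   3: zext
//   4: sub             5: store  6: call @cond   7: br
std::vector<bool> makeToyBlock() {
  return {true, false, false, false, false, false, true, false};
}

// Walks backward from Prev until it reaches the slot just before Inst and
// counts the calls it passes (Prev itself is never counted). If the walk
// falls off the front of the block, this model restarts at the block's end.
int countCallsScanned(const std::vector<bool> &IsCall, int Prev, int Inst) {
  const int N = static_cast<int>(IsCall.size());
  assert(Prev >= 0 && Prev < N && Inst >= 0 && Inst < N);
  const int Stop = Inst - 1; // -1 means "before the first instruction"
  int Calls = 0;
  int Cur = Prev;
  while (Cur != Stop) {
    if (Cur == -1) { // ran off the front: wrap to the end of the block
      Cur = N - 1;
      continue;
    }
    if (IsCall[static_cast<std::size_t>(Cur)] && Cur != Prev)
      ++Calls;
    --Cur;
  }
  return Calls;
}
```

With the block from `makeToyBlock()`, visiting the store (index 5) and then the sub (index 4) passes no calls, while visiting the load (index 1) and then the store (index 5) walks past both calls even though no call sits between the two instructions.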

This seems to have limited practical impact: e.g., on X86 with a recent
Intel Xeon CPU with -O3 -march=native -flto, there are no binary changes
on MultiSource, SPEC2000, or SPEC2006.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Jun 24 2020, 2:37 AM

Herald added a project: Restricted Project. Jun 24 2020, 2:37 AM

Herald added subscribers: dexonsmith, hiraditya, qcolombet.

Harbormaster failed remote builds in B61516: Diff 272943! Jun 24 2020, 2:40 AM

ABataev added inline comments. Jun 24 2020, 5:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3769–3778	Can we use a `set` instead of the vector with sort?

dexonsmith removed a subscriber: dexonsmith. Jun 24 2020, 3:21 PM

fhahn marked an inline comment as done. Jun 25 2020, 5:52 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3769–3778	We could use `set` with dominates() as comparator I think. I am not sure if there would be a benefit of doing so for the use locally in the function, but maybe we could use it to keep VectorizableTree in order to start with? Not entirely sure what impact that would have at other places, but it might be worth exploring as follow-up?
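A small standalone sketch of the `set` idea discussed above, assuming a toy dominance relation rather than LLVM's DominatorTree. `ToyInst`, `LaterFirst` and `insertInOrder()` are hypothetical names. In this model only instructions in the same block are comparable, which also shows one thing to watch for: `std::set` treats two elements as equivalent when neither compares before the other, so instructions from unrelated blocks would collapse into one entry, and such a comparator does not meet the ordering requirements `std::set` expects in general.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an LLVM instruction.
struct ToyInst {
  std::string Name;
  int Block;
  int Pos;
};

// Toy dominance: only instructions in the same block are comparable.
bool dominates(const ToyInst &A, const ToyInst &B) {
  return A.Block == B.Block && A.Pos < B.Pos;
}

// Comparator that keeps later instructions first.
struct LaterFirst {
  bool operator()(const ToyInst &A, const ToyInst &B) const {
    return dominates(B, A);
  }
};

// Inserts the instructions one by one and returns the names in set order.
std::vector<std::string> insertInOrder(const std::vector<ToyInst> &Insts) {
  std::set<ToyInst, LaterFirst> Ordered;
  for (const ToyInst &I : Insts)
    Ordered.insert(I);
  std::vector<std::string> Names;
  for (const ToyInst &I : Ordered)
    Names.push_back(I.Name);
  return Names;
}
```

For three instructions in one block (`load` at 1, `store` at 5, `sub` at 3) the set iterates as `store, sub, load`. For two instructions in different blocks of this toy model, only one of them ends up in the set.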

ping

This seems to have limited practical impact: e.g., on X86 with a recent
Intel Xeon CPU with -O3 -march=native -flto, there are no binary changes
on MultiSource, SPEC2000, or SPEC2006.

That's not particularly surprising, as the spill cost is always zero on X86. AArch64 is the only architecture that defines getCostOfKeepingLiveOverCall().

I had a different patch for this issue in D64523, but your solution to the problem is much simpler :)

This revision is now accepted and ready to land. Jul 3 2020, 7:05 AM

Closed by commit rGeb46137daa92: [SLP] Make sure instructions are ordered when computing spill cost. (authored by fhahn). Jul 3 2020, 9:40 AM

This revision was automatically updated to reflect the committed changes.

fhahn mentioned this in rG039145c72b81: [SLP] Precommit test for which spill cost is computed incorrectly..

vdmitrie added a subscriber: vdmitrie. Jul 14 2020, 9:36 AM

vdmitrie added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3777	There is a problem with this predicate function. The MSFT STL implementation of stable_sort asserts that if the predicate returns true, it must return false when the operands are swapped. But (A dom B) == false does not necessarily mean that (B dom A) == true. Instead, (A dom B) == true means that (B dom A) == false. Rewriting the predicate like this solves the issue: return DT->dominates(B, A);
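The asymmetry check described in this comment can be reproduced with a small standalone sketch. It assumes a toy dominance relation in which only instructions of the same block are comparable. `ToyInst`, `patchPredicate()`, `suggestedPredicate()` and `breaksAsymmetry()` are hypothetical names, and nothing here calls into LLVM or the MSVC STL.

```cpp
#include <cassert>

// Hypothetical stand-in for an LLVM instruction.
struct ToyInst {
  int Block;
  int Pos;
};

// Toy dominance: only instructions in the same block are comparable, and
// an instruction does not dominate itself.
bool dominates(const ToyInst &A, const ToyInst &B) {
  return A.Block == B.Block && A.Pos < B.Pos;
}

// Predicate as written in the reviewed diff.
bool patchPredicate(const ToyInst &A, const ToyInst &B) {
  return !dominates(A, B);
}

// Predicate as suggested in the comment above.
bool suggestedPredicate(const ToyInst &A, const ToyInst &B) {
  return dominates(B, A);
}

using Pred = bool (*)(const ToyInst &, const ToyInst &);

// True if the predicate claims both "A before B" and "B before A", which a
// sort predicate must never do.
bool breaksAsymmetry(Pred P, const ToyInst &A, const ToyInst &B) {
  return P(A, B) && P(B, A);
}
```

For two instructions in different blocks of this model, neither dominates the other, so `patchPredicate` returns true in both directions while `suggestedPredicate` returns false in both. The same happens when `patchPredicate` compares an instruction with itself.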

nikic mentioned this in D64523: [SLPVectorizer] Fix getSpillCost() calculation. Aug 8 2020, 5:47 AM

xbolva00 added inline comments. Aug 8 2020, 6:06 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3777	@fhahn try to fix it and reland. Seems like this is a reason why a win buildbot failed.

xbolva00 added a commit: rG0b774acf1189: [SLP] Make sure instructions are ordered when computing spill cost. Aug 11 2020, 2:18 AM

Seems like ClamAV now has a compile-time regression of ~0.1-0.2% (http://llvm-compile-time-tracker.com/compare.php?from=3ce57e012110519c1d3a49fc98959a64634d5d8f&to=36e1fc5f68e918ba69ccd9033b38240265617c4e&stat=instructions&details=on):

CMakeFiles/clamscan.dir/libclamav_mspack.c.o           6432M  6472M  (+0.62%)
CMakeFiles/clamscan.dir/libclamav_nsis_LZMADecode.c.o   730M   738M  (+1.09%)
CMakeFiles/clamscan.dir/libclamav_upx.c.o              1213M  1272M  (+4.87%)

cc @nikic

dtemirbulatov added a subscriber: dtemirbulatov. Aug 11 2020, 7:05 AM

Revision Contents

Path                                                            Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp                 13 lines
llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-order.ll   23 lines

Diff 275417

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

 int BoUpSLP::getSpillCost() const {
   // query TTI to see if there is a cost to keeping values live over it
   // (for example, if spills and fills are required).
   unsigned BundleWidth = VectorizableTree.front()->Scalars.size();
   int Cost = 0;
 
   SmallPtrSet<Instruction*, 4> LiveValues;
   Instruction *PrevInst = nullptr;
 
+  // The entries in VectorizableTree are not necessarily ordered by their
+  // position in basic blocks. Collect them and order them by dominance so later
+  // instructions are guaranteed to be visited first. For instructions in
+  // different basic blocks, we only scan to the beginning of the block, so
+  // their order does not matter, as long as all instructions in a basic block
+  // are grouped together. Using dominance ensures a deterministic order.
+  SmallVector<Instruction *, 16> OrderedScalars;
   for (const auto &TEPtr : VectorizableTree) {
     Instruction *Inst = dyn_cast<Instruction>(TEPtr->Scalars[0]);
     if (!Inst)
       continue;
+    OrderedScalars.push_back(Inst);
+  }
+  llvm::stable_sort(OrderedScalars, [this](Instruction *A, Instruction *B) {
+    return !DT->dominates(A, B);
+  });
+
+  for (Instruction *Inst : OrderedScalars) {
     if (!PrevInst) {
       PrevInst = Inst;
       continue;
     }
 
     // Update LiveValues.
     LiveValues.erase(PrevInst);
     for (auto &J : PrevInst->operands()) {

llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-order.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -slp-vectorizer -S %s | FileCheck %s
 
 target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
 target triple = "arm64-apple-ios13.0.0"
 
 declare i1 @cond()
 declare i32* @get_ptr()
 
 define void @test(i64* %ptr, i64* noalias %res) {
 ; CHECK-LABEL: @test(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: br label [[FOR_BODY:%.*]]
 ; CHECK: for.body:
 ; CHECK-NEXT: [[CALL_I_I:%.*]] = call i32* @get_ptr()
-; CHECK-NEXT: [[L_0_0:%.*]] = load i32, i32* [[CALL_I_I]], align 2
 ; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr i32, i32* [[CALL_I_I]], i32 2
-; CHECK-NEXT: [[L_1_0:%.*]] = load i32, i32* [[GEP_1]], align 2
-; CHECK-NEXT: [[EXT_0_0:%.*]] = zext i32 [[L_0_0]] to i64
-; CHECK-NEXT: [[EXT_1_0:%.*]] = zext i32 [[L_1_0]] to i64
-; CHECK-NEXT: [[SUB_1:%.*]] = sub nsw i64 [[EXT_0_0]], [[EXT_1_0]]
 ; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr i32, i32* [[CALL_I_I]], i32 1
-; CHECK-NEXT: [[L_0_1:%.*]] = load i32, i32* [[GEP_2]], align 2
+; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[CALL_I_I]] to <2 x i32>*
+; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 2
 ; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr i32, i32* [[CALL_I_I]], i32 3
-; CHECK-NEXT: [[L_1_1:%.*]] = load i32, i32* [[GEP_3]], align 2
-; CHECK-NEXT: [[EXT_0_1:%.*]] = zext i32 [[L_0_1]] to i64
-; CHECK-NEXT: [[EXT_1_1:%.*]] = zext i32 [[L_1_1]] to i64
-; CHECK-NEXT: [[SUB_2:%.*]] = sub nsw i64 [[EXT_0_1]], [[EXT_1_1]]
-; CHECK-NEXT: store i64 [[SUB_1]], i64* [[RES:%.*]], align 8
-; CHECK-NEXT: [[RES_1:%.*]] = getelementptr i64, i64* [[RES]], i64 1
-; CHECK-NEXT: store i64 [[SUB_2]], i64* [[RES_1]], align 8
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[GEP_1]] to <2 x i32>*
+; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], align 2
+; CHECK-NEXT: [[TMP4:%.*]] = zext <2 x i32> [[TMP1]] to <2 x i64>
+; CHECK-NEXT: [[TMP5:%.*]] = zext <2 x i32> [[TMP3]] to <2 x i64>
+; CHECK-NEXT: [[TMP6:%.*]] = sub nsw <2 x i64> [[TMP4]], [[TMP5]]
+; CHECK-NEXT: [[RES_1:%.*]] = getelementptr i64, i64* [[RES:%.*]], i64 1
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast i64* [[RES]] to <2 x i64>*
+; CHECK-NEXT: store <2 x i64> [[TMP6]], <2 x i64>* [[TMP7]], align 8
 ; CHECK-NEXT: [[C:%.*]] = call i1 @cond()
 ; CHECK-NEXT: br i1 [[C]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK: exit:
 ; CHECK-NEXT: ret void
 ;
 entry:
 br label %for.body
