This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/lib/Transforms/Vectorize/
-
lib/
-
Transforms/
-
Vectorize/
-
SLPVectorizer.cpp

Differential D64523

[SLPVectorizer] Fix getSpillCost() calculation
AbandonedPublic

Authored by nikic on Jul 10 2019, 11:28 AM.

Download Raw Diff

Details

Reviewers

RKSimon
jmolloy
ABataev
vporpo

Summary

Original motivation here is a performance problem encountered with the following IR, for which getSpillCost() takes an unreasonable amount of time: https://gist.github.com/nikic/a5bbc62632d68f76dd8fa9a4e81d9594

However, it turns out that the source of the perf problem is a correctness problem as well: The current implementation works by going through the tree nodes, remembering the previous node we looked at and inspecting all instructions between the previous and current node. Of course, because this is a tree and not just a chain, the "previous" node in the tree vector may occur after the current one. In this case the implementation ends up scanning through the whole block (even going through the effort of wrapping around from the start of the block to the end and continuing from there). If this happens often enough, this is not just very slow, but also results in gigantic spill costs (I saw costs >500000 for the sample code).

I'm changing the implementation to compute the cost in a single pass from the root node to the start of the basic block, keeping track of which values are currently live along the way. This resolves both the performance and the correctness issues.

The downside here is that I'm stopping at the edge of the basic block. I'm not sure how I would extend this to work across multiple basic blocks and not sure if that would even be worthwhile in the context of what the SLP vectorizer does.

(Side note: Intrinsic calls probably shouldn't count towards the spill cost. In this example the "calls" are all fshr/fshl, which are certainly not going to cause any spilling...)

Diff Detail

Event Timeline

nikic created this revision.Jul 10 2019, 11:28 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2019, 11:28 AM

Herald added subscribers: llvm-commits, hiraditya. · View Herald Transcript

You should be able to at least test the now-reasonable runtime with some huge-ish IR?

RKSimon added reviewers: ABataev, vporpo.Jul 12 2019, 12:06 AM

I guess it would be better if it visited all the BBs. You could traverse the VectorizableTree once and collect the BBs to be visited before you walk through them. Btw the TreeEntries now have operands, which can help you collect the BBs in the right order.
But I am not sure this is worth the trouble, because it looks to me that the spill cost estimation function is not very accurate in the first place. It only considers the spill cost over calls, and while doing so it is not comparing it against the spill cost of the scalar code, which if I am not mistaken could be equal to that of the vector code in many cases. Someone who knows more about the assumptions made here should probably comment on it.

Another point is that if we need to avoid this extra traversal, we could perhaps implement this within the scheduler. It already walks through all the blocks while building the dependence graph. For example we could update the LiveValues in tryScheduleBundle() and compute the cost over calls in some function like extendSchedulingRegion() ?

In D64523#1583624, @vporpo wrote:

I guess it would be better if it visited all the BBs. You could traverse the VectorizableTree once and collect the BBs to be visited before you walk through them. Btw the TreeEntries now have operands, which can help you collect the BBs in the right order.

I'm not familiar with the SLP vectorizer, is there a guarantee that the BBs form a chain of dominating BBs? I.e. is there always a single BB we can go to when the edge of the current one is reached?

But I am not sure this is worth the trouble, because it looks to me that the spill cost estimation function is not very accurate in the first place. It only considers the spill cost over calls, and while doing so it is not comparing it against the spill cost of the scalar code, which if I am not mistaken could be equal to that of the vector code in many cases. Someone who knows more about the assumptions made here should probably comment on it.

The spill cost calculation is only used on AArch64. I believe the intention is mainly to avoid some "trivial" vectorizations that would be detrimental because AArch64 has scalar, but not vector, callee-save registers.

The spill cost estimation is pretty inaccurate for multiple reasons. One is the implementation issue I try to address here, another is the fact that intrinsics are assumed to also cause spilling, another is that each call is considered independently, even though spill costs will likely be amortized across multiple calls. And, as you say, spill costs for the scalar code are not modeled.

No, there is no single BB. There can be more than one operand BBs that will have to be walked. For example, a TreeEntry in BB0 could have all its left operands in BB1 and all its right operands in BB2. In such case we would need to walk through both of them and perhaps keep the maximum cost? But anyway, given the lack of accuracy in the spill cost estimation I am not sure this is worth the trouble.

Rebase

No, there is no single BB. There can be more than one operand BBs that will have to be walked. For example, a TreeEntry in BB0 could have all its left operands in BB1 and all its right operands in BB2. In such case we would need to walk through both of them and perhaps keep the maximum cost? But anyway, given the lack of accuracy in the spill cost estimation I am not sure this is worth the trouble.

Hm, the approach used here doesn't really extend well to multiple BBs, or at least it's not obvious to me how I would manage the LiveValues in that case.

Not sure how to move forward here. The goal of this patch was mainly to make the calculation "not wrong" (in the sense of not getting completely astronomical results), but I don't think I know enough about the SLP vectorizer to actually make it "right".

nikic mentioned this in D82444: [SLP] Make sure instructions are ordered when computing spill cost..Jul 3 2020, 5:07 AM

This has since been addressed by D82444.

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Vectorize/

SLPVectorizer.cpp

72 lines

Diff 233970

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 3,641 Lines • ▼ Show 20 Lines	bool BoUpSLP::isTreeTinyAndNotFullyVectorizable() const {
return true;		return true;
}		}

int BoUpSLP::getSpillCost() const {		int BoUpSLP::getSpillCost() const {
// Walk from the bottom of the tree to the top, tracking which values are		// Walk from the bottom of the tree to the top, tracking which values are
// live. When we see a call instruction that is not part of our tree,		// live. When we see a call instruction that is not part of our tree,
// query TTI to see if there is a cost to keeping values live over it		// query TTI to see if there is a cost to keeping values live over it
// (for example, if spills and fills are required).		// (for example, if spills and fills are required).
unsigned BundleWidth = VectorizableTree.front()->Scalars.size();		TreeEntry Root = &VectorizableTree.front();
		unsigned BundleWidth = Root->Scalars.size();
int Cost = 0;		int Cost = 0;

SmallPtrSet<Instruction*, 4> LiveValues;		Instruction *RootInst = cast<Instruction>(Root->Scalars[0]);
Instruction *PrevInst = nullptr;		BasicBlock *BB = RootInst->getParent();

for (const auto &TEPtr : VectorizableTree) {		SmallPtrSet<Instruction*, 8> LiveValues;
Instruction *Inst = dyn_cast<Instruction>(TEPtr->Scalars[0]);		LiveValues.insert(RootInst);
if (!Inst)
continue;

if (!PrevInst) {		// We only check the basic block that contains the root instruction.
PrevInst = Inst;		BasicBlock::reverse_iterator It = RootInst->getIterator().getReverse();
continue;		for (; It != BB->rend() && !LiveValues.empty(); ++It) {
}		Instruction I = &It;

// Update LiveValues.		if (LiveValues.erase(I)) {
LiveValues.erase(PrevInst);		for (Value *Op : I->operands())
for (auto &J : PrevInst->operands()) {		if (isa<Instruction>(Op) && getTreeEntry(Op))
if (isa<Instruction>(&J) && getTreeEntry(&J))		LiveValues.insert(cast<Instruction>(Op));
LiveValues.insert(cast<Instruction>(&*J));		continue;
}		}

		// Debug informations doesn't impact spill cost.
		if (isa<CallInst>(I) && !isa<DbgInfoIntrinsic>(I)) {
LLVM_DEBUG({		LLVM_DEBUG({
dbgs() << "SLP: #LV: " << LiveValues.size();		dbgs() << "SLP: #LV: " << LiveValues.size();
for (auto *X : LiveValues)		for (auto *X : LiveValues)
dbgs() << " " << X->getName();		dbgs() << " " << X->getName();
dbgs() << ", Looking at ";		dbgs() << ", Looking at ";
Inst->dump();		I->dump();
});		});

// Now find the sequence of instructions between PrevInst and Inst.
unsigned NumCalls = 0;
BasicBlock::reverse_iterator InstIt = ++Inst->getIterator().getReverse(),
PrevInstIt =
PrevInst->getIterator().getReverse();
while (InstIt != PrevInstIt) {
if (PrevInstIt == PrevInst->getParent()->rend()) {
PrevInstIt = Inst->getParent()->rbegin();
continue;
}

// Debug information does not impact spill cost.
if ((isa<CallInst>(&*PrevInstIt) &&
!isa<DbgInfoIntrinsic>(&*PrevInstIt)) &&
&*PrevInstIt != PrevInst)
NumCalls++;

++PrevInstIt;
}

if (NumCalls) {
SmallVector<Type*, 4> V;		SmallVector<Type*, 4> V;
for (auto *II : LiveValues)		for (auto *II : LiveValues)
V.push_back(VectorType::get(II->getType(), BundleWidth));		V.push_back(VectorType::get(II->getType(), BundleWidth));
Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(V);		Cost += TTI->getCostOfKeepingLiveOverCall(V);
}		}

PrevInst = Inst;
}		}

return Cost;		return Cost;
}		}

int BoUpSLP::getTreeCost() {		int BoUpSLP::getTreeCost() {
int Cost = 0;		int Cost = 0;
LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "		LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "
▲ Show 20 Lines • Show All 3,752 Lines • Show Last 20 Lines