Details

Reviewers

Commits

rGe4e5923ef17c: [SLP] Improve comments and naming of functions/variables/members, NFC.
rG2c08fde9e570: [SLP] Improve comments and naming of functions/variables/members, NFC.
rL304616: [SLP] Improve comments and naming of functions/variables/members, NFC.
rL304593: [SLP] Improve comments and naming of functions/variables/members, NFC.

Summary

Fixed some comments, added an additional description of the algorithms,
improved readability of the code.

Diff Detail

Repository: rL LLVM

Event Timeline

ABataev created this revision. May 18 2017, 9:33 AM

Harbormaster completed remote builds in B6546: Diff 99445. May 18 2017, 9:33 AM

Herald added a subscriber: mzolotukhin. May 18 2017, 9:33 AM

Thanks for improving this!

lib/Transforms/Vectorize/SLPVectorizer.cpp
4772 ↗	(On Diff #99445)	First word needs to be a verb, how about analyzingInstruction.
4774 ↗	(On Diff #99445)	setAnalyzingOperands
4776 ↗	(On Diff #99445)	areOperandsAnalyzed
4785 ↗	(On Diff #99445)	getNextOperand
4820–4827 ↗	(On Diff #99445)	Please mention the actual traversal used. It seems this is preorder. If that is true, why do you even need to track the operand index? Iterative pre-order is:
    while (!Stack.empty()) {
      Value V = Stack.back();
      Stack.pop_back();
      visit(V);
      for (auto O = V->rbegin(); O != V->rend(); ++O)
        Stack.push_back(*O);
    }

Address comments from Adam.

Harbormaster completed remote builds in B6722: Diff 100082. May 24 2017, 7:13 AM

anemet added inline comments. May 24 2017, 9:53 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	Again, which order?

ABataev added inline comments. May 24 2017, 10:11 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	DFS

anemet added inline comments. May 24 2017, 10:12 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	pre/in/post?

ABataev added inline comments. May 24 2017, 10:15 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	pre

anemet added inline comments. May 24 2017, 10:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	Okay, then please update the comment. Also please answer my question below why we chose to track the edges with an iterative pre-order traversal. If it's unnecessary, please fix or add a FIXME. Thank you.

ABataev added inline comments. May 24 2017, 11:01 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	It simplifies limiting the tree traversal level of the function. We can use simple pre-order traversal, but still need to keep the tree node level within the Stack, for example, to limit the recursion level.

anemet added inline comments. May 25 2017, 2:26 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	I am not sure I see a major difference, do you have an example?
4861 ↗	(On Diff #100082)	It would be good to add a comment why we need to clear P beyond the first level.

ABataev added inline comments. May 26 2017, 6:37 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	I'm just saying that there is no major difference. We won't gain anything by using pre-order traversal compared with DFS pre-order traversal. We will still need to track the current tree depth so that we can cut off the vectorization process once the tree depth exceeds `RecursionMaxDepth` (see line 4879).
4861 ↗	(On Diff #100082)	Ok

Added a comment regarding assigning nullptr value to P

anemet added inline comments. May 26 2017, 10:44 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	"I'm just saying that there is no major difference. We won't gain anything by using pre-order traversal compared with DFS pre-order traversal." Now I am confused. Pre-order traversal is DFS pre-order traversal.

ABataev added inline comments. May 26 2017, 10:53 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	I'm sorry, I meant you're suggesting to change traversal to BFS.

anemet added inline comments. May 26 2017, 11:00 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	No it's still preorder. My point was that for preorder unlike postorder you don't need to keep track of level of processing that has been performed on a node (i.e. getOperandIndex()). See the code one comment down.

ABataev added inline comments. May 26 2017, 12:03 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	Yes, I understand this. But how would you track the tree depth in your code? We need to limit the depth by `RecursionMaxDepth`.

anemet added inline comments. May 31 2017, 8:35 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	The depth could be stored with each stack element -- that would still be simpler than keeping track of the operand. It would also properly describe the intent; the depth is only maintained to control the recursion limit.
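The suggestion above can be sketched outside LLVM as follows. This is a minimal illustration, not the code under review: `Node` and `preorderWithDepthLimit` are hypothetical names, operands are plain indices into a vector, and operands are pushed in reverse so that the leftmost operand is visited first (an assumption of this sketch only).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an instruction: a node that lists its operands
// as indices into a node table.
struct Node {
  std::vector<std::size_t> Operands;
};

// Iterative pre-order DFS. Each stack entry carries the depth of its node,
// so no per-node operand index has to be tracked. Returns the visit order.
std::vector<std::size_t> preorderWithDepthLimit(const std::vector<Node> &Nodes,
                                                std::size_t Root,
                                                unsigned MaxDepth) {
  std::vector<std::size_t> Order;
  std::vector<std::pair<std::size_t, unsigned>> Stack{{Root, 0}};
  while (!Stack.empty()) {
    std::size_t N = Stack.back().first;
    unsigned Level = Stack.back().second;
    Stack.pop_back();
    Order.push_back(N); // "Visit" the node before its operands (pre-order).
    // Descend only while the operands' depth stays below the limit.
    if (++Level < MaxDepth)
      for (auto It = Nodes[N].Operands.rbegin(), E = Nodes[N].Operands.rend();
           It != E; ++It)
        Stack.emplace_back(*It, Level);
  }
  return Order;
}
```

The depth is used only to decide whether operands are pushed, which matches the stated intent that it exists purely to enforce the recursion limit.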

ABataev added inline comments. Jun 1 2017, 6:07 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	I believe this is a matter of taste. Besides, the same approach is used in `matchAssociativeReduction` function and I believe it is better to use the same approach in all functions, instead of using the different algorithms for implementation of the same idea.

anemet added inline comments. Jun 1 2017, 11:45 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗	(On Diff #100082)	But it is not the same idea. matchAssociativeReduction performs post-order traversal. So it's a wrong example to support your case! I did see that you were copying the code from there (already a bad idea). I think that both of these should be rewritten with df_iterator and po_iterator respectively. Let's just add a FIXME in the code and move on: "This performs pre-order traversal. FIXME: can we do this with df_iterator?"

Updates after comments.

Thanks very much for rewriting the loop. This is way more intuitive now.

A few more nits/questions below but feel free to commit either way.

lib/Transforms/Vectorize/SLPVectorizer.cpp
4783 ↗	(On Diff #101222)	Capitalize sentence.
4784 ↗	(On Diff #101222)	I would start the level from 0, that's more canonical and then later would check like this: if (++Level < MaxDepth)
4801–4802 ↗	(On Diff #101222)	Rather than saying "on the next iteration", isn't it the desire that we don't analyze the phi node unless this is the root node? If yes it's better to say so, i.e. "unless this is the root node".

This revision is now accepted and ready to land. Jun 2 2017, 9:48 AM

Update after review

Harbormaster completed remote builds in B6954: Diff 101265. Jun 2 2017, 12:49 PM

LGTM.

lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗	(On Diff #101265)	Something got messed up with upper/lowercase here.

ABataev added inline comments. Jun 2 2017, 1:08 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗	(On Diff #101265)	I tried to keep the original names of the variables. Should I capitalize them too?

anemet added inline comments. Jun 2 2017, 1:15 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗	(On Diff #101265)	Oh sorry, no, I only meant to capitalize the first letter of the sentence. Looking at it again, I think I misread this and thought that "subtrees" was the beginning of a sentence. It's not, so just go back to the original version. Sorry about the confusion!

ABataev added inline comments. Jun 2 2017, 1:20 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗	(On Diff #101265)	Ok, no problems. Then I'll restore it back before commit

Closed by commit rL304593: [SLP] Improve comments and naming of functions/variables/members, NFC. (authored by ABataev). Jun 2 2017, 1:39 PM

This revision was automatically updated to reflect the committed changes.

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).
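The bound mentioned above can be checked with a small simulation. This is a simplified model, not the vectorizer itself: it assumes every node has exactly two non-phi operands, ignores the visited set and any sharing of operands, and uses the hypothetical names `TraversalStats` and `simulateFullBinaryTree`. Only the depth of each node is stored, as in the `++Level < RecursionMaxDepth` check in the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counters collected while walking the implicit tree.
struct TraversalStats {
  std::size_t Visited = 0;  // Number of nodes popped and processed.
  std::size_t MaxStack = 0; // Largest worklist size seen at the loop head.
};

// Simulates the worklist loop on an implicit full binary tree: every node
// has two operands, and operands are pushed only while ++Level < MaxDepth.
TraversalStats simulateFullBinaryTree(unsigned MaxDepth) {
  TraversalStats S;
  std::vector<unsigned> Stack{0}; // Each entry is just the node's depth.
  while (!Stack.empty()) {
    S.MaxStack = std::max(S.MaxStack, Stack.size());
    unsigned Level = Stack.back();
    Stack.pop_back();
    ++S.Visited;
    if (++Level < MaxDepth) {
      Stack.push_back(Level); // First operand.
      Stack.push_back(Level); // Second operand.
    }
  }
  return S;
}
```

Under these assumptions a depth limit of 12 yields 2^12 - 1 = 4095 processed nodes, while the stack itself never holds more than 12 entries, so in this model the cost comes from the number of nodes processed rather than from the stack size.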

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
4828	Could you please explain how this is NFC wrt the previous code? In the code on the LHS, we are checking that the operand is in the same basic block as the root before placing it on the stack. Here we are unconditionally placing all operands on the stack.

In D33320#795834, @anna wrote:

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).

It does not limit the number of processed nodes; it limits the tree height, just like it did before.

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
4828	Yes, missed these checks, will add them

In D33320#796679, @ABataev wrote:

In D33320#795834, @anna wrote:

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).

It does not limit the number of processed nodes; it limits the tree height, just like it did before.

Yes, but limiting the tree height itself is not enough right? Now, in the worst case, 2^12 nodes being processed in the tryToVectorizeHorReductionOrInstOperands, when earlier it was just a single node (i.e. before this change: https://reviews.llvm.org/D25517).

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
4828	Yeah, we saw huge compile time degradations in tryToVectorizeHorReductionOrInstOperands.

In D33320#796697, @anna wrote:

In D33320#796679, @ABataev wrote:

In D33320#795834, @anna wrote:

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).

It does not limit the number of processed nodes; it limits the tree height, just like it did before.

Yes, but limiting the tree height itself is not enough right? Now, in the worst case, 2^12 nodes being processed in the tryToVectorizeHorReductionOrInstOperands, when earlier it was just a single node (i.e. before this change: https://reviews.llvm.org/D25517).

@anna, I am confused whether you're complaining about the additional overhead in the original change (D25517) or the algorithmic change in this refinement (D33320). Your comparison above seems to suggest your baseline is *before* the original changes.

In this refinement change, we can have more nodes on the stack compared to the original change but the number of nodes processed should remain unchanged.

@anemet: We saw the degradations with respect to this change (D33320) itself, not before the original change. Here's the code change as I see it. Please correct if this is wrong:

1. We were initially processing exactly one node in the function equivalent to tryToVectorizeHorReductionOrInstOperands, before any of the changes below.
2. After D25517, we are processing at most 2 ^ 12 nodes in tryToVectorizeHorReductionOrInstOperands, but they are limited to a single basic block.
3. In this refinement change (D33320), we are processing at most 2 ^ 12 nodes, but they are no longer limited to a single basic block.

@ABataev is fixing #3 by limiting it to the same basic block so that it's actually an NFC wrt #2. That might fix the compile time regressions. However, my concern is that #2 may still have a high compile time impact when we have a single large basic block with, for example, 2^10 binary instructions (because we are limiting the depth of the tree rather than the number of nodes).
Should we perhaps have a threshold cutoff for the number of stack nodes? Or reduce RecursionMaxDepth from 12 to 6?

I bailed out of this loop (return false in tryToVectorizeHorReductionOrInstOperands) when the stack size is too large, and that fixed our compile time regression temporarily.

I'm working on the patch that stops vectorization if the parent basic block of the instruction is not BB or the instruction was already processed, but it's quite hard to add a test for this change. I can publish it as NFC, because we are just limiting the number of analyzed instructions, if this is acceptable.
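The kind of filtering described here can be sketched as follows. This is only an illustration under stated assumptions, not the follow-up patch: `Inst`, its integer `Block` id, and `countAnalyzed` are hypothetical, and operands are plain indices. An operand is pushed only if it lives in the requested block and has not been queued before, mirroring the `getParent() == BB` and `VisitedInstrs.insert(...).second` checks on the left-hand side of the diff.

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical instruction model: a basic-block id plus operand indices.
struct Inst {
  int Block;
  std::vector<std::size_t> Operands;
};

// Counts how many instructions the worklist loop analyzes when operands
// outside BB and already-queued operands are skipped.
std::size_t countAnalyzed(const std::vector<Inst> &Insts, std::size_t Root,
                          int BB, unsigned MaxDepth) {
  if (Insts[Root].Block != BB)
    return 0;
  std::unordered_set<std::size_t> Visited{Root};
  std::vector<std::pair<std::size_t, unsigned>> Stack{{Root, 0}};
  std::size_t Analyzed = 0;
  while (!Stack.empty()) {
    std::size_t I = Stack.back().first;
    unsigned Level = Stack.back().second;
    Stack.pop_back();
    ++Analyzed;
    if (++Level < MaxDepth)
      for (std::size_t Op : Insts[I].Operands)
        // Queue an operand at most once, and only if it is in BB.
        if (Insts[Op].Block == BB && Visited.insert(Op).second)
          Stack.emplace_back(Op, Level);
  }
  return Analyzed;
}
```

With the visited set in place, each instruction in the block is analyzed at most once per call, so in this sketch the work is bounded by the block size instead of growing exponentially with the depth limit.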

In D33320#796821, @ABataev wrote:

I'm working on the patch that stops vectorization if the parent basic block of the instruction is not BB or the instruction was already processed, but it's quite hard to add a test for this change. I can publish it as NFC, because we are just limiting the number of analyzed instructions, if this is acceptable.

Thanks Alexey. That should fix the current regressions we are seeing. Could you please add me as a reviewer to the patch?

The second concern I have (as mentioned in above comment) is that with D25517, we are increasing the complexity of tryToVectorizeHorReductionOrInstOperands from a single node being processed to an exponential number of nodes (2^12). Is this a correct analysis?

It may not be an issue in practice once we limit to the current basic block (the fix you're working on). However, this function is called for every instruction in the IR as part of vectorizeChainsInBlock, so as the basic block size increases, we may see a compile time impact.

We just identified this commit as the cause of a 10x slowdown when compiling shared_sha256.c in the llvm test-suite/CTMark (x86, no special flags should get you the default -O3) showing up on our performance tracking.

Did any of the planned improvements make it to ToT yet?

In D33320#824506, @MatzeB wrote:

We just identified this commit as the cause of a 10x slowdown when compiling shared_sha256.c in the llvm test-suite/CTMark (x86, no special flags should get you the default -O3) showing up on our performance tracking.

Did any of the planned improvements make it to ToT yet?

Seems like this has recovered on ToT.

In D33320#824538, @MatzeB wrote:

In D33320#824506, @MatzeB wrote:

We just identified this commit as the cause of a 10x slowdown when compiling shared_sha256.c in the llvm test-suite/CTMark (x86, no special flags should get you the default -O3) showing up on our performance tracking.

Did any of the planned improvements make it to ToT yet?

Seems like this has recovered on ToT.

This was the review thread: https://reviews.llvm.org/D34881

Diff 101269

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

static Value *getReductionValue(const DominatorTree *DT, PHINode *P,
}		}

if (Rdx && DominatedReduxValue(Rdx))		if (Rdx && DominatedReduxValue(Rdx))
return Rdx;		return Rdx;

return nullptr;		return nullptr;
}		}

namespace {		/// Attempt to reduce a horizontal reduction.
/// Tracks instructons and its children.		/// If it is legal to match a horizontal reduction feeding the phi node \a P
class WeakTrackingVHWithLevel final : public CallbackVH {		/// with reduction operators \a Root (or one of its operands) in a basic block
/// Operand index of the instruction currently beeing analized.		/// \a BB, then check if it can be done. If horizontal reduction is not found
unsigned Level = 0;		/// and root instruction is a binary operation, vectorization of the operands is
/// Is this the instruction that should be vectorized, or are we now		/// attempted.
/// processing children (i.e. operands of this instruction) for potential		/// \returns true if a horizontal reduction was matched and reduced or operands
/// vectorization?		/// of one of the binary instruction were vectorized.
bool IsInitial = true;		/// \returns false if a horizontal reduction was not matched (or not possible)
		/// or no vectorization of any binary operation feeding \a Root instruction was
public:		/// performed.
explicit WeakTrackingVHWithLevel() = default;		static bool tryToVectorizeHorReductionOrInstOperands(
WeakTrackingVHWithLevel(Value *V) : CallbackVH(V){};
/// Restart children analysis each time it is repaced by the new instruction.
void allUsesReplacedWith(Value *New) override {
setValPtr(New);
Level = 0;
IsInitial = true;
}
/// Check if the instruction was not deleted during vectorization.
bool isValid() const { return !getValPtr(); }
/// Is the istruction itself must be vectorized?
bool isInitial() const { return IsInitial; }
/// Try to vectorize children.
void clearInitial() { IsInitial = false; }
/// Are all children processed already?
bool isFinal() const {
assert(getValPtr() &&
(isa<Instruction>(getValPtr()) &&
cast<Instruction>(getValPtr())->getNumOperands() >= Level));
return getValPtr() &&
cast<Instruction>(getValPtr())->getNumOperands() == Level;
}
/// Get next child operation.
Value *nextOperand() {
assert(getValPtr() && isa<Instruction>(getValPtr()) &&
cast<Instruction>(getValPtr())->getNumOperands() > Level);
return cast<Instruction>(getValPtr())->getOperand(Level++);
}
virtual ~WeakTrackingVHWithLevel() = default;
};
} // namespace

/// \brief Attempt to reduce a horizontal reduction.
/// If it is legal to match a horizontal reduction feeding
/// the phi node P with reduction operators Root in a basic block BB, then check
/// if it can be done.
/// \returns true if a horizontal reduction was matched and reduced.
/// \returns false if a horizontal reduction was not matched.
static bool canBeVectorized(
PHINode *P, Instruction *Root, BasicBlock *BB, BoUpSLP &R,		PHINode *P, Instruction *Root, BasicBlock *BB, BoUpSLP &R,
TargetTransformInfo *TTI,		TargetTransformInfo *TTI,
const function_ref<bool(BinaryOperator *, BoUpSLP &)> Vectorize) {		const function_ref<bool(BinaryOperator *, BoUpSLP &)> Vectorize) {
if (!ShouldVectorizeHor)		if (!ShouldVectorizeHor)
return false;		return false;

if (!Root)		if (!Root)
return false;		return false;

if (Root->getParent() != BB)		if (Root->getParent() != BB)
return false;		return false;
SmallVector<WeakTrackingVHWithLevel, 8> Stack(1, Root);		// Start analysis starting from Root instruction. If horizontal reduction is
		// found, try to vectorize it. If it is not a horizontal reduction or
		// vectorization is not possible or not effective, and currently analyzed
		// instruction is a binary operation, try to vectorize the operands, using
		// pre-order DFS traversal order. If the operands were not vectorized, repeat
		// the same procedure considering each operand as a possible root of the
		// horizontal reduction.
		// Interrupt the process if the Root instruction itself was vectorized or all
		// sub-trees not higher that RecursionMaxDepth were analyzed/vectorized.
		SmallVector<std::pair<WeakVH, unsigned>, 8> Stack(1, {Root, 0});
SmallSet<Value *, 8> VisitedInstrs;		SmallSet<Value *, 8> VisitedInstrs;
bool Res = false;		bool Res = false;
while (!Stack.empty()) {		while (!Stack.empty()) {
Value *V = Stack.back();		Value *V;
if (!V) {		unsigned Level;
Stack.pop_back();		std::tie(V, Level) = Stack.pop_back_val();
		if (!V)
continue;		continue;
}
auto *Inst = dyn_cast<Instruction>(V);		auto *Inst = dyn_cast<Instruction>(V);
if (!Inst || isa<PHINode>(Inst)) {		if (!Inst || isa<PHINode>(Inst))
Stack.pop_back();
continue;		continue;
}
if (Stack.back().isInitial()) {
Stack.back().clearInitial();
if (auto *BI = dyn_cast<BinaryOperator>(Inst)) {		if (auto *BI = dyn_cast<BinaryOperator>(Inst)) {
HorizontalReduction HorRdx;		HorizontalReduction HorRdx;
if (HorRdx.matchAssociativeReduction(P, BI)) {		if (HorRdx.matchAssociativeReduction(P, BI)) {
if (HorRdx.tryToReduce(R, TTI)) {		if (HorRdx.tryToReduce(R, TTI)) {
Res = true;		Res = true;
		// Set P to nullptr to avoid re-analysis of phi node in
		// matchAssociativeReduction function unless this is the root node.
P = nullptr;		P = nullptr;
continue;		continue;
}		}
}		}
if (P) {		if (P) {
Inst = dyn_cast<Instruction>(BI->getOperand(0));		Inst = dyn_cast<Instruction>(BI->getOperand(0));
if (Inst == P)		if (Inst == P)
Inst = dyn_cast<Instruction>(BI->getOperand(1));		Inst = dyn_cast<Instruction>(BI->getOperand(1));
if (!Inst) {		if (!Inst) {
		// Set P to nullptr to avoid re-analysis of phi node in
		// matchAssociativeReduction function unless this is the root node.
P = nullptr;		P = nullptr;
continue;		continue;
}		}
}		}
}		}
		// Set P to nullptr to avoid re-analysis of phi node in
		// matchAssociativeReduction function unless this is the root node.
P = nullptr;		P = nullptr;
if (Vectorize(dyn_cast<BinaryOperator>(Inst), R)) {		if (Vectorize(dyn_cast<BinaryOperator>(Inst), R)) {
Res = true;		Res = true;
continue;		continue;
}		}
}
if (Stack.back().isFinal()) {
Stack.pop_back();
continue;
}

if (auto *NextV = dyn_cast<Instruction>(Stack.back().nextOperand()))		// Try to vectorize operands.
if (NextV->getParent() == BB && VisitedInstrs.insert(NextV).second &&		if (++Level < RecursionMaxDepth)
Stack.size() < RecursionMaxDepth)		for (auto *Op : Inst->operand_values())
Stack.push_back(NextV);		Stack.emplace_back(Op, Level);
}		}
return Res;		return Res;
}		}

bool SLPVectorizerPass::vectorizeRootInstruction(PHINode *P, Value *V,		bool SLPVectorizerPass::vectorizeRootInstruction(PHINode *P, Value *V,
BasicBlock *BB, BoUpSLP &R,		BasicBlock *BB, BoUpSLP &R,
TargetTransformInfo *TTI) {		TargetTransformInfo *TTI) {
if (!V)		if (!V)
return false;		return false;
auto *I = dyn_cast<Instruction>(V);		auto *I = dyn_cast<Instruction>(V);
if (!I)		if (!I)
return false;		return false;

if (!isa<BinaryOperator>(I))		if (!isa<BinaryOperator>(I))
P = nullptr;		P = nullptr;
// Try to match and vectorize a horizontal reduction.		// Try to match and vectorize a horizontal reduction.
return canBeVectorized(P, I, BB, R, TTI,		return tryToVectorizeHorReductionOrInstOperands(
[this](BinaryOperator *BI, BoUpSLP &R) -> bool {		P, I, BB, R, TTI, [this](BinaryOperator *BI, BoUpSLP &R) -> bool {
return tryToVectorize(BI, R);		return tryToVectorize(BI, R);
});		});
}		}

bool SLPVectorizerPass::vectorizeChainsInBlock(BasicBlock *BB, BoUpSLP &R) {		bool SLPVectorizerPass::vectorizeChainsInBlock(BasicBlock *BB, BoUpSLP &R) {
bool Changed = false;		bool Changed = false;
SmallVector<Value *, 4> Incoming;		SmallVector<Value *, 4> Incoming;
SmallSet<Value *, 16> VisitedInstrs;		SmallSet<Value *, 16> VisitedInstrs;

bool HaveVectorizedPhiNodes = true;		bool HaveVectorizedPhiNodes = true;

This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Improve comments and naming of functions/variables/members, NFC.
ClosedPublic