This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize loads in horizontal reductions when they happen to be in the reverse order
ClosedPublic

Authored by mkuper on Jul 19 2016, 5:05 PM.

Details

Reviewers

mzolotukhin
mssimpso

Commits

rG38e72980936c: [SLPVectorizer] Vectorize reverse-order loads in horizontal reductions
rL276477: [SLPVectorizer] Vectorize reverse-order loads in horizontal reductions

Summary

This fixes PR28474.

Admittedly, this is probably not the "right way" to fix this. We already had one case we handle by rebuilding the entire tree with the roots reversed (line 3816), and this adds another one, but this does a lot of redundant work, and obviously can't work for arbitrary orders of the loads.

What we'd really want is to sort the loads "on-the-fly" while building the tree, at least in the cases where the order of the scalars doesn't matter (i.e. reductions). In theory, we could also have a load + shuffle when the order does matter, or even prefer sorting loads to sorting stores when we have a store-rooted reduction, but that'd require additional cost modeling. Unfortunately, I still don't grok the SLP vectorizer enough to understand how to do that correctly. It's not just a question of making the tree mutable - it seems like there are plenty of places that assume the order doesn't change (treating VL[0] as special, isSame() that cares about the order, external users, etc.).
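As a rough illustration of the "sort the loads" idea in the paragraph above, here is a minimal standalone sketch. It is not SLPVectorizer code: `ToyLoad` and `sortToConsecutive` are hypothetical names, and each load is modeled as a plain element offset from a shared base pointer rather than as a `LoadInst` analyzed through SCEV.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-in for a scalar load: an element offset from a common
// base pointer. The real pass reasons about LoadInst pointers instead.
struct ToyLoad {
  int64_t Offset;
};

// Returns the permutation that puts the loads in ascending address order if
// the sorted offsets form one consecutive run, and std::nullopt otherwise.
std::optional<std::vector<size_t>>
sortToConsecutive(const std::vector<ToyLoad> &Loads) {
  std::vector<size_t> Order(Loads.size());
  std::iota(Order.begin(), Order.end(), 0);
  std::sort(Order.begin(), Order.end(), [&](size_t A, size_t B) {
    return Loads[A].Offset < Loads[B].Offset;
  });
  // After sorting, each offset must be exactly one past its predecessor.
  for (size_t I = 1; I < Order.size(); ++I)
    if (Loads[Order[I]].Offset != Loads[Order[I - 1]].Offset + 1)
      return std::nullopt;
  return Order;
}
```

For a reduction, where the scalar order does not matter, the returned permutation could be applied to the whole bundle. For a store-rooted tree it would instead describe the shuffle needed after a wide load.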

Advice will be appreciated.

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 64606. Jul 19 2016, 5:05 PM

mkuper retitled this revision from to [SLP] Vectorize loads in horizontal reductions when they happen to be in the reverse order.

mkuper updated this object.

mkuper added reviewers: mzolotukhin, mssimpso.

mkuper added a subscriber: llvm-commits.

Herald added a subscriber: mzolotukhin. Jul 19 2016, 5:05 PM

Hi Michael,

Thanks for working on this. The issue of unordered loads seems like an arbitrary restriction to me that needs fixing. The patch basically looks okay; I inlined a few comments.

I agree with most of your high-level points. We shouldn't really need to pay the cost of rebuilding the entire tree when we find that the loads are unordered. In fact, except for loads and stores, the order of the instructions shouldn't really matter at all. The fact that we assume an order in some cases is a problem that should be fixable. For example, I think VL[0] is commonly used as a representative of the bundle for things like getting the type, not to imply an ordering. I bet we could eliminate most instances of VL[0] with appropriate additions to TreeEntry. And we keep track of a Lane index for ExternalUsers just so we can extract the correct value from the vectors. But this is already kind of redundant since ExternalUser also holds a pointer to the scalar value and we maintain a list of the scalar values in TreeEntry.
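To make the point about the redundant lane index concrete, here is a small sketch of recovering a lane from the scalar list alone. `ToyTreeEntry` and `findLane` are hypothetical names, and scalars are represented by strings instead of `Value *`; this only illustrates the lookup, not how `TreeEntry` or `ExternalUser` are actually laid out.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified bundle: just the ordered list of scalar names.
struct ToyTreeEntry {
  std::vector<std::string> Scalars;
};

// Recover the lane of a scalar by searching the entry's scalar list, so the
// lane does not have to be stored separately alongside the external user.
std::optional<size_t> findLane(const ToyTreeEntry &Entry,
                               const std::string &Scalar) {
  auto It = std::find(Entry.Scalars.begin(), Entry.Scalars.end(), Scalar);
  if (It == Entry.Scalars.end())
    return std::nullopt;
  return static_cast<size_t>(It - Entry.Scalars.begin());
}
```

The trade-off is a linear search per lookup instead of a stored index, which is cheap for the small bundle sizes involved but would have to be re-derived if the bundle were reordered.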

So the difficult case seems to be the store-rooted trees with unordered loads. For that, we could use shuffles like you mentioned.

I think it's fine to only check for the reverse-consecutive case since that's all we are currently doing, but a more general solution would eventually be nice to have. Compile-time might be a concern there, though.

Matt.

lib/Transforms/Vectorize/SLPVectorizer.cpp
884 ↗	(On Diff #64606)	Please update the comment since we now track reversed bundles of size greater than 2.
1158 ↗	(On Diff #64606)	We also check for the isSimple case here as well. It would be nice to make the comment more detailed since you're already changing it.
1174 ↗	(On Diff #64606)	We can break out of the loop when setting Consecutive to false to avoid unneeded calls to isConsecutiveAccess. But I'm wondering if there are other ways to make this faster. We are looking for the all consecutive or all reverse-consecutive cases. So for example if we find at least one consecutive pair in this first loop, there's no chance the list can be all reverse-consecutive.
3818 ↗	(On Diff #64606)	This assert is confusing now. I think it only makes sense as-is because allowReorder is only true when coming from tryToVectorizePair. But you've removed the size-equal-2 restriction when checking for the reverse consecutive case. I think we should add a more explanatory comment. What do you think?
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4–6 ↗	(On Diff #64606)	Please use regular expressions for the unnamed instructions to avoid future breakage.
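The observation in the line 1174 comment (one consecutive pair rules out the all-reversed case, and vice versa) can be sketched as a single pass in which the first pair fixes the only direction still possible. This is a toy model under stated assumptions: `classifyOrder`, `LoadOrder` and `toyIsConsecutive` are hypothetical, and integer offsets stand in for the real `isConsecutiveAccess` query.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class LoadOrder { Consecutive, Reversed, Neither };

// Toy replacement for isConsecutiveAccess: B directly follows A.
static bool toyIsConsecutive(int64_t A, int64_t B) { return B == A + 1; }

LoadOrder classifyOrder(const std::vector<int64_t> &Offsets) {
  if (Offsets.size() < 2)
    return LoadOrder::Consecutive;
  // The first pair decides which direction is still possible.
  bool Forward = toyIsConsecutive(Offsets[0], Offsets[1]);
  if (!Forward && !toyIsConsecutive(Offsets[1], Offsets[0]))
    return LoadOrder::Neither;
  // Every remaining pair only needs to be checked in that one direction.
  for (size_t I = 1; I + 1 < Offsets.size(); ++I) {
    bool Ok = Forward ? toyIsConsecutive(Offsets[I], Offsets[I + 1])
                      : toyIsConsecutive(Offsets[I + 1], Offsets[I]);
    if (!Ok)
      return LoadOrder::Neither;
  }
  return Forward ? LoadOrder::Consecutive : LoadOrder::Reversed;
}
```

In this sketch the worst case is one extra query on the first pair, after which each pair costs a single check and the loop exits on the first mismatch.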

Thanks a lot for the review, Matt!

In D22554#491805, @mssimpso wrote:

I agree with most of your high-level points. We shouldn't really need to pay the cost of rebuilding the entire tree when we find that the loads are unordered. In fact, except for loads and stores, the order of the instructions shouldn't really matter at all. The fact that we assume an order in some cases is a problem that should be fixable. For example, I think VL[0] is commonly used as a representative of the bundle for things like getting the type, not to imply an ordering. I bet we could eliminate most instances of VL[0] with appropriate additions to TreeEntry. And we keep track of a Lane index for ExternalUsers just so we can extract the correct value from the vectors. But this is already kind of redundant since ExternalUser also holds a pointer to the scalar value and we maintain a list of the scalar values in TreeEntry.

Yes, that seems fairly simple in theory, but I tried to actually implement something like this - without rewriting the whole thing from scratch :-) - and ran into more trouble than I expected.
Then again, it may just be my lack of familiarity with the code.

So the difficult case seems to be the store-rooted trees with unordered loads. For that, we could use shuffles like you mentioned.

I think it's a bit more complex than that.
I agree the most difficult case is store-rooted trees with unordered loads, but there are at least two other issues that aren't related to the technical problems with getting in-tree sorting to work:

Even for non-store-rooted trees, we may have several load leaves in the tree, and they may need different orders.
There may be a cost modeling issue - I'm not sure a wide load + shuffles is always better than scalar loads. The decision may depend on the width of the vector and on how expensive the specific shuffle is.
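The cost-modeling concern can be illustrated with a toy comparison. Everything here is an assumption made for illustration: `ToyCosts`, `reorderedLoadCostDelta` and `worthVectorizing` are hypothetical, the numbers a caller passes in are made up, and real costs would come from TargetTransformInfo. The only thing borrowed from the patch is the convention that a negative cost means the vector form is estimated to be cheaper and is compared against a threshold.

```cpp
// Hypothetical per-operation costs; real values are target-specific.
struct ToyCosts {
  int ScalarLoad; // one scalar load
  int VectorLoad; // one wide load covering the whole bundle
  int Shuffle;    // one shuffle to restore the required lane order
};

// Cost of (wide load + optional shuffle) minus cost of Width scalar loads.
// A negative result means the vector form is estimated to be cheaper.
int reorderedLoadCostDelta(const ToyCosts &C, unsigned Width,
                           bool NeedsShuffle) {
  int VectorCost = C.VectorLoad + (NeedsShuffle ? C.Shuffle : 0);
  int ScalarCost = C.ScalarLoad * static_cast<int>(Width);
  return VectorCost - ScalarCost;
}

// Mirrors the shape of the `Cost >= -SLPCostThreshold` bail-out in the diff.
bool worthVectorizing(int Delta, int Threshold) { return Delta < -Threshold; }
```

With these toy inputs, a cheap shuffle on a wide bundle comes out profitable, while an expensive shuffle on a two-element bundle does not, which is the dependence on vector width and shuffle cost described above.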

lib/Transforms/Vectorize/SLPVectorizer.cpp
884 ↗	(On Diff #64606)	Right, thanks!
1158 ↗	(On Diff #64606)	Sure.
1174 ↗	(On Diff #64606)	Good point. I'll try to make it a bit more efficient.
3818 ↗	(On Diff #64606)	Right. At first, I removed the assert and replaced this code with std::reverse, like the other case, but I decided to keep the assert because it makes clear which cases are currently supposed to get here. I'll add a comment explaining this.
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4–6 ↗	(On Diff #64606)	Sure.

Addressed Matt's comments.

LGTM after a few more changes. Thanks!

lib/Transforms/Vectorize/SLPVectorizer.cpp
1206–1209 ↗	(On Diff #65000)	I think this would be clearer if, instead of the !Consecutive check, we had the following after setting Consecutive/ReverseConsecutive:
	if (Consecutive) {
	  ++NumLoadsWantToKeepOrder;
	  newTreeEntry(VL, true);
	  DEBUG(dbgs() << "SLP: added a vector of loads.\n");
	  return;
	}
	// If none of the load pairs...
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
1 ↗	(On Diff #65000)	You don't need the basicaa or dce flags for this test.

This revision is now accepted and ready to land. Jul 22 2016, 1:50 PM

Closed by commit rL276477: [SLPVectorizer] Vectorize reverse-order loads in horizontal reductions (authored by mkuper). Jul 22 2016, 2:36 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp	69 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/reduction_loads.ll	49 lines

Diff 65160

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

(872 lines hidden)	#endif

/// Performs the "real" scheduling. Done before vectorization is actually		/// Performs the "real" scheduling. Done before vectorization is actually
/// performed in a basic block.		/// performed in a basic block.
void scheduleBlock(BlockScheduling *BS);		void scheduleBlock(BlockScheduling *BS);

/// List of users to ignore during scheduling and that don't need extracting.		/// List of users to ignore during scheduling and that don't need extracting.
ArrayRef<Value *> UserIgnoreList;		ArrayRef<Value *> UserIgnoreList;

// Number of load-bundles, which contain consecutive loads.		// Number of load bundles that contain consecutive loads.
int NumLoadsWantToKeepOrder;		int NumLoadsWantToKeepOrder;

// Number of load-bundles of size 2, which are consecutive loads if reversed.		// Number of load bundles that contain consecutive loads in reversed order.
int NumLoadsWantToChangeOrder;		int NumLoadsWantToChangeOrder;

// Analysis and block reference.		// Analysis and block reference.
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AliasAnalysis *AA;		AliasAnalysis *AA;
(256 lines hidden)	case Instruction::Load: {

if (DL->getTypeSizeInBits(ScalarTy) !=		if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {		DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");		DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;		return;
}		}
// Check if the loads are consecutive or of we need to swizzle them.
		// Make sure all loads in the bundle are simple - we can't vectorize
		// atomic or volatile loads.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
LoadInst *L = cast<LoadInst>(VL[i]);		LoadInst *L = cast<LoadInst>(VL[i]);
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
		}

		// Check if the loads are consecutive, reversed, or neither.
		// TODO: What we really want is to sort the loads, but for now, check
		// the two likely directions.
		bool Consecutive = true;
		bool ReverseConsecutive = true;
		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
if (VL.size() == 2 && isConsecutiveAccess(VL[1], VL[0], DL, SE)) {		Consecutive = false;
++NumLoadsWantToChangeOrder;		break;
}		} else {
BS.cancelScheduling(VL);		ReverseConsecutive = false;
newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
return;
}		}
}		}

		if (Consecutive) {
++NumLoadsWantToKeepOrder;		++NumLoadsWantToKeepOrder;
newTreeEntry(VL, true);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of loads.\n");		DEBUG(dbgs() << "SLP: added a vector of loads.\n");
return;		return;
}		}

		// If none of the load pairs were consecutive when checked in order,
		// check the reverse order.
		if (ReverseConsecutive)
		for (unsigned i = VL.size() - 1; i > 0; --i)
		if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
		ReverseConsecutive = false;
		break;
		}

		BS.cancelScheduling(VL);
		newTreeEntry(VL, false);

		if (ReverseConsecutive) {
		++NumLoadsWantToChangeOrder;
		DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");
		} else {
		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
		}
		return;
		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
case Instruction::SIToFP:		case Instruction::SIToFP:
(2,603 lines hidden)	DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "
<< "\n");		<< "\n");
ArrayRef<Value *> Ops = VL.slice(i, OpsWidth);		ArrayRef<Value *> Ops = VL.slice(i, OpsWidth);

ArrayRef<Value *> BuildVectorSlice;		ArrayRef<Value *> BuildVectorSlice;
if (!BuildVector.empty())		if (!BuildVector.empty())
BuildVectorSlice = BuildVector.slice(i, OpsWidth);		BuildVectorSlice = BuildVector.slice(i, OpsWidth);

R.buildTree(Ops, BuildVectorSlice);		R.buildTree(Ops, BuildVectorSlice);
// TODO: check if we can allow reordering also for other cases than		// TODO: check if we can allow reordering for more cases.
// tryToVectorizePair()
if (allowReorder && R.shouldReorder()) {		if (allowReorder && R.shouldReorder()) {
		// Conceptually, there is nothing actually preventing us from trying to
		// reorder a larger list. In fact, we do exactly this when vectorizing
		// reductions. However, at this point, we only expect to get here from
		// tryToVectorizePair().
assert(Ops.size() == 2);		assert(Ops.size() == 2);
assert(BuildVectorSlice.empty());		assert(BuildVectorSlice.empty());
Value *ReorderedOps[] = { Ops[1], Ops[0] };		Value *ReorderedOps[] = { Ops[1], Ops[0] };
R.buildTree(ReorderedOps, None);		R.buildTree(ReorderedOps, None);
}		}
R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
int Cost = R.getTreeCost();		int Cost = R.getTreeCost();

(261 lines hidden)	bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI) {
Value *VectorizedTree = nullptr;		Value *VectorizedTree = nullptr;
IRBuilder<> Builder(ReductionRoot);		IRBuilder<> Builder(ReductionRoot);
FastMathFlags Unsafe;		FastMathFlags Unsafe;
Unsafe.setUnsafeAlgebra();		Unsafe.setUnsafeAlgebra();
Builder.setFastMathFlags(Unsafe);		Builder.setFastMathFlags(Unsafe);
unsigned i = 0;		unsigned i = 0;

for (; i < NumReducedVals - ReduxWidth + 1; i += ReduxWidth) {		for (; i < NumReducedVals - ReduxWidth + 1; i += ReduxWidth) {
V.buildTree(makeArrayRef(&ReducedVals[i], ReduxWidth), ReductionOps);		auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);
		V.buildTree(VL, ReductionOps);
		if (V.shouldReorder()) {
		SmallVector<Value *, 8> Reversed(VL.rbegin(), VL.rend());
		V.buildTree(Reversed, ReductionOps);
		}
V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
int Cost = V.getTreeCost() + getReductionCost(TTI, ReducedVals[i]);		int Cost = V.getTreeCost() + getReductionCost(TTI, ReducedVals[i]);
if (Cost >= -SLPCostThreshold)		if (Cost >= -SLPCostThreshold)
break;		break;

DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:" << Cost		DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:" << Cost
(561 lines hidden)

llvm/trunk/test/Transforms/SLPVectorizer/X86/reduction_loads.ll

				; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-apple-macosx10.10.0 -mattr=+sse4.2 | FileCheck %s

				; CHECK-LABEL: @test
				; CHECK: [[CAST:%.*]] = bitcast i32* %p to <8 x i32>*
				; CHECK: [[LOAD:%.*]] = load <8 x i32>, <8 x i32>* [[CAST]], align 4
				; CHECK: mul <8 x i32> <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>, [[LOAD]]

				define i32 @test(i32* nocapture readonly %p) {
				entry:
				%arrayidx.1 = getelementptr inbounds i32, i32* %p, i64 1
				%arrayidx.2 = getelementptr inbounds i32, i32* %p, i64 2
				%arrayidx.3 = getelementptr inbounds i32, i32* %p, i64 3
				%arrayidx.4 = getelementptr inbounds i32, i32* %p, i64 4
				%arrayidx.5 = getelementptr inbounds i32, i32* %p, i64 5
				%arrayidx.6 = getelementptr inbounds i32, i32* %p, i64 6
				%arrayidx.7 = getelementptr inbounds i32, i32* %p, i64 7
				br label %for.body

				for.body:
				%sum = phi i32 [ 0, %entry ], [ %add.7, %for.body ]
				%tmp = load i32, i32* %p, align 4
				%mul = mul i32 %tmp, 42
				%add = add i32 %mul, %sum
				%tmp5 = load i32, i32* %arrayidx.1, align 4
				%mul.1 = mul i32 %tmp5, 42
				%add.1 = add i32 %mul.1, %add
				%tmp6 = load i32, i32* %arrayidx.2, align 4
				%mul.2 = mul i32 %tmp6, 42
				%add.2 = add i32 %mul.2, %add.1
				%tmp7 = load i32, i32* %arrayidx.3, align 4
				%mul.3 = mul i32 %tmp7, 42
				%add.3 = add i32 %mul.3, %add.2
				%tmp8 = load i32, i32* %arrayidx.4, align 4
				%mul.4 = mul i32 %tmp8, 42
				%add.4 = add i32 %mul.4, %add.3
				%tmp9 = load i32, i32* %arrayidx.5, align 4
				%mul.5 = mul i32 %tmp9, 42
				%add.5 = add i32 %mul.5, %add.4
				%tmp10 = load i32, i32* %arrayidx.6, align 4
				%mul.6 = mul i32 %tmp10, 42
				%add.6 = add i32 %mul.6, %add.5
				%tmp11 = load i32, i32* %arrayidx.7, align 4
				%mul.7 = mul i32 %tmp11, 42
				%add.7 = add i32 %mul.7, %add.6
				br i1 true, label %for.end, label %for.body

				for.end:
				ret i32 %add.7
				}
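
For orientation, the test above appears to compute the sum of `p[i] * 42` over eight consecutive `i32` elements (the loop exits after one iteration because of `br i1 true`). The sketch below is a hand-written scalar equivalent, not something taken from the patch; `reduceForward` and `reduceReversed` are hypothetical names. It shows why the loads of a reduction can be visited in either order: wrapping integer addition is commutative and associative, so both directions give the same result.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sum of P[i] * 42 in ascending index order, with 32-bit wraparound.
uint32_t reduceForward(const std::array<uint32_t, 8> &P) {
  uint32_t Sum = 0;
  for (size_t I = 0; I < P.size(); ++I)
    Sum += P[I] * 42u;
  return Sum;
}

// Same reduction, visiting the elements in descending index order.
uint32_t reduceReversed(const std::array<uint32_t, 8> &P) {
  uint32_t Sum = 0;
  for (size_t I = P.size(); I-- > 0;)
    Sum += P[I] * 42u;
  return Sum;
}
```

This order independence holds for integer adds like the ones in the test. Floating-point reductions would only have it under relaxed math flags such as the unsafe-algebra flag that `tryToReduce` sets on its builder.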