This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Treat SelectInsts as reduction values.
Closed, Public

Authored by • chatur01 on Oct 21 2015, 10:12 AM.

Details

Reviewers

spatel
nadav
mzolotukhin
jmolloy

Commits

rG74c387feb702: [SLP] Treat SelectInsts as reduction values.
rL251424: [SLP] Treat SelectInsts as reduction values.

Summary

Certain workloads, in particular sum-of-absdiff loops, can be vectorized using SLP if it can treat select instructions as reduction values.

The test case is a bit awkward. The AArch64 cost model needs some tuning to not be so pessimistic about selects. I've had to tweak the SLP threshold here.
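
For readers unfamiliar with the pattern, a sum-of-absdiff loop of the kind the summary mentions might look roughly like the sketch below. This is a hypothetical C++ reconstruction: the argument names are borrowed from the test's IR, but the function name `sum_abs_diff` and the exact source shape are assumptions, not the code the test was generated from. The point is that each term added to the running sum is a compare plus select rather than a plain binary operator.

```cpp
#include <cassert>

// Hypothetical sum-of-absolute-differences kernel: h rows of 4 ints each,
// with a row stride of lx elements. Names follow the IR test's arguments.
int sum_abs_diff(const int *blk1, const int *blk2, int lx, int h) {
  int s = 0;
  const int *p1 = blk1;
  const int *p2 = blk2;
  for (int j = 0; j < h; ++j) {
    // Each term is "d < 0 ? -d : d", which typically lowers to icmp + select.
    // The four selects are then chained into s with adds, so the values
    // being reduced are selects, not binary operators.
    int d0 = p1[0] - p2[0];
    s += d0 < 0 ? -d0 : d0;
    int d1 = p1[1] - p2[1];
    s += d1 < 0 ? -d1 : d1;
    int d2 = p1[2] - p2[2];
    s += d2 < 0 ? -d2 : d2;
    int d3 = p1[3] - p2[3];
    s += d3 < 0 ? -d3 : d3;
    p1 += lx;
    p2 += lx;
  }
  return s;
}
```

Before this patch the reduction matcher would give up on such a chain, because the operands of the adds are neither binary operators nor the reduction PHI.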

Diff Detail

Repository: rL LLVM

Event Timeline

• chatur01 updated this revision to Diff 38026. Oct 21 2015, 10:12 AM

• chatur01 retitled this revision to [SLP] Treat SelectInsts as reduction values.

• chatur01 updated this object.

• chatur01 added reviewers: jmolloy, nadav, mzolotukhin, spatel.

• chatur01 set the repository for this revision to rL LLVM.

• chatur01 added a subscriber: llvm-commits.

Herald added a subscriber: aemerson. Oct 21 2015, 10:12 AM

The patch looks okay. The problem with adding new instructions to the list of ‘reductions’ is that it increases the compile time of the SLP vectorizer. Do you know how effective this pattern is? Do you know how many times it hits when you compile the LLVM test suite?

> Do you know how effective this pattern is? Do you know how many times it hits when you compile the LLVM test suite?

My testing has been on ARM and AArch64 targets. On its own, this patch does not yet affect performance for the benchmarks I have access to. To vectorize the more general pattern, there are a few more changes I need to propose, including searching for reductions in multi-block loops and trying different reduction widths. I was going to go through them one at a time.

With the other changes, the biggest improvements are in MPEG encoders in third-party benchmarks. I haven't yet measured the compile-time impact of this particular change; I will get some numbers for this. Thanks for your review!

--Charlie.

mcrosier added a subscriber: mssimpso. Oct 21 2015, 11:27 AM

Hi Nadav, I've analysed my patches in more detail now. Sorry for how long it took me to do it!

This patch doesn't affect compile-time performance in any significant way.

My testing methodology for LNT was to run the following with and without my changes, pegged on just one A57 CPU:

$ taskset -c 5 ./bin/lnt runtest nt --sandbox $SANDBOX --cc=$COMPILER --cxx=$COMPILER '--cflag=-mcpu=cortex-a57 -mllvm -slp-vectorize-hor -mllvm -slp-vectorize-hor-store -Wl,--allow-multiple-definition' --test-suite $TEST_SUITE_DIR --no-timestamp --make-param=ARCH=AArch64 --make-param=ENDIAN=little --make-param=RUNTIMELIMIT=7200 '--make-param=RUNUNDER=taskset -c 5' --multisample=1 --threads 1 --build-threads 1 --benchmarking-only --use-perf 1

I then collected all the compile time results from the two runs and compared them. The differences were all within noise.

The methodology for SPEC{2000,2006} was to use their runspec drivers, and to time each benchmark's build "action" from a clean installation with and without my patch. There were also no significant differences in these benchmarks.

There are no significant differences in performance with this patch across SPEC{2000,2006} and LNT.

• chatur01 mentioned this in D14063: [SLP] Try a bit harder to find reduction PHIs. Oct 26 2015, 6:35 AM

Okay. Please go ahead and commit this patch. I do have one minor comment:

      BinaryOperator *Next = dyn_cast<BinaryOperator>(NextV);
      if (Next)
        Stack.push_back(std::make_pair(Next, 0));
+     else if (SelectInst *SelI = dyn_cast<SelectInst>(NextV))
+       Stack.push_back(std::make_pair(SelI, 0));
      else if (NextV != Phi)
        return false;
    }

You can rewrite this code using ‘isa<BinaryOperator>(I) || isa<SelectInst>(I)’ and use a single push_back instruction.

It is also a good idea to add some comments to this code.

-Nadav

Address review comments.

I have another change to pass by review before I start landing these changes.
It's about choosing reduction widths. I'll put it up for review tomorrow. It
might be the wrong approach, so I don't want to land half-finished chains of
thought :-)

Thanks a lot for your time!

--Charlie.

LGTM!

• chatur01 mentioned this in D14116: [SLP] Be more aggressive about reduction width selection. Oct 27 2015, 6:52 AM

Closed by commit rL251424: [SLP] Treat SelectInsts as reduction values. (authored by • chatur01). Oct 27 2015, 10:51 AM

This revision was automatically updated to reflect the committed changes.

• chatur01 mentioned this in rL251425: [SLP] Try a bit harder to find reduction PHIs. Oct 27 2015, 10:56 AM

• chatur01 mentioned this in rL251428: [SLP] Be more aggressive about reduction width selection. Oct 27 2015, 11:01 AM

Revision Contents

Path                                                              Size
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp             13 lines
llvm/trunk/test/Transforms/SLPVectorizer/AArch64/horizontal.ll    73 lines

Diff 38560

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

[earlier lines not shown]
     if (ReduxWidth < 4)
       return false;

     // We currently only support adds.
     if (ReductionOpcode != Instruction::Add &&
         ReductionOpcode != Instruction::FAdd)
       return false;

     // Post order traverse the reduction tree starting at B. We only handle true
-    // trees containing only binary operators.
-    SmallVector<std::pair<BinaryOperator *, unsigned>, 32> Stack;
+    // trees containing only binary operators or selects.
+    SmallVector<std::pair<Instruction *, unsigned>, 32> Stack;
     Stack.push_back(std::make_pair(B, 0));
     while (!Stack.empty()) {
-      BinaryOperator *TreeN = Stack.back().first;
+      Instruction *TreeN = Stack.back().first;
       unsigned EdgeToVist = Stack.back().second++;
       bool IsReducedValue = TreeN->getOpcode() != ReductionOpcode;

       // Only handle trees in the current basic block.
       if (TreeN->getParent() != B->getParent())
         return false;

       // Each tree node needs to have one user except for the ultimate
[19 lines not shown]
         }
         // Retract.
         Stack.pop_back();
         continue;
       }

       // Visit left or right.
       Value *NextV = TreeN->getOperand(EdgeToVist);
-      BinaryOperator *Next = dyn_cast<BinaryOperator>(NextV);
-      if (Next)
-        Stack.push_back(std::make_pair(Next, 0));
+      // We currently only allow BinaryOperator's and SelectInst's as reduction
+      // values in our tree.
+      if (isa<BinaryOperator>(NextV) || isa<SelectInst>(NextV))
+        Stack.push_back(std::make_pair(cast<Instruction>(NextV), 0));
       else if (NextV != Phi)
         return false;
     }
     return true;
   }

   /// \brief Attempt to vectorize the tree found by
   /// matchAssociativeReduction.
[remaining lines not shown]
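
To make the traversal in this hunk easier to follow outside of LLVM, here is a minimal self-contained sketch of the same idea using toy types. `Kind`, `Node`, and `collectReducedValues` are hypothetical stand-ins, not LLVM classes or functions, and the handling of finished nodes is a simplification of the part of the hunk that is not shown. It walks the add chain with an explicit stack, accepts adds and selects as tree nodes with a single check and a single push, and rejects any other operand that is not the reduction PHI.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-ins for IR values. These are NOT LLVM's classes.
enum class Kind { Add, Select, Phi, Other };

struct Node {
  Kind kind;
  std::vector<const Node *> operands;
};

// Walks the tree rooted at `root` with an explicit (node, next-edge) stack.
// Add nodes are the reduction operations; any other node pushed onto the
// stack is treated as a reduced value. Returns false if an operand is
// neither an add, a select, nor the given phi.
bool collectReducedValues(const Node *root, const Node *phi,
                          std::vector<const Node *> &reduced) {
  std::vector<std::pair<const Node *, unsigned>> stack;
  stack.push_back({root, 0});
  while (!stack.empty()) {
    const Node *n = stack.back().first;
    unsigned edge = stack.back().second++;
    bool isReducedValue = n->kind != Kind::Add;

    // Post-order visit: reduced values are recorded immediately, adds are
    // popped once all of their operands have been visited. (Simplified
    // assumption; the real code performs more checks here.)
    if (isReducedValue || edge == n->operands.size()) {
      if (isReducedValue)
        reduced.push_back(n);
      stack.pop_back();
      continue;
    }

    // Visit the next operand: one check, one push for both accepted kinds.
    const Node *next = n->operands[edge];
    if (next->kind == Kind::Add || next->kind == Kind::Select)
      stack.push_back({next, 0});
    else if (next != phi)
      return false;
  }
  return true;
}
```

With a chain shaped like the test case (four selects folded into a PHI-carried sum by four adds), the sketch collects the four selects in order; replacing one select with an unsupported node makes it bail out.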

llvm/trunk/test/Transforms/SLPVectorizer/AArch64/horizontal.ll

; RUN: opt -slp-vectorizer -slp-threshold=-6 -slp-vectorize-hor -S < %s | FileCheck %s

; FIXME: The threshold is changed to keep this test case a bit smaller.
; The AArch64 cost model should not give such high costs to select statements.

target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux"

; CHECK-LABEL: test_select
; CHECK: load <4 x i32>
; CHECK: load <4 x i32>
; CHECK: select <4 x i1>
define i32 @test_select(i32* noalias nocapture readonly %blk1, i32* noalias nocapture readonly %blk2, i32 %lx, i32 %h) {
entry:
  %cmp.22 = icmp sgt i32 %h, 0
  br i1 %cmp.22, label %for.body.lr.ph, label %for.end

for.body.lr.ph: ; preds = %entry
  %idx.ext = sext i32 %lx to i64
  br label %for.body

for.body: ; preds = %for.body, %for.body.lr.ph
  %s.026 = phi i32 [ 0, %for.body.lr.ph ], [ %add27, %for.body ]
  %j.025 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
  %p2.024 = phi i32* [ %blk2, %for.body.lr.ph ], [ %add.ptr29, %for.body ]
  %p1.023 = phi i32* [ %blk1, %for.body.lr.ph ], [ %add.ptr, %for.body ]
  %0 = load i32, i32* %p1.023, align 4
  %1 = load i32, i32* %p2.024, align 4
  %sub = sub nsw i32 %0, %1
  %cmp2 = icmp slt i32 %sub, 0
  %sub3 = sub nsw i32 0, %sub
  %sub3.sub = select i1 %cmp2, i32 %sub3, i32 %sub
  %add = add nsw i32 %sub3.sub, %s.026
  %arrayidx4 = getelementptr inbounds i32, i32* %p1.023, i64 1
  %2 = load i32, i32* %arrayidx4, align 4
  %arrayidx5 = getelementptr inbounds i32, i32* %p2.024, i64 1
  %3 = load i32, i32* %arrayidx5, align 4
  %sub6 = sub nsw i32 %2, %3
  %cmp7 = icmp slt i32 %sub6, 0
  %sub9 = sub nsw i32 0, %sub6
  %v.1 = select i1 %cmp7, i32 %sub9, i32 %sub6
  %add11 = add nsw i32 %add, %v.1
  %arrayidx12 = getelementptr inbounds i32, i32* %p1.023, i64 2
  %4 = load i32, i32* %arrayidx12, align 4
  %arrayidx13 = getelementptr inbounds i32, i32* %p2.024, i64 2
  %5 = load i32, i32* %arrayidx13, align 4
  %sub14 = sub nsw i32 %4, %5
  %cmp15 = icmp slt i32 %sub14, 0
  %sub17 = sub nsw i32 0, %sub14
  %sub17.sub14 = select i1 %cmp15, i32 %sub17, i32 %sub14
  %add19 = add nsw i32 %add11, %sub17.sub14
  %arrayidx20 = getelementptr inbounds i32, i32* %p1.023, i64 3
  %6 = load i32, i32* %arrayidx20, align 4
  %arrayidx21 = getelementptr inbounds i32, i32* %p2.024, i64 3
  %7 = load i32, i32* %arrayidx21, align 4
  %sub22 = sub nsw i32 %6, %7
  %cmp23 = icmp slt i32 %sub22, 0
  %sub25 = sub nsw i32 0, %sub22
  %v.3 = select i1 %cmp23, i32 %sub25, i32 %sub22
  %add27 = add nsw i32 %add19, %v.3
  %add.ptr = getelementptr inbounds i32, i32* %p1.023, i64 %idx.ext
  %add.ptr29 = getelementptr inbounds i32, i32* %p2.024, i64 %idx.ext
  %inc = add nuw nsw i32 %j.025, 1
  %exitcond = icmp eq i32 %inc, %h
  br i1 %exitcond, label %for.end.loopexit, label %for.body

for.end.loopexit: ; preds = %for.body
  br label %for.end

for.end: ; preds = %for.end.loopexit, %entry
  %s.0.lcssa = phi i32 [ 0, %entry ], [ %add27, %for.end.loopexit ]
  ret i32 %s.0.lcssa
}
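
The CHECK lines only require two `<4 x i32>` loads and a `<4 x i1>` select. As a rough illustration of what that vectorized shape computes for one loop iteration, the sketch below emulates four lanes with `std::array` and compares the result with the scalar add chain. The function names are hypothetical, and the final pairwise horizontal add is an assumption about how the lanes are combined; the test does not check that part.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<int, 4>;

// Scalar form: the add chain from the test, for one row of four elements.
int row_sad_scalar(const int *p1, const int *p2) {
  int s = 0;
  for (int i = 0; i < 4; ++i) {
    int d = p1[i] - p2[i];
    s += d < 0 ? -d : d;
  }
  return s;
}

// Lane-wise form: two 4-wide loads, a 4-wide subtract, compare and select,
// then a horizontal add of the four lanes (assumed pairwise here).
int row_sad_lanes(const int *p1, const int *p2) {
  Vec4 a{}, b{}, sel{};
  for (int i = 0; i < 4; ++i) { // emulates the two <4 x i32> loads
    a[i] = p1[i];
    b[i] = p2[i];
  }
  for (int i = 0; i < 4; ++i) { // emulates sub, icmp slt, select per lane
    int d = a[i] - b[i];
    sel[i] = d < 0 ? 0 - d : d;
  }
  return (sel[0] + sel[1]) + (sel[2] + sel[3]);
}
```

Because integer addition is associative, reordering the adds this way does not change the result, which is why the matcher only needs the reduced values (the selects) and the reduction operations (the adds).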