This is an archive of the discontinued LLVM Phabricator instance.

Don't punish vectorized arithmetic instruction whose type will be split to multiple registers.
ClosedPublic

Authored by congh on Dec 2 2015, 11:43 AM.

Download Raw Diff

Details

Reviewers

spatel
echristo
hfinkel
dexonsmith

Commits

rG94620278a448: Don't punish vectorized arithmetic instruction whose type will be split to…
rL254671: Don't punish vectorized arithmetic instruction whose type will be split to…

Summary

Currently in LLVM's cost model, a vectorized arithmetic instruction will have high cost if its type is split into multiple registers. However, this punishment is too heavy and unnecessary. The overhead of the split should not be on arithmetic instructions but instructions that implement the split. Note that during vectorization we have calculated the register pressure, and we only choose proper interleaving factor (and also vectorization factor) so that we don't use more registers than the maximum number.

Here is a very simple example: if a vadd has the cost 1, and if we double VF so that we need two registers to perform it, then its cost will become 4 with the current implementation, which will prevent us to use larger VF.

Diff Detail

Repository: rL LLVM

Event Timeline

congh updated this revision to Diff 41652.Dec 2 2015, 11:43 AM

congh retitled this revision from to Don't punish vectorized arithmetic instruction whose type will be split to multiple registers..

congh updated this object.

congh added reviewers: spatel, echristo, dexonsmith, hfinkel.

congh added a subscriber: davidxl.

jevinskie added a subscriber: jevinskie.Dec 2 2015, 3:09 PM

I know it's rather hand-wavy heuristics at this level, but I assume someone put that extra fudge factor in there to prevent a vectorization disaster (eg, https://llvm.org/bugs/show_bug.cgi?id=20225 ).

Do you see any differences in benchmarks after this change?

Can you provide a motivating example (PSAD?) to show how this change helps?

In D15159#301621, @spatel wrote:

I know it's rather hand-wavy heuristics at this level, but I assume someone put that extra fudge factor in there to prevent a vectorization disaster (eg, https://llvm.org/bugs/show_bug.cgi?id=20225 ).

Currently the VF in LLVM's loop vectorizer is selected based on the widest type in the loop. Therefore it is unlikely that any vector needs a split during type legalization. As now we have a flag to control weather to select VF based on smallest or widest type in the loop, this extra factor becomes an issue. IMO, we should have better way to prevent vectorization disaster instead of blaming arithmetic operations. In fact, we can get better performance from more parallelized arithmetics.

(BTW, I have commented the bug mentioned by you).

Do you see any differences in benchmarks after this change?

This change has limited influence on performance now due to the reason above. We need to observe the difference when -vectorizer-maximize-bandwidth is turned on by default. But I am glad to test it (ongoing).

Can you provide a motivating example (PSAD?) to show how this change helps?

As I am working on PSAD, it is certainly a good example. For PSAD loop, we want the VF to be 16, and also want the cost to be the smallest then. There are many issues to resolve now. The VF 16 is not chosen when -vectorizer-maximize-bandwidth is turned on for several reasons:

The register usages when VF=16 exceeds the maximum. This is actually a problem how to calculate register usage, and I have proposed a patch for it (http://reviews.llvm.org/D15177).

The cost when VF = 16 is higher than the cost when VF = 4 due to two reasons: a. The cost of zext vXi8 to vXi32 is too high when X >= 8. This is due to missing entries in the cost table. I have proposed a patch for it (http://reviews.llvm.org/D15132). b. The issue reported in this patch. Below is the output from loop vectorizer. The cost of zext v8i8 to v8i32 can be optimized to 3, but we still get cost 26 for VF=8 vs 11 for VF=4. If we reduce costs of those arithmetic operations from 4 to 2, we can get the total cost 20 for VF=8, beating the cost when VF=4 (zext from v4i8 to v4i32 should have cost 2 instead of 1 so we should have total cost 13 when VF=4).

LV: Found an estimated cost of 0 for VF 4 For instruction: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction: %s.015 = phi i32 [ 0, %entry ], [ %add, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 4 For instruction: %0 = load i8, i8* %arrayidx, align 1, !tbaa !1
LV: Found an estimated cost of 1 for VF 4 For instruction: %conv = zext i8 %0 to i32
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx2 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 4 For instruction: %1 = load i8, i8* %arrayidx2, align 1, !tbaa !1
LV: Found an estimated cost of 1 for VF 4 For instruction: %conv3 = zext i8 %1 to i32
LV: Found an estimated cost of 1 for VF 4 For instruction: %sub = sub nsw i32 %conv, %conv3
LV: Found an estimated cost of 1 for VF 4 For instruction: %ispos = icmp sgt i32 %sub, -1
LV: Found an estimated cost of 1 for VF 4 For instruction: %neg = sub nsw i32 0, %sub
LV: Found an estimated cost of 1 for VF 4 For instruction: %2 = select i1 %ispos, i32 %sub, i32 %neg
LV: Found an estimated cost of 1 for VF 4 For instruction: %add = add nsw i32 %2, %s.015
LV: Found an estimated cost of 1 for VF 4 For instruction: %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction: %exitcond = icmp eq i64 %indvars.iv.next, 1024
LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %exitcond, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 4 costs: 2.
LV: Found an estimated cost of 0 for VF 8 For instruction: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 8 For instruction: %s.015 = phi i32 [ 0, %entry ], [ %add, %for.body ]
LV: Found an estimated cost of 0 for VF 8 For instruction: %arrayidx = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 8 For instruction: %0 = load i8, i8* %arrayidx, align 1, !tbaa !1
LV: Found an estimated cost of 24 for VF 8 For instruction: %conv = zext i8 %0 to i32
LV: Found an estimated cost of 0 for VF 8 For instruction: %arrayidx2 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 8 For instruction: %1 = load i8, i8* %arrayidx2, align 1, !tbaa !1
LV: Found an estimated cost of 24 for VF 8 For instruction: %conv3 = zext i8 %1 to i32
LV: Found an estimated cost of 4 for VF 8 For instruction: %sub = sub nsw i32 %conv, %conv3
LV: Found an estimated cost of 2 for VF 8 For instruction: %ispos = icmp sgt i32 %sub, -1
LV: Found an estimated cost of 4 for VF 8 For instruction: %neg = sub nsw i32 0, %sub
LV: Found an estimated cost of 2 for VF 8 For instruction: %2 = select i1 %ispos, i32 %sub, i32 %neg
LV: Found an estimated cost of 4 for VF 8 For instruction: %add = add nsw i32 %2, %s.015
LV: Found an estimated cost of 1 for VF 8 For instruction: %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 8 For instruction: %exitcond = icmp eq i64 %indvars.iv.next, 1024
LV: Found an estimated cost of 0 for VF 8 For instruction: br i1 %exitcond, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 8 costs: 8.
LV: Selecting VF: 4.

In D15159#301837, @congh wrote:

IMO, we should have better way to prevent vectorization disaster instead of blaming arithmetic operations.

It's probably related to the comment here. :)

// TODO: Once we have extract/insert subvector cost we need to use them.

a. The cost of zext vXi8 to vXi32 is too high when X >= 8. This is due to missing entries in the cost table. I have proposed a patch for it (http://reviews.llvm.org/D15132).

Just as another reference of the vectorizer's cost model, this reminds me of another bug:
https://llvm.org/bugs/show_bug.cgi?id=21356

I don't see anything fundamentally wrong with removing this fudge factor if nothing is broken and it will help your PSAD work, so LGTM.

This revision is now accepted and ready to land.Dec 3 2015, 2:31 PM

In D15159#301957, @spatel wrote:
In D15159#301837, @congh wrote:

IMO, we should have better way to prevent vectorization disaster instead of blaming arithmetic operations.

It's probably related to the comment here. :)
// TODO: Once we have extract/insert subvector cost we need to use them.
a. The cost of zext vXi8 to vXi32 is too high when X >= 8. This is due to missing entries in the cost table. I have proposed a patch for it (http://reviews.llvm.org/D15132).

Just as another reference of the vectorizer's cost model, this reminds me of another bug:
https://llvm.org/bugs/show_bug.cgi?id=21356

As you commented, this is due to excessive costs for vectorized integer-float conversions. I have left my comment there, and would like to work on it later.

I don't see anything fundamentally wrong with removing this fudge factor if nothing is broken and it will help your PSAD work, so LGTM.

Thank you very much for the review!

Closed by commit rL254671: Don't punish vectorized arithmetic instruction whose type will be split to… (authored by conghou). · Explain WhyDec 3 2015, 4:39 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

include/

llvm/

CodeGen/

BasicTTIImpl.h

6 lines

test/

Analysis/

CostModel/

X86/

reduction.ll

2 lines

Diff 41829

llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 296 Lines • ▼ Show 20 Lines	unsigned getArithmeticInstrCost(

bool IsFloat = Ty->getScalarType()->isFloatingPointTy();		bool IsFloat = Ty->getScalarType()->isFloatingPointTy();
// Assume that floating point arithmetic operations cost twice as much as		// Assume that floating point arithmetic operations cost twice as much as
// integer operations.		// integer operations.
unsigned OpCost = (IsFloat ? 2 : 1);		unsigned OpCost = (IsFloat ? 2 : 1);

if (TLI->isOperationLegalOrPromote(ISD, LT.second)) {		if (TLI->isOperationLegalOrPromote(ISD, LT.second)) {
// The operation is legal. Assume it costs 1.		// The operation is legal. Assume it costs 1.
// If the type is split to multiple registers, assume that there is some
// overhead to this.
// TODO: Once we have extract/insert subvector cost we need to use them.		// TODO: Once we have extract/insert subvector cost we need to use them.
if (LT.first > 1)		return LT.first * OpCost;
return LT.first * 2 * OpCost;
return LT.first * 1 * OpCost;
}		}

if (!TLI->isOperationExpand(ISD, LT.second)) {		if (!TLI->isOperationExpand(ISD, LT.second)) {
// If the operation is custom lowered then assume		// If the operation is custom lowered then assume
// thare the code is twice as expensive.		// thare the code is twice as expensive.
return LT.first * 2 * OpCost;		return LT.first * 2 * OpCost;
}		}

▲ Show 20 Lines • Show All 499 Lines • Show Last 20 Lines

llvm/trunk/test/Analysis/CostModel/X86/reduction.ll

Show All 27 Lines	<8 x i32> <i32 2 , i32 3, i32 undef, i32 undef,
i32 undef, i32 undef, i32 undef, i32 undef>		i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx.2 = add <8 x i32> %bin.rdx, %rdx.shuf.2		%bin.rdx.2 = add <8 x i32> %bin.rdx, %rdx.shuf.2
%rdx.shuf.3 = shufflevector <8 x i32> %bin.rdx.2, <8 x i32> undef,		%rdx.shuf.3 = shufflevector <8 x i32> %bin.rdx.2, <8 x i32> undef,
<8 x i32> <i32 1 , i32 undef, i32 undef, i32 undef,		<8 x i32> <i32 1 , i32 undef, i32 undef, i32 undef,
i32 undef, i32 undef, i32 undef, i32 undef>		i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx.3 = add <8 x i32> %bin.rdx.2, %rdx.shuf.3		%bin.rdx.3 = add <8 x i32> %bin.rdx.2, %rdx.shuf.3

; CHECK-LABEL: reduction_cost_int		; CHECK-LABEL: reduction_cost_int
; CHECK: cost of 23 {{.*}} extractelement		; CHECK: cost of 17 {{.*}} extractelement

%r = extractelement <8 x i32> %bin.rdx.3, i32 0		%r = extractelement <8 x i32> %bin.rdx.3, i32 0
ret i32 %r		ret i32 %r
}		}

define fastcc float @pairwise_hadd(<4 x float> %rdx, float %f1) {		define fastcc float @pairwise_hadd(<4 x float> %rdx, float %f1) {
%rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,		%rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
<4 x i32> <i32 0, i32 2 , i32 undef, i32 undef>		<4 x i32> <i32 0, i32 2 , i32 undef, i32 undef>
▲ Show 20 Lines • Show All 321 Lines • Show Last 20 Lines