This is an archive of the discontinued LLVM Phabricator instance.

Differential D18940

Loop vectorization with uniform load
Closed, Public

Authored by delena on Apr 10 2016, 2:55 AM.

Details

Reviewers

• HaoLiu
Ayal
aschwaighofer
hfinkel

Commits

rG751ed0a06a9b: Loop vectorization with uniform load
rL265901: Loop vectorization with uniform load

Summary

Consider a loop with a uniform load:

float inc = 0.5;
void foo(float *A, unsigned N) {
  for (int i=0;i<N;i++){
    A[i] += inc; // Uniform load of inc
  }
}
If the "uniform load" is not hoisted before vectorization, the cost of the uniform load is "scalar load + broadcast".
It is not correctly calculated in the current version and a huge cost for one splat vector prevents loop vectorization.
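
To make the cost argument concrete, here is a minimal standalone sketch (not LLVM code) that compares a per-lane scalarized cost with the "scalar load + broadcast" cost. The struct UnitCosts, both function names, and every unit value are hypothetical and chosen for illustration only; real numbers come from TargetTransformInfo and vary by target.

```cpp
#include <cassert>

// Hypothetical unit costs. Real values are target-specific and are
// queried from TargetTransformInfo; these fields are assumptions.
struct UnitCosts {
  unsigned AddressComputation; // cost of computing one address
  unsigned ScalarLoad;         // cost of one scalar load
  unsigned InsertElement;      // cost of inserting one lane into a vector
  unsigned Broadcast;          // cost of splatting a scalar across a vector
};

// Scalarized model: every lane pays for its own address, load and insert,
// so the total grows linearly with the vectorization factor VF.
unsigned scalarizedLoadCost(unsigned VF, const UnitCosts &C) {
  assert(VF > 1 && "model only applies to vector factors");
  return VF * (C.AddressComputation + C.ScalarLoad + C.InsertElement);
}

// Uniform-load model: one address, one scalar load, one broadcast,
// independent of VF.
unsigned uniformLoadCost(const UnitCosts &C) {
  return C.AddressComputation + C.ScalarLoad + C.Broadcast;
}
```

With illustrative units {10, 1, 1, 1}, the scalarized model gives 96 at VF=8, while the uniform model stays at 12 for any VF. The point is only the shape of the two curves, not the specific values.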

Diff Detail

Repository: rL LLVM

Event Timeline

delena updated this revision to Diff 53168. Apr 10 2016, 2:55 AM

delena retitled this revision from to Loop vectorization with uniform load.

delena updated this object.

delena added reviewers: Ayal, aschwaighofer, • HaoLiu.

delena set the repository for this revision to rL LLVM.

delena added a subscriber: llvm-commits.

Herald added a subscriber: mzolotukhin. Apr 10 2016, 2:55 AM

This seems like the right cost to attribute to a uniform load.

In general, as in this testcase, uniform loads that are loop-invariant are better hoisted before the vectorizer runs, which could yield even lower costs. But that's an orthogonal issue.

It would be good to align the instructions emitted when vectorizing such a load (i.e., scalarizeInstruction's insertElements) with this cost attributed to it.

../test/Transforms/LoopVectorize/X86/uniform_load.ll
1 ↗	(On Diff #53168)	Suffice to invoke -loop-vectorize rather than -O2?

Updated test file, according to Ayal's comments.

hfinkel accepted this revision. Apr 10 2016, 6:39 AM

hfinkel added a reviewer: hfinkel.

hfinkel added a subscriber: hfinkel.

hfinkel added inline comments.

../lib/Transforms/Vectorize/LoopVectorize.cpp
5827 ↗	(On Diff #53172)	This can just be `+` instead of `+=`, otherwise, this LGTM.

This revision is now accepted and ready to land. Apr 10 2016, 6:39 AM

delena added inline comments. Apr 10 2016, 9:50 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
5827 ↗	(On Diff #53172)	Yes, of course. Thank you.

Closed by commit rL265901: Loop vectorization with uniform load (authored by delena). Apr 10 2016, 9:58 AM

This revision was automatically updated to reflect the committed changes.

If the "uniform load" is not hoisted before vectorization, the cost of the uniform load is "scalar load + broadcast".

This is not necessarily a criticism of your change but in most cases this cost is still too conservative.

If we need memchecks to disambiguate the uniform access against the other memory accesses, the uniform load becomes loop-invariant after vectorization and a subsequent LICM will hoist it out of the loop (D17191). Thus the cost is really zero.

I also think that this is the common case, like your testcase. Just schedule a licm afterward and the load+shuffles will be hoisted out of the loop.

It is not correctly calculated in the current version and a huge cost for one splat vector prevents loop vectorization.

Can you please elaborate, how was the cost computed before?

Thanks,
Adam

spatel added a subscriber: spatel. Apr 11 2016, 11:33 AM

Can you please elaborate, how was the cost computed before?

25 for VF=2, 51 for VF=4 and 103 for VF=8

Thus the cost is really zero.

I know that the actual cost is 0, but I can't put 0 when the load is inside the loop.

the uniform load becomes loop-invariant after vectorization and a subsequent LICM will hoist it out of the loop

Do you know why the load wasn't hoisted before vectorization?

In D18940#397500, @delena wrote:

Can you please elaborate, how was the cost computed before?

25 for VF=2, 51 for VF=4 and 103 for VF=8

I don't mean the actual number.

Did we assume that we needed VF number of loads for each element rather than a single one with a shuffle/broadcast?

I am just trying to understand the before-picture. You only said that we were building a splat but that is true even after.

Thus the cost is really zero.

I know that the actual cost is 0, but I can't put 0 when the load is inside the loop.

Why not, if we know that it will be hoisted out? I don't see how this load could fail to be loop-invariant if it's legal to vectorize the loop. For example, in:

for (i = 0; i < 10; i++) {
  .. = a[5]
  a[i] = ...
}

a[5] is loop-variant but dependence analysis would not allow this loop to be vectorized because the dependence distance between a[5] and a[i] is not constant.
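
A small standalone sketch of why a[5] cannot be treated as invariant in a loop of this shape. The loop body (a[i] = a[5] + 1), the array contents, and the function names are assumptions filled in for illustration; the comment above leaves the body elided.

```cpp
#include <array>
#include <cassert>

// Original scalar loop: a[5] is re-read on every iteration, so iterations
// after i == 5 observe the value stored by iteration 5.
std::array<int, 10> scalarLoop() {
  std::array<int, 10> a{}; // all zeros
  for (int i = 0; i < 10; i++)
    a[i] = a[5] + 1;
  return a;
}

// Incorrect transformation: treats a[5] as loop-invariant and reads it once.
std::array<int, 10> hoistedLoop() {
  std::array<int, 10> a{}; // all zeros
  const int t = a[5];
  for (int i = 0; i < 10; i++)
    a[i] = t + 1;
  return a;
}

// The two versions disagree, so hoisting the load is not legal here.
void checkHoistingIsUnsafe() {
  assert(scalarLoop() != hoistedLoop());
}
```

In the scalar version, elements 6 through 9 see the updated a[5] and end up as 2; in the hoisted version every element is 1.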

the uniform load becomes loop-invariant after vectorization and a subsequent LICM will hoist it out of the loop

Do you know why the load wasn't hoisted before vectorization?

Because hoisting it requires multi-versioning of the loop with memchecks: we couldn't disambiguate the invariant load against the stores in the loop at compile time.

LICM does not currently perform multiversioning by default.
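
As a rough source-level picture of what multi-versioning with a memcheck means, here is a hand-written sketch. It is not what the vectorizer emits; the function names and the pointer parameter inc are hypothetical, and the overlap test is a simplified stand-in for the generated runtime checks.

```cpp
#include <cassert>
#include <cstdint>

// Reference semantics: the load of *inc stays inside the loop.
void addUniformReference(float *A, const float *inc, unsigned N) {
  assert(A && inc && "both pointers must be valid");
  for (unsigned i = 0; i < N; i++)
    A[i] += *inc;
}

// Hand-written loop versioning (illustrative only).
void addUniformVersioned(float *A, const float *inc, unsigned N) {
  assert(A && inc && "both pointers must be valid");
  // Runtime memcheck: does *inc lie entirely outside [A, A + N)?
  const auto Begin = reinterpret_cast<std::uintptr_t>(A);
  const auto End = reinterpret_cast<std::uintptr_t>(A + N);
  const auto Inc = reinterpret_cast<std::uintptr_t>(inc);
  if (Inc + sizeof(float) <= Begin || Inc >= End) {
    // No overlap: the load is loop-invariant and can be hoisted.
    const float Hoisted = *inc;
    for (unsigned i = 0; i < N; i++)
      A[i] += Hoisted;
  } else {
    // Possible overlap: fall back to the original loop.
    addUniformReference(A, inc, N);
  }
}
```

When inc points into A, the check fails and the original loop runs, so both functions produce the same result; when it does not, the fast path reads inc once.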

I don't mean the actual number.
Did we assume that we needed VF number of loads for each element rather than a single one with a shuffle/broadcast?

Yes
The address computation cost is taken as VF * getAddressComputationCost(Ty, IsComplex)
IsComplex is true.

Why not, if we know that it will be hoisted out?

I can change the cost to 0 and add comments.

Indeed, "isUniform" actually means "versionably invariant". Note that invariance implies uniformity but the converse may not always hold in general, in which case a positive shuffle cost inside the loop would be right. However in our current innermost-loop (predicated-)scev-based analysis case, they are synonymous, hence such loads should always be hoistable.

Unfortunate that such a naïve testcase evades alias analysis. Add "float * __restrict A" to see licm hoist inc.
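
For reference, a sketch of the restrict-qualified variant in C++. Note that __restrict is a compiler extension (accepted by GCC, Clang and MSVC) rather than standard C++, and whether LICM actually hoists the load depends on the compiler version and flags. fooHoisted is a hypothetical name showing the source-level equivalent of the hoisted form.

```cpp
#include <cassert>

float inc = 0.5f;

// __restrict promises that memory accessed through A is not also accessed
// through any other name in this function, which includes the global inc.
void foo(float *__restrict A, unsigned N) {
  for (unsigned i = 0; i < N; i++)
    A[i] += inc; // load of inc may now be treated as loop-invariant
}

// Source-level equivalent of the loop after the load has been hoisted.
void fooHoisted(float *__restrict A, unsigned N) {
  const float Inc = inc; // single load before the loop
  for (unsigned i = 0; i < N; i++)
    A[i] += Inc;
}
```

Both functions compute the same result as long as the caller honors the no-alias promise, that is, A does not point at inc.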

Revision Contents

Path                                                            Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp           9 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/uniform_load.ll    47 lines

Diff 53180

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

 case Instruction::Load: {
   Value *Ptr = SI ? SI->getPointerOperand() : LI->getPointerOperand();
   // We add the cost of address computation here instead of with the gep
   // instruction because only here we know whether the operation is
   // scalarized.
   if (VF == 1)
     return TTI.getAddressComputationCost(VectorTy) +
            TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);

+  if (LI && Legal->isUniform(Ptr)) {
+    // Scalar load + broadcast
+    unsigned Cost = TTI.getAddressComputationCost(ValTy->getScalarType());
+    Cost += TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
+                                Alignment, AS);
+    return Cost + TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast,
+                                     ValTy);
+  }

   // For an interleaved access, calculate the total cost of the whole
   // interleave group.
   if (Legal->isAccessInterleaved(I)) {
     auto Group = Legal->getInterleavedAccessGroup(I);
     assert(Group && "Fail to get an interleaved access group.");

     // Only calculate the cost once at the insert position.
     if (Group->getInsertPos() != I)

llvm/trunk/test/Transforms/LoopVectorize/X86/uniform_load.ll

; RUN: opt -basicaa -loop-vectorize -S -mcpu=core-avx2 < %s | FileCheck %s

;float inc = 0.5;
;void foo(float *A, unsigned N) {
;
;  for (unsigned i=0; i<N; i++){
;    A[i] += inc;
;  }
;}

; CHECK-LABEL: foo
; CHECK: vector.body
; CHECK: load <8 x float>
; CHECK: fadd <8 x float>
; CHECK: store <8 x float>

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@inc = global float 5.000000e-01, align 4

define void @foo(float* nocapture %A, i32 %N) #0 {
entry:
  %cmp3 = icmp eq i32 %N, 0
  br i1 %cmp3, label %for.end, label %for.body.preheader

for.body.preheader:                               ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %for.body.preheader ]
  %0 = load float, float* @inc, align 4
  %arrayidx = getelementptr inbounds float, float* %A, i64 %indvars.iv
  %1 = load float, float* %arrayidx, align 4
  %add = fadd float %0, %1
  store float %add, float* %arrayidx, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
  %exitcond = icmp eq i32 %lftr.wideiv, %N
  br i1 %exitcond, label %for.end.loopexit, label %for.body

for.end.loopexit:                                 ; preds = %for.body
  br label %for.end

for.end:                                          ; preds = %for.end.loopexit, %entry
  ret void
}