This is an archive of the discontinued LLVM Phabricator instance.

Loop vectorization with uniform load
ClosedPublic

Authored by delena on Apr 10 2016, 2:55 AM.

Details

Summary

Considered a loop with uniform load:

float inc = 0.5;
void foo(float *A, unsigned N) {

for (int i=0;i<N;i++){
  A[i] += inc; // Uniform load of inc
}

}
If the "uniform load" is not hoisted before vectorization, the cost of the uniform load is "scalar load + broadcast".
It is not correctly calculated in the current version and a huge cost for one splat vector prevents loop vectorization.

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 53168.Apr 10 2016, 2:55 AM
delena retitled this revision from to Loop vectorization with uniform load.
delena updated this object.
delena added reviewers: Ayal, aschwaighofer, HaoLiu.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: llvm-commits.
Ayal edited edge metadata.Apr 10 2016, 4:54 AM

This seems like the right cost to attribute to a uniform load.

In general, as in this testcase, uniform loads that are loop-invariant better be hoisted before the vectorizer, yielding potentially even lower costs. But that's an orthogonal issue.

It would be good to align the instructions emitted when vectorizing such a load (i.e., scalarizeInstruction's insertElements) with this cost attributed to it.

../test/Transforms/LoopVectorize/X86/uniform_load.ll
1 ↗(On Diff #53168)

Suffice to invoke -loop-vectorize rather than -O2?

delena updated this revision to Diff 53172.Apr 10 2016, 5:51 AM
delena edited edge metadata.

Updated test file, according to Ayal's comments.

hfinkel accepted this revision.Apr 10 2016, 6:39 AM
hfinkel added a reviewer: hfinkel.
hfinkel added a subscriber: hfinkel.
hfinkel added inline comments.
../lib/Transforms/Vectorize/LoopVectorize.cpp
5827 ↗(On Diff #53172)

This can just be + instead of +=, otherwise, this LGTM.

This revision is now accepted and ready to land.Apr 10 2016, 6:39 AM
delena added inline comments.Apr 10 2016, 9:50 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
5827 ↗(On Diff #53172)

Yes, of course. Thank you.

This revision was automatically updated to reflect the committed changes.
anemet added a subscriber: anemet.Apr 11 2016, 9:55 AM

If the "uniform load" is not hoisted before vectorization, the cost of the uniform load is "scalar load + broadcast".

This is not necessarily a criticism of your change but in most cases this cost is still too conservative.

If we need memchecks to disambiguate the the uniform access against the other memory accesses, the uniform load becomes loop-invariant after vectorization and a subsequent LICM will hoist it out of the loop (D17191). Thus the cost is really zero.

I also think that this is the common case, like your testcase. Just schedule a licm afterward and the load+shuffles will be hoisted out of the loop.

It is not correctly calculated in the current version and a huge cost for one splat vector prevents loop vectorization.

Can you please elaborate, how was the cost computed before?

Thanks,
Adam

spatel added a subscriber: spatel.Apr 11 2016, 11:33 AM

Can you please elaborate, how was the cost computed before?

25 for VF=2, 51 for VF=4 and 103 for VF=8

Thus the cost is really zero.

I know that the actual cost is 0, but I can't put 0 when the load is inside the loop.

the uniform load becomes loop-invariant after vectorization and a subsequent LICM will hoist it out of the loop

Do you know why the load wasn't hoisted before vectorization?

Can you please elaborate, how was the cost computed before?

25 for VF=2, 51 for VF=4 and 103 for VF=8

I don't mean the actual number.

Did we assume that we needed VF number of loads for each element rather than a single one with a shuffle/broadcast?

I am just trying to understand the before-picture. You only said that we were building a splat but that is true even after.

Thus the cost is really zero.

I know that the actual cost is 0, but I can't put 0 when the load is inside the loop.

Why not, if we know that it will be hoisted out? I don't see a way how this load wouldn't be loop-invariant if it's legal to vectorize the loop. For example, in:

for (i = 0; i < 10; i++) {

 .. = a[5]
a[i] = ...

}

a[5] is loop-variant but dependence analysis would not allow this loop to be vectorized because the dependence distance between a[5] and a[i] is not constant.

the uniform load becomes loop-invariant after vectorization and a subsequent LICM will hoist it out of the loop

Do you know why the load wasn't hoisted before vectorization?

Because it requires multi-versioning of the loop with memchecks because we couldn't disambiguate the invariant load against the stores in the loop at compile time.

LICM does not currently perform multiversioning by default.

I don't mean the actual number.
Did we assume that we needed VF number of loads for each element rather than a single one with a shuffle/broadcast?

Yes
The address computation cost is taken as VF * getAddressComputationCost(Ty, IsComplex)
IsComplex is true.

Why not, if we know that it will be hoisted out?

I can change the cost to 0 and add comments.

Ayal added a comment.Apr 18 2016, 9:48 AM

Indeed, "isUniform" actually means "versionably invariant". Note that invariance implies uniformity but the converse may not always hold in general, in which case a positive shuffle cost inside the loop would be right. However in our current innermost-loop (predicated-)scev-based analysis case, they are synonymous, hence such loads should always be hoistable.

Unfortunate that such a naïve testcase evades alias analysis. Add "float * __restrict A" to see licm hoist inc.