Page MenuHomePhabricator

[LV] Support widened induction variables in epilogue vectorization.
Needs ReviewPublic

Authored by fhahn on Nov 25 2020, 1:23 PM.

Details

Summary

Code generation now uses the start VPValue of induction recipes.

This makes it possible to adjust the start value of the epilogue
vector loop to use the 'resume' value of the main vector loop.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptNov 25 2020, 1:23 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
fhahn requested review of this revision.Nov 25 2020, 1:23 PM
fhahn updated this revision to Diff 377997.Oct 7 2021, 1:47 PM

rebased now that all required dependencies landed a while ago.

fhahn retitled this revision from [LV] Support widened induction variables in epilogue vectorization (WIP). to [LV] Support widened induction variables in epilogue vectorization..Oct 7 2021, 1:53 PM
fhahn edited the summary of this revision. (Show Details)
fhahn added reviewers: bmahjour, Ayal, gilr, rengolin.

Instead of changing the generic interface for skeleton creation, how about adding a field to hold the resume value inside EpilogueVectorizerEpilogueLoop? Then inside executePlan() we would do something like this to update the induction widening recipes:

EpilogueVectorizerEpilogueLoop *EILV = dyn_cast<EpilogueVectorizerEpilogueLoop>(ILV);
if (EILV) {
  Value *IndStart = EILV->getResumeValue();
  assert(IndStart && "Expected valid resume value");
  ...
}

Also don't we need to store some sort of a map to be able to handle loops with multiple widened induction vars (with potentially different resume values)? I'm thinking of a case like this:

void foo(int * restrict A, int N) {
  int i, j, k;
  for (i = 0, j = 1, k = 2; i < N; i++, j++, k++)
    A[i] = j + k;
}
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7645

Don't we need to make sure that it is an "external definition" before treating it as such?

Hi Florian,
--Snip--
void foo(double * restrict a, double * restrict b, int N) {
int k=0;
for(int i = 0 , k =2; i < N; ++i, ++k)

a[i] = sin(i)+k;

}
--Snip--

I am able to see the resume value for induction "i" updated properly now after the patch .

--Snip--
33: ; preds = %8, %28

%34 = phi i64 [ %11, %28 ], [ 0, %8 ]
%35 = and i64 %6, 4294967294
%36 = trunc i64 %35 to i32
%37 = add i32 %36, 2
%38 = insertelement <2 x i64> poison, i64 %34, i32 0
%39 = shufflevector <2 x i64> %38, <2 x i64> poison, <2 x i32> zeroinitializer
%40 = or <2 x i64> %39, <i64 0, i64 1>
%41 = trunc i64 %34 to i32
%42 = insertelement <2 x i32> poison, i32 %41, i32 0
%43 = shufflevector <2 x i32> %42, <2 x i32> poison, <2 x i32> zeroinitializer
%44 = or <2 x i32> %43, <i32 0, i32 1>
br label %45

45: ; preds = %45, %33

%46 = phi i64 [ %34, %33 ], [ %55, %45 ]
%47 = phi <2 x i64> [ %40, %33 ], [ %56, %45 ]   <== updated
%48 = phi <2 x i32> [ %44, %33 ], [ %57, %45 ] <== updated

--Snip--

I was thinking if we could reuse the incremented widened induction results from the main vector loop. But this seems cleaner and simpler code .

Hi Florian,

This patch breaks one assert when building compiler-rt. However I noticed that in my work space I lowered the epilog generation thresholds to 8 so got this assert.

Here is the general reproducer.

---Snip winden-ce.cpp ---
unsigned short int * MemToShadow(unsigned int x);

void Add( unsigned int begin,unsigned int end, unsigned int cfi_check) {

// Don't fill anything below cfi_check. We can not represent those addresses
// in the shadow, and must make sure at codegen to place all valid call
// targets above cfi_check.
//begin = Max(begin, cfi_check);
unsigned short int *s = MemToShadow(begin);
unsigned short int *s_end = MemToShadow(end - 1) + 1;
unsigned short int  sv = ((begin - cfi_check)) + 1;
for (; s < s_end; s++, sv++)
  *s = sv;

}
---Snip--

clang++ -O3 winden-ce.cpp -S -emit-llvm -mllvm -epilogue-vectorization-minimum-VF=8

LoopVectorize.cpp:2535: virtual llvm::Value* llvm::InnerLoopVectorizer::getStepVector(llvm::Value*, int, llvm::Value*, llvm::Instruction::BinaryOps): Assertion `Step->getType() == STy && "Step has wrong type"' failed.

I noticed Main loop Vector splat created in the preheader is with i16 type (unsigned short *)
%.splat = shufflevector <8 x i16> %.splatinsert, <8 x i16> poison, <8 x i32> zeroinitializer

whereas Epilog loop creates vector splat with i32 type.
%.splat31 = shufflevector <4 x i64> %.splatinsert30, <4 x i64> poison, <4 x i32> zeroinitializer

fhahn planned changes to this revision.Oct 8 2021, 11:57 AM

Thanks for taking a look and giving it a try! I now remember what was still missing, handling noprimiary IV cases. I'll work to update the patch, but I think it might need a few more patches to bring other parts of the infrastructure up to speedd.

fhahn updated this revision to Diff 461316.Sep 19 2022, 12:42 PM

I reworked the patch to use a createInductionResumeValue to create the required resume values as needed.

Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2022, 12:42 PM
rengolin added inline comments.Sep 28 2022, 6:17 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3184–3185

unused?

7903

A comment here for the semantics of this would be nice

10498–10500

80 columns?

10531

This code is hard to follow, perhaps a comment?

Hi

I tried to build the patch on top of the latest trunk and facing this issue.

/home/amd/venkat/aocc-work/mirror/llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10505:24: error: ‘class llvm::VPWidenPointerInductionRecipe’ has no member named ‘getInductionDescriptor’
10505 | ID = &Ind->getInductionDescriptor();

Any other patch I am missing to add other than this one and its parent ?

fhahn updated this revision to Diff 463845.Sep 29 2022, 4:21 AM

Thanks for taking a look!

Comments should be addressed, also moved getInductionDescriptor definition from separate patch to this one.

fhahn marked 4 inline comments as done.Sep 29 2022, 4:24 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3184–3185

This has been removed in the earlier change that did the refactoring, some for the return type fix.

7903

Added a comment, thanks.

10498–10500

Thanks, formatting should be fixed.

10531

Thanks, added a comment.

Ayal added a comment.Oct 5 2022, 2:00 PM

Good improvement, adding various nits.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7881–7882

nit: and reduction?

7907

nit: none_of?

10501–10503

nit: can cast R into VPHeaderPHIRecipe and have it provide SetStartValue() to be used by all below.

10505–10528

early continue? Or have all cases issue a common R.setOperand(0, StartVal)?

10542

else it's an FOR and continue - they resume ok?

Rather than check and continue, better assert (IndPhi && "...") for inductions - whose start/resume value must always be fixed?

10547

Compare with how other "live-ins" are handled by creating empty VPValues registered with VPlan to be filled with Values created during VPlan::prepareToExecute(). Are these live-ins better handled here, inside LoopVectorizePass::processLoop() which is already overburdened with 400 LOC? Can leave behind a TODO...

llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
173

nit: why this redundant label change, here and additional multiple instances below?

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
12

Adding assertions to show we can vectorize main loop but not epilog?

89

worth retaining f3 testcase to demonstrate that widened/truncated inductions do get epilog vectorized, or becomes redundant?

fhahn updated this revision to Diff 466296.Oct 8 2022, 9:36 AM
fhahn marked 7 inline comments as done.

Address latest comments, thanks!

fhahn updated this revision to Diff 466299.Oct 8 2022, 9:44 AM
fhahn marked 3 inline comments as done.

Move test with truncated IV to optimal-epilog-vectorization.ll

fhahn marked an inline comment as done.Oct 8 2022, 9:45 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7881–7882

Added, thanks!

7907

Thanks, I wasn't aware of none_of. Updated!

10501–10503

Added the helper and use it below.

10505–10528

Restructured the code to have a single setStartValue below.

10542

The only allowed recipe to skip should be the canonical IV. Added an assert. If we have FORs here it would be a mis-compile.

10547

Added a todo, thanks! Not sure yet how to best move the code, as it requires access to ILV and prepareToExecute is defined in VPlan.cpp, so doesn't really have access to it at the moment unfortunately.

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
89

Sounds good, I moved it to optimal-epilog-vectorization.ll

Ayal added a comment.Oct 11 2022, 4:34 PM

Thanks for accommodating, this looks good to me, with a couple of final nits.
@rengolin, @bmahjour, @venkataramanan.kumar.llvm - any further comments or ok with you to accept?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10514

nit: may introduce getPHINode() here as well; and potentially a pure virtual VPWidenInductionRecipe base class to cover the common interface of it and getInductionDescriptor().

10521

nit: } else { ?

llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
173

ok, iter.check is an indicator that createEpilogueVectorizedLoopSkeleton() kicked in, sigh.

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
12

Clarify why assertions were added here now?

All my comments addressed, if other people are happy, I'm also happy. Thanks!

Just one question otherwise patch looks good to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10528

Just a question here we are creating induction resume values two times for the main vector loop. One here while setting up resume value for epilog vector loop. The second one while setting up resume value for the scalar loop. is it possible to reuse already created one?

If implementing the reuse is not worth the effort than creating a new one please ignore my comment.

fhahn updated this revision to Diff 467141.Oct 12 2022, 7:37 AM
fhahn marked an inline comment as done.

Rebase on top of additional tests in 26c8632f22a5f49f57410607476c874a45daf3a5 and also update comment in EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton.

fhahn added inline comments.Oct 12 2022, 7:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10528

Just a question here we are creating induction resume values two times for the main vector loop. One here while setting up resume value for epilog vector loop. The second one while setting up resume value for the scalar loop. is it possible to reuse already created one?

I think in general, the resume values could be different, e.g the edge main-vector-loop-successor -> scalar loop it would be based on (N - main vector TC), the edge epilogue-vector-loop-successor it would be based on (N - main vector TC - epilogue vector TC).

Is it possible that you were looking at the resume values in @test_widen_ptr_induction? I think those are a bit misleading, the resume values are the same as the epilogue vector loop never executes AFAICT. I added a few more interesting variants with runtime trip counts (e.g. @test_widen_induction).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10528

I am referring to induction end value computation in "InnerLoopVectorizer::createInductionResumeValue". It is called at this line (N - main vector TC).

Then again in CreateInductionResumeValues({VecEpilogueIterationCountCheck, EPI.VectorTripCount} /* AdditionalBypass */). in "EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton". We create again at Additional bypass block (N - main vector TC)

Example:

int a[10000],n;
void fn() {
for (int i=2;i<n;i++) {

a[i] = a[i] + i;

}
}

---Snip--
vector.ph: ; preds = %vector.main.loop.iter.check

%n.vec = and i64 %1, -64
**%ind.end = or i64 %n.vec, 2**

vec.epilog.iter.check: ; preds = %middle.block
%ind.end20 = or i64 %n.vec, 2
---Snip--

But I think creating two times should be Ok.

fhahn updated this revision to Diff 467475.Oct 13 2022, 8:03 AM

Rebased on top of additional test (518bccfd6e8b3aa3)

fhahn added inline comments.Oct 13 2022, 8:06 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10528

Oh right, thanks for sharing the case. I added a similar test case in 518bccfd6e8b.

I don't think it is worth caching those values for now, as it would require additional state which might add extra complexity and the extra instructions should be cleaned up by later passes. Going forward it might make sense to use SCEVExpander to create the end values, which would mean we would get re-use for free.