This is an archive of the discontinued LLVM Phabricator instance.

[LV] Support widened induction variables in epilogue vectorization.
ClosedPublic

Authored by fhahn on Nov 25 2020, 1:23 PM.

Details

Summary

Code generation now uses the start VPValue of induction recipes.

This makes it possible to adjust the start value of the epilogue
vector loop to use the 'resume' value of the main vector loop.

Fixes #59459.

Diff Detail

Event Timeline

fhahn created this revision.Nov 25 2020, 1:23 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 25 2020, 1:23 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
fhahn requested review of this revision.Nov 25 2020, 1:23 PM
fhahn updated this revision to Diff 377997.Oct 7 2021, 1:47 PM

rebased now that all required dependencies landed a while ago.

fhahn retitled this revision from [LV] Support widened induction variables in epilogue vectorization (WIP). to [LV] Support widened induction variables in epilogue vectorization..Oct 7 2021, 1:53 PM
fhahn edited the summary of this revision. (Show Details)
fhahn added reviewers: bmahjour, Ayal, gilr, rengolin.

Instead of changing the generic interface for skeleton creation, how about adding a field to hold the resume value inside EpilogueVectorizerEpilogueLoop? Then inside executePlan() we would do something like this to update the induction widening recipes:

EpilogueVectorizerEpilogueLoop *EILV = dyn_cast<EpilogueVectorizerEpilogueLoop>(ILV);
if (EILV) {
  Value *IndStart = EILV->getResumeValue();
  assert(IndStart && "Expected valid resume value");
  ...
}

Also don't we need to store some sort of a map to be able to handle loops with multiple widened induction vars (with potentially different resume values)? I'm thinking of a case like this:

void foo(int * restrict A, int N) {
  int i, j, k;
  for (i = 0, j = 1, k = 2; i < N; i++, j++, k++)
    A[i] = j + k;
}
bmahjour added inline comments.Oct 7 2021, 3:04 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8217

Don't we need to make sure that it is an "external definition" before treating it as such?

Hi Florian,
--Snip--
void foo(double * restrict a, double * restrict b, int N) {
int k=0;
for(int i = 0 , k =2; i < N; ++i, ++k)

a[i] = sin(i)+k;

}
--Snip--

I am able to see the resume value for induction "i" updated properly now after the patch .

--Snip--
33: ; preds = %8, %28

%34 = phi i64 [ %11, %28 ], [ 0, %8 ]
%35 = and i64 %6, 4294967294
%36 = trunc i64 %35 to i32
%37 = add i32 %36, 2
%38 = insertelement <2 x i64> poison, i64 %34, i32 0
%39 = shufflevector <2 x i64> %38, <2 x i64> poison, <2 x i32> zeroinitializer
%40 = or <2 x i64> %39, <i64 0, i64 1>
%41 = trunc i64 %34 to i32
%42 = insertelement <2 x i32> poison, i32 %41, i32 0
%43 = shufflevector <2 x i32> %42, <2 x i32> poison, <2 x i32> zeroinitializer
%44 = or <2 x i32> %43, <i32 0, i32 1>
br label %45

45: ; preds = %45, %33

%46 = phi i64 [ %34, %33 ], [ %55, %45 ]
%47 = phi <2 x i64> [ %40, %33 ], [ %56, %45 ]   <== updated
%48 = phi <2 x i32> [ %44, %33 ], [ %57, %45 ] <== updated

--Snip--

I was thinking if we could reuse the incremented widened induction results from the main vector loop. But this seems cleaner and simpler code .

Hi Florian,

This patch breaks one assert when building compiler-rt. However I noticed that in my work space I lowered the epilog generation thresholds to 8 so got this assert.

Here is the general reproducer.

---Snip winden-ce.cpp ---
unsigned short int * MemToShadow(unsigned int x);

void Add( unsigned int begin,unsigned int end, unsigned int cfi_check) {

// Don't fill anything below cfi_check. We can not represent those addresses
// in the shadow, and must make sure at codegen to place all valid call
// targets above cfi_check.
//begin = Max(begin, cfi_check);
unsigned short int *s = MemToShadow(begin);
unsigned short int *s_end = MemToShadow(end - 1) + 1;
unsigned short int  sv = ((begin - cfi_check)) + 1;
for (; s < s_end; s++, sv++)
  *s = sv;

}
---Snip--

clang++ -O3 winden-ce.cpp -S -emit-llvm -mllvm -epilogue-vectorization-minimum-VF=8

LoopVectorize.cpp:2535: virtual llvm::Value* llvm::InnerLoopVectorizer::getStepVector(llvm::Value*, int, llvm::Value*, llvm::Instruction::BinaryOps): Assertion `Step->getType() == STy && "Step has wrong type"' failed.

I noticed Main loop Vector splat created in the preheader is with i16 type (unsigned short *)
%.splat = shufflevector <8 x i16> %.splatinsert, <8 x i16> poison, <8 x i32> zeroinitializer

whereas Epilog loop creates vector splat with i32 type.
%.splat31 = shufflevector <4 x i64> %.splatinsert30, <4 x i64> poison, <4 x i32> zeroinitializer

fhahn planned changes to this revision.Oct 8 2021, 11:57 AM

Thanks for taking a look and giving it a try! I now remember what was still missing, handling noprimiary IV cases. I'll work to update the patch, but I think it might need a few more patches to bring other parts of the infrastructure up to speedd.

fhahn updated this revision to Diff 461316.Sep 19 2022, 12:42 PM

I reworked the patch to use a createInductionResumeValue to create the required resume values as needed.

Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2022, 12:42 PM
rengolin added inline comments.Sep 28 2022, 6:17 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3610–3611

unused?

8552

A comment here for the semantics of this would be nice

10440–10441

80 columns?

10442

This code is hard to follow, perhaps a comment?

Hi

I tried to build the patch on top of the latest trunk and facing this issue.

/home/amd/venkat/aocc-work/mirror/llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10505:24: error: ‘class llvm::VPWidenPointerInductionRecipe’ has no member named ‘getInductionDescriptor’
10505 | ID = &Ind->getInductionDescriptor();

Any other patch I am missing to add other than this one and its parent ?

fhahn updated this revision to Diff 463845.Sep 29 2022, 4:21 AM

Thanks for taking a look!

Comments should be addressed, also moved getInductionDescriptor definition from separate patch to this one.

fhahn marked 4 inline comments as done.Sep 29 2022, 4:24 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3610–3611

This has been removed in the earlier change that did the refactoring, some for the return type fix.

8552

Added a comment, thanks.

10440–10441

Thanks, formatting should be fixed.

10442

Thanks, added a comment.

Ayal added a comment.Oct 5 2022, 2:00 PM

Good improvement, adding various nits.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8543

nit: and reduction?

8556

nit: none_of?

10440–10441

early continue? Or have all cases issue a common R.setOperand(0, StartVal)?

10440–10441

nit: can cast R into VPHeaderPHIRecipe and have it provide SetStartValue() to be used by all below.

10453

else it's an FOR and continue - they resume ok?

Rather than check and continue, better assert (IndPhi && "...") for inductions - whose start/resume value must always be fixed?

10458

Compare with how other "live-ins" are handled by creating empty VPValues registered with VPlan to be filled with Values created during VPlan::prepareToExecute(). Are these live-ins better handled here, inside LoopVectorizePass::processLoop() which is already overburdened with 400 LOC? Can leave behind a TODO...

llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
233

nit: why this redundant label change, here and additional multiple instances below?

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
44

Adding assertions to show we can vectorize main loop but not epilog?

71

worth retaining f3 testcase to demonstrate that widened/truncated inductions do get epilog vectorized, or becomes redundant?

fhahn updated this revision to Diff 466296.Oct 8 2022, 9:36 AM
fhahn marked 7 inline comments as done.

Address latest comments, thanks!

fhahn updated this revision to Diff 466299.Oct 8 2022, 9:44 AM
fhahn marked 3 inline comments as done.

Move test with truncated IV to optimal-epilog-vectorization.ll

fhahn marked an inline comment as done.Oct 8 2022, 9:45 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8543

Added, thanks!

8556

Thanks, I wasn't aware of none_of. Updated!

10440–10441

Restructured the code to have a single setStartValue below.

10440–10441

Added the helper and use it below.

10453

The only allowed recipe to skip should be the canonical IV. Added an assert. If we have FORs here it would be a mis-compile.

10458

Added a todo, thanks! Not sure yet how to best move the code, as it requires access to ILV and prepareToExecute is defined in VPlan.cpp, so doesn't really have access to it at the moment unfortunately.

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
71

Sounds good, I moved it to optimal-epilog-vectorization.ll

Ayal added a comment.Oct 11 2022, 4:34 PM

Thanks for accommodating, this looks good to me, with a couple of final nits.
@rengolin, @bmahjour, @venkataramanan.kumar.llvm - any further comments or ok with you to accept?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10447

nit: may introduce getPHINode() here as well; and potentially a pure virtual VPWidenInductionRecipe base class to cover the common interface of it and getInductionDescriptor().

10454

nit: } else { ?

llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
233

ok, iter.check is an indicator that createEpilogueVectorizedLoopSkeleton() kicked in, sigh.

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
44

Clarify why assertions were added here now?

All my comments addressed, if other people are happy, I'm also happy. Thanks!

Just one question otherwise patch looks good to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10461

Just a question here we are creating induction resume values two times for the main vector loop. One here while setting up resume value for epilog vector loop. The second one while setting up resume value for the scalar loop. is it possible to reuse already created one?

If implementing the reuse is not worth the effort than creating a new one please ignore my comment.

fhahn updated this revision to Diff 467141.Oct 12 2022, 7:37 AM
fhahn marked an inline comment as done.

Rebase on top of additional tests in 26c8632f22a5f49f57410607476c874a45daf3a5 and also update comment in EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton.

fhahn added inline comments.Oct 12 2022, 7:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10461

Just a question here we are creating induction resume values two times for the main vector loop. One here while setting up resume value for epilog vector loop. The second one while setting up resume value for the scalar loop. is it possible to reuse already created one?

I think in general, the resume values could be different, e.g the edge main-vector-loop-successor -> scalar loop it would be based on (N - main vector TC), the edge epilogue-vector-loop-successor it would be based on (N - main vector TC - epilogue vector TC).

Is it possible that you were looking at the resume values in @test_widen_ptr_induction? I think those are a bit misleading, the resume values are the same as the epilogue vector loop never executes AFAICT. I added a few more interesting variants with runtime trip counts (e.g. @test_widen_induction).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10461

I am referring to induction end value computation in "InnerLoopVectorizer::createInductionResumeValue". It is called at this line (N - main vector TC).

Then again in CreateInductionResumeValues({VecEpilogueIterationCountCheck, EPI.VectorTripCount} /* AdditionalBypass */). in "EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton". We create again at Additional bypass block (N - main vector TC)

Example:

int a[10000],n;
void fn() {
for (int i=2;i<n;i++) {

a[i] = a[i] + i;

}
}

---Snip--
vector.ph: ; preds = %vector.main.loop.iter.check

%n.vec = and i64 %1, -64
**%ind.end = or i64 %n.vec, 2**

vec.epilog.iter.check: ; preds = %middle.block
%ind.end20 = or i64 %n.vec, 2
---Snip--

But I think creating two times should be Ok.

fhahn updated this revision to Diff 467475.Oct 13 2022, 8:03 AM

Rebased on top of additional test (518bccfd6e8b3aa3)

fhahn added inline comments.Oct 13 2022, 8:06 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10461

Oh right, thanks for sharing the case. I added a similar test case in 518bccfd6e8b.

I don't think it is worth caching those values for now, as it would require additional state which might add extra complexity and the extra instructions should be cleaned up by later passes. Going forward it might make sense to use SCEVExpander to create the end values, which would mean we would get re-use for free.

fhahn updated this revision to Diff 483814.Dec 18 2022, 6:09 AM

Rebase & ping. This also fixes another issue: https://github.com/llvm/llvm-project/issues/59459

fhahn edited the summary of this revision. (Show Details)Dec 18 2022, 6:09 AM
Ayal accepted this revision.Dec 20 2022, 12:43 PM

Ship it!
This looks even better with the added tests, and the last issue @venkataramanan.kumar.llvm raised regarding replicated end values seems acceptable.
SCEV expanding the end values is worth a TODO? Added a couple of thoughts that can be addressed separately.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8368–8370

Should this caching of trip count be taken are of by executePlan() prior to creating the skeleton, as raised in optimizeForVFAndUF()? (Independent of this patch.)

10454

Or early continue, along with above nit:

if (isa<VPCanonicalIVPHIRecipe>(&R))
  continue;
if (auto *ReductionPhi = dyn_cast<VPReductionPHIRecipe>(&R))
  ResumeV = MainILV.getReductionResumeValue(ReductionPhi->getRecurrenceDescriptor());
else {
  auto *Ind = cast<VPWidenInductionRecipe>(&R);
  PHINode *IndPhi = Ind->getPHINode();
  const InductionDescriptor *ID = &Ind->getInductionDescriptor();
  ResumeV = MainILV.createInductionResumeValue(IndPhi, *ID, {EPI.MainLoopIterationCountCheck});
}
llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
44

assertions gone...

This revision is now accepted and ready to land.Dec 20 2022, 12:43 PM
This revision was landed with ongoing or failed builds.Dec 21 2022, 5:59 AM
This revision was automatically updated to reflect the committed changes.
fhahn marked 3 inline comments as done.Dec 21 2022, 1:41 PM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8368–8370

Yes that would reduce duplication. Done in 7d8528dbf290

10454

Added the early exit at the top of the loop, thanks!

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
44

I think the REQUIRES: asserts is needed because the test only checks debug output.