This is an archive of the discontinued LLVM Phabricator instance.

[LV] Handle Fold-Tail of loops with vectorizarion factor (VF) equal to 1
ClosedPublic

Authored by anhtuyen on May 14 2020, 5:44 PM.

Details

Summary

When handling loops whose VF is 1, fold-tail vectorization sets the backedge taken count of the original loop with a vector of a single element. This approach will cause type-mismatch during instruction generartion, as reported in https://reviews.llvm.org/D78847#2029339

Based on discussion with Ayal @Ayal , I provide this patch to address the case of VF==1.

Diff Detail

Event Timeline

anhtuyen created this revision.May 14 2020, 5:44 PM
Ayal added a comment.May 15 2020, 12:14 AM

Thanks for taking care of this, only a couple of nits.

llvm/lib/Transforms/Vectorize/VPlan.cpp
444–446

nit: using auto VF = State->VF will help shorten this to 2 lines.

820

nit: can move VF = State.VF to be reused above and below.

822–824

nit: ConstantInt::get(STy, Part) was already pushed inside Indices, can retrieve it e.g. via Indices.back().

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
3 ↗(On Diff #264130)

Would be good to explain what this test is for; e.g., that fold-tail produces correct scalar code when LV is only unrolling w/o vectorizing.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-vector.ll
3 ↗(On Diff #264130)

Better have both tests in the same file, briefly explain what this tests - possibly using an informative function name rather than "foo".

bmahjour added inline comments.
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
17 ↗(On Diff #264130)

How is it that the original loop executes 15 iterations, but the vector loop iterates 16? It seems the minimum iteration count check branch at the top should branch to the scalar loop instead of vector.ph.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-vector.ll
37 ↗(On Diff #264130)

Can the undef values be replaced with some valid values (such as an incoming parameter)?

Ayal added inline comments.May 15 2020, 1:41 PM
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
15 ↗(On Diff #264130)

continue CHECK-NEXTing the scalar icmp ule's

17 ↗(On Diff #264130)

(Thanks for asking, reminded to check above and below that fold-tail emits the desired scalar icmp ule's, which are the focus of this patch.)

Fold-tail is responsible for rounding-up the trip count from 15 to 16, see https://reviews.llvm.org/D50480.
Regarding minimum iteration count check branch, fold-tail is also responsible in general for branching directly to vector.ph w/o an "if (trip-count < VF*UF)", which in this case is known to be false anyhow.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-vector.ll
23 ↗(On Diff #264130)

continue CHECK-NEXTing the scalar icmp ule's

anhtuyen marked 9 inline comments as done.May 15 2020, 4:04 PM
anhtuyen added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.cpp
444–446

I will update it shortly.

820

I will update it shortly.

822–824

I will update it shortly.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
3 ↗(On Diff #264130)

I will update it shortly.

15 ↗(On Diff #264130)

Good catch! I will add them.

17 ↗(On Diff #264130)

Ayal @Ayal , thank you very much for your help to clarify. My guess is that Bardia @bmahjour might have been concerned whether by any chance the effective addition of the 16th iteration would affect the correctness of the generated code. Because I have neither evidence nor counterevidence to address his concern, if you can shed some light on it when you have some time, that will be great.

Back to this patch but given the fact that the value of the trip-count is not the main focus hereof, I take the liberty of omitting its value from the checks. If that is not acceptable to any of us here, please let me know.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-vector.ll
3 ↗(On Diff #264130)

I will update it shortly.

23 ↗(On Diff #264130)

I will update it shortly.

37 ↗(On Diff #264130)

I will update it shortly.

bmahjour added inline comments.May 15 2020, 4:43 PM
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
17 ↗(On Diff #264130)

I guess I would understand how the rounding-up would work, if the instructions in the body were somehow predicated, but I don't see any predication in the output IR. Is that because there are no instructions in this test case with side-effects, read/writes, etc?

anhtuyen updated this revision to Diff 264443.May 16 2020, 12:59 PM

This patch includes all the changes recommended by the reviewers - thank you very much!
I also added a new LIT test, which uses the ORE to confirm which pass has the impact.
In terms of testing, I have run a number of test-cases with it, and all were successful.

fhahn added inline comments.May 16 2020, 2:15 PM
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
17 ↗(On Diff #264130)

yes, this would probably need some actual instructions in the loop body, so there is something to predicate. @anhtuyen could you add a small vectorizable body, e.g. just storing to ptr+induction?

fhahn added a comment.May 16 2020, 2:44 PM

The issue has also been reported as https://bugs.llvm.org/show_bug.cgi?id=45943

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
112

metadata here and below unused?

anhtuyen marked 2 inline comments as done.May 16 2020, 4:14 PM
anhtuyen added inline comments.
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
17 ↗(On Diff #264130)

I will add it later tonight.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
112

Hi Florian @fhahn, The short answer is yes, we need it.
As you can see on https://reviews.llvm.org/D78847, it is simple to create a minimal reproducer asserting when going through function VPlan::execute(). It is, however, a bit tricky to craft a minimal LIT test which exhibits the same problem when going through the other function VPWidenCanonicalIVRecipe::execute(). I came up with an idea to use profile data for the function online 57

define void @vectorize-factor-1-vector-bound(double* %pt1) !prof !12 {

to guide it through VPWidenCanonicalIVRecipe::execute(). Although I made the data up. I am sure almost any meaningful values will serve the purpose.

The issue has also been reported as https://bugs.llvm.org/show_bug.cgi?id=45943

Thank you @fhahn for linking the bug report to this patch.
I have requested a password reset from Bugzilla since earlier last week to open a bug report, but I still have not received any thing from the server.

anhtuyen marked an inline comment as done.May 16 2020, 8:51 PM
anhtuyen added inline comments.
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
112

I have just found simple way to alter the 2nd function in this LIT test so that it still goes through VPWidenCanonicalIVRecipe::execute() even without the meta data. The new test-case is also clearer. I will upload the new patch shortly.

anhtuyen updated this revision to Diff 264468.May 16 2020, 9:10 PM

Similar to the 2nd patch earlier, this patch includes all the changes recommended by the reviewers.
Also, I

  • modified the 2nd function of the LIT test to make it clearer and simpler, and
  • added ORE to confirm which pass has/does not have the impact.
anhtuyen marked an inline comment as done.May 16 2020, 10:29 PM
anhtuyen added inline comments.
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
17 ↗(On Diff #264130)

Hi Florian @fhahn, This is testcase, which I modified along the way you had suggested. It has a trip count of 15. However, I cannot make it go through Fold-Tail.

define void @foo(double* %paramP) {
entry:
  %localP = getelementptr inbounds double, double* %paramP, i64 15
  br label %for.body

for.cond.cleanup:
  ret void

for.body:
  %addr = phi double* [ %indP, %for.body ], [ %paramP, %entry ]
  %indP = getelementptr inbounds double, double* %addr, i64 1
  %cond = icmp eq double* %indP, %localP
  store double 3.14, double* %indP              ;<======== I added this line
  br i1 %cond, label %for.cond.cleanup, label %for.body
}

Do you have some hints, which I could use ?

Ayal added inline comments.May 17 2020, 8:53 AM
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
17 ↗(On Diff #264130)

However, I cannot make it go through Fold-Tail.

That is due to bug in Fold-Tail, namely PR45679, which should be fixed by D80085. Specifically, using only type double on default target leads to internally computed MaxVF=1.
The test added in D80085, pr45679-fold-tail-by-masking.ll, should be extended by this patch to also handle VF=1, UF=4 case.

anhtuyen marked an inline comment as done.May 17 2020, 4:49 PM
anhtuyen added inline comments.
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll
17 ↗(On Diff #264130)

Ah, thank you! Please let me know if there any thing else you like me to further in this patch. Monday May 18 will be a statutory holiday in Canada, and I will be away, but I will respond the following day.

Hello,
I write to ask if there is any thing else in this patch you like me to address. If not, please see if you could approve it. A number of testcases I am running are affected and it will be great if I can get them pass with this patch. Thank you very much!

Ayal accepted this revision.May 19 2020, 1:55 PM

This looks good to me, thanks! Please wait a day or so if @fhahn has further comments.

As noted, it would now be good to also extend pr45679-fold-tail-by-masking.ll with an additional RUN of -force-vector-width=1 and -force-vector-interleave=4.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
108

Interesting; this test w/o a primary IV seems to work ok w/o the patch; comparing pairs of vectors of size 1...

This revision is now accepted and ready to land.May 19 2020, 1:55 PM
anhtuyen marked an inline comment as done.May 19 2020, 9:01 PM

This looks good to me, thanks! Please wait a day or so if @fhahn has further comments.

As noted, it would now be good to also extend pr45679-fold-tail-by-masking.ll with an additional RUN of -force-vector-width=1 and -force-vector-interleave=4.

It is very kind of you, Ayal @Ayal , and I thank you very much. As you have mentioned, I surely will wait a day or two to see whether Florian and Bardia might have further comments.
About pr45679-fold-tail-by-masking.ll, do you want me to add it for completeness?
With VF=1, the testcase does not go through the code changed by this patch (b/c its BackedgeTakenCount is null).

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
108

Yes, this test should pass without this patch b/c it compares a pair of vectors size 1. When we decided to replace VTCMO with TCMO when State->VF == 1, we also updated VPWidenCanonicalIVRecipe::execute() , and it now compares two scalars. Did I mis-understand your above comment about its interesting-ness?

This looks good to me, thanks! Please wait a day or so if @fhahn has further comments.

As noted, it would now be good to also extend pr45679-fold-tail-by-masking.ll with an additional RUN of -force-vector-width=1 and -force-vector-interleave=4.

It is very kind of you, Ayal @Ayal , and I thank you very much. As you have mentioned, I surely will wait a day or two to see whether Florian and Bardia might have further comments.
About pr45679-fold-tail-by-masking.ll, do you want me to add it for completeness?
With VF=1, the testcase does not go through the code changed by this patch (b/c its BackedgeTakenCount is null).

Oh, I think I know what you meant now: adding the test pr45679-fold-tail-by-masking.ll AFTER applying your patch https://reviews.llvm.org/D80085. I will do.

anhtuyen updated this revision to Diff 265136.May 19 2020, 10:23 PM

Extend pr45679-fold-tail-by-masking.ll with an additional RUN of -force-vector-width=1 and -force-vector-interleave=4 from https://reviews.llvm.org/D80085.

Ayal added a comment.May 20 2020, 3:27 AM

This looks good to me, thanks! Please wait a day or so if @fhahn has further comments.

As noted, it would now be good to also extend pr45679-fold-tail-by-masking.ll with an additional RUN of -force-vector-width=1 and -force-vector-interleave=4.

It is very kind of you, Ayal @Ayal , and I thank you very much. As you have mentioned, I surely will wait a day or two to see whether Florian and Bardia might have further comments.
About pr45679-fold-tail-by-masking.ll, do you want me to add it for completeness?
With VF=1, the testcase does not go through the code changed by this patch (b/c its BackedgeTakenCount is null).

Oh, I think I know what you meant now: adding the test pr45679-fold-tail-by-masking.ll AFTER applying your patch https://reviews.llvm.org/D80085. I will do.

Yes, thanks.
Sorry for the confusion; before D80085 this test generated wrong code with VF=1, afterwards the compiler crashes, with this patch it generates correct code.

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
108

Right. Just noted that the original motivation was to fix crashing cases, and this test shows it also helps to improve correct code cases.

This revision was automatically updated to reflect the committed changes.
Ayal added inline comments.May 22 2020, 7:28 AM
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
114

Ah, suffice to add to pr45679-fold-tail-by-masking.ll a RUN of -force-vector-width=1 -force-vector-interleave=4 along with its CHECK-VF1's, instead of replicating test @pr45679 here. Note that all (3) tests here will be compiled by all (3) RUNs, some w/o any corresponding CHECK's.

anhtuyen marked an inline comment as done.May 22 2020, 9:18 AM
anhtuyen added inline comments.
llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll
114

Thanks Ayal @Ayal for pointing this out. I will move it with patch https://reviews.llvm.org/D80446
I thought the test was to address Bardia's and Florian's comments about needing "some actual instructions in the loop body, so there is something to predicate", so I added it to the test in this patch. I then only CHECK-ed for details of interest and thus omitted those covered by the other CHECKs.