When a scalar epilogue is required, at least one iteration of the scalar loop
has to execute. Adjust ConstTripCount accordingly to avoid picking a max VF
that results in a dead vector loop.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll | ||
---|---|---|
7–9 | Comment above deserves updating? It refers to skipping over the main loop vectorized to VF=32 and jumping to the epilog loop vectorized to VF=8 (with original trip count of 32 and requires scalar epilog), whereas now the main loop is vectorized to VF=8 and the epilog loop is not vectorized at all. |
Rebase on top of e561edaaa56c9a8818d546774b141dead7224b50 which updates the test in pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll so it keeps testing for the original issue.
llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll | ||
---|---|---|
7–9 | Thanks, I moved the test to limit-vf-by-tripcount.ll in e561edaaa56c9a8818d546774b141dead7224b50 and update the trip counts in the test so that there's an iteration of the main loop and epilogue loop, so it keeps testing for the original issue. |
Ah, the original issue of pr56319 was to ensure that the epilog loop leaves at-least one iteration for the scalar loop (or actually epilogVF iterations) rather than vectorizing all remaining iterations? An odd trip count of 39 will always end up with a last scalar iteration, due to main and epilog VF's being even. Perhaps a trip count of 48 with (forced?) main-loop VF of 32 and epilog-loop VF of 8 should ensure the epilog loop runs only once.
llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll | ||
---|---|---|
285–286 | nit: name of test is a bit misleading, the VF of the main loop is limited to avoid excessive values rather than avoiding epilog vectorization. |
The original issue was that we simplified the branch in the main vector loop to exit in the first iteration (VF == vector TC) and because epilogue and main vector loops share the same VPlan, it was also simplified for the epilogue vector loop, but there where more than 1 vector iteration (fix was https://github.com/llvm/llvm-project/commit/0dddf04caba55a64f8534518d65311bdac05cf39).
Now, the test still checks that we don't simplify neither the main nor epilogue vector loop branch. But in this case, we could simplify both, because the epilogue vector loop only executes a single iteration; we should also have a variant where the epilogue vector loops executes multiple times (e.g. main VF=32, epilogue VF=8, TC = 49), but unfortunately my attempts so far weren't successful, because there's some code in isMoreProfitable that estimates the cost of the vector loop + scalar tail for the given VFs, which ignores possible epilogue vectorization( https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L5384). For all cases I tried, this leads to a small main VF ends up being chose (8 in that case) if there would be more than a single iteration of the epilogue vector loop.
llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll | ||
---|---|---|
285–286 | adjusted the name, thanks! |
Ah, ok, that optimization was later outlined by having LVP::execute_plan() do if (!IsEpilogueVectorization) VPlanTransforms::optimizeForVFAndUF()).
(And may explain the observed clean-up behavior observed in https://reviews.llvm.org/D154264#inline-1491906 :-)
The explanation for test @pr56319 may be misleading, though, as it specifically deals with having the epilog loop vectorization consider requiresScalarEpilog:
; Test case where the exit condition in the main vector loop can be optimized
; to true, but not in the epilogue vector loop. In the test the interleave
; group requires to execute at least one scalar iteration, meaning the last
; vector iteration of the epilogue vector loop cannot be executed.
Could a test with trip count of, say, 34, work - be vectorized with main VF=32 and epilog VF=2 w/o requiresScalarEpilog - when last member of the interleave-group is accessed, while with requiresScalarEpilog main VF=32 will continue but epilog vectorization will be abandoned?
Agreed, I think the interleave group requiring one scalar iteration is a red herring with respect to the orinal issue. Tried to clarify it in this patch.
Could a test with trip count of, say, 34, work - be vectorized with main VF=32 and epilog VF=2 w/o requiresScalarEpilog - when last member of the interleave-group is accessed, while with requiresScalarEpilog main VF=32 will continue but epilog vectorization will be abandoned?
With 34, we could simplify the branch in the epilogue vector loop, as the epilogue vector loop would be dead (or shouldn't get generated, as in follow-on patches). I *think* TC = 37 with forced epilogue VF=2 should work (single main loop iteration with VF=32, 2 iterations of the epilogue vector loop (so branch cannot be simplified), final scalar iteration. Updated accoringly
Ok, very well, then better drop the last requireScalarEpilog part from the explanation of the test?
llvm/test/Transforms/LoopVectorize/X86/pr56319-vector-exit-cond-optimization-epilogue-vectorization.ll | ||
---|---|---|
9–10 | The scalar loop will get to run the final iteration in any case, due to the trip-count being odd, regardless of any interleave group requiring it (aka red herring). | |
29 | (Just noting that we're refraining from optimizing this single iteration main loop with an unconditional branch exiting to middle block, because that would also cause the double iteration epilog loop to bail out after its first iteration.) |
nit: name of test is a bit misleading, the VF of the main loop is limited to avoid excessive values rather than avoiding epilog vectorization.