This is an archive of the discontinued LLVM Phabricator instance.

[LV] Remove the remainder loop if we know the mask is always true
Abandoned · Public

Authored by Allen on Jul 2 2023, 7:40 PM.

Details

Summary

In D146199 we check whether the loop trip count is known to be a power of 2
to determine whether the tail loop can be eliminated.
However, the remainder loop of a masked scalable loop can also be removed
if we know the mask is always going to be true for every vector iteration.

Fix https://github.com/llvm/llvm-project/issues/63616.
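
For illustration, a minimal IR sketch of the guard this is about (the names, the VF of vscale x 4, and the elided loops are invented for this example, not taken from the patch): when %N is known to be a multiple of the vectorization factor, %cmp.n is always true and the scalar remainder behind %scalar.ph is dead.

; Hypothetical sketch of the check the vectorizer emits between the vector
; loop and the scalar remainder loop.
define void @remainder_guard_sketch(i64 %N) {
entry:
  %vscale = call i64 @llvm.vscale.i64()
  %vf = shl i64 %vscale, 2              ; VF = vscale x 4
  %n.mod.vf = urem i64 %N, %vf
  %n.vec = sub i64 %N, %n.mod.vf        ; iterations handled by the vector loop
  br label %middle.block

middle.block:
  %cmp.n = icmp eq i64 %N, %n.vec       ; true iff there is no remainder
  br i1 %cmp.n, label %exit, label %scalar.ph

scalar.ph:                              ; entry to the scalar remainder loop
  br label %exit

exit:
  ret void
}

declare i64 @llvm.vscale.i64()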

Diff Detail

Event Timeline

Allen created this revision. · Jul 2 2023, 7:40 PM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 2 2023, 7:40 PM
Allen requested review of this revision. · Jul 2 2023, 7:40 PM
david-arm added inline comments. · Jul 3 2023, 1:34 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5180

I think this looks like a regression to me. If we know the mask is always going to be true for every vector iteration, then we should be using unpredicated loops instead even if the target supports tail-folding. That's because creating and maintaining the loop predicate is always going to be a bit more expensive than maintaining a simple integer induction variable.
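
For concreteness, a minimal IR sketch of the loop-control overhead being compared (names, the VF of vscale x 4, and the elided body are invented for this example):

; Hypothetical loop skeleton: tail-folded control recomputes a lane mask every
; iteration, unpredicated control only steps an induction variable.
define void @loop_control(i64 %N) {
entry:
  %vscale = call i64 @llvm.vscale.i64()
  %step = shl i64 %vscale, 2            ; VF = vscale x 4
  br label %vector.body

vector.body:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %vector.body ]
  ; predicated control: one extra intrinsic per iteration (body elided)
  %mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %iv, i64 %N)
  ; unpredicated control is just the induction update and compare below
  %iv.next = add i64 %iv, %step
  %done = icmp uge i64 %iv.next, %N
  br i1 %done, label %exit, label %vector.body

exit:
  ret void
}

declare i64 @llvm.vscale.i64()
declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)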

I believe gcc's code in https://github.com/llvm/llvm-project/issues/63616 is worse than clang's, although perhaps it's worth testing this on AArch64 hardware to confirm?

Allen added inline comments. · Jul 5 2023, 6:48 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5180

Thanks for your idea.

I tested their performance, and it doesn't look significantly different on the AArch64 target (the scalable vector length is 256).
But on the x86 target it seems clang is better (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411), so I agree with you, this is not a problem.

Allen updated this revision to Diff 537578. · Jul 5 2023, 9:47 PM
Allen retitled this revision from "[LV] Prefer the tail fold according to the user hint" to "Remove the remainder loop if we know the mask is always true".
Allen edited the summary of this revision. (Show Details)
Allen retitled this revision from "Remove the remainder loop if we know the mask is always true" to "[LV] Remove the remainder loop if we know the mask is always true".
david-arm added inline comments. · Jul 6 2023, 1:39 AM
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

I'm guessing that InstCombine does not determine this is guaranteed to always be true? However, I thought that someone did work in the DAGCombiner that will replace this with

br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]

when vscale is known to be a power of 2? Are you hoping to benefit from eliminating the scalar tail in IR because it helps us to make better decisions later in the pipeline? I can imagine it's beneficial for LTO where the scalar tail could prevent inlining.

If I remember correctly one of the problems with folding away the icmp in InstCombine is that it doesn't have access to the TTI interface so we cannot query the target.

llvm/test/Transforms/LoopVectorize/X86/constant-fold.ll
30

For all the fixed-length vector tests this icmp will get replaced with "i1 true" by InstCombine, so the scalar tail should get automatically deleted.
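
For illustration (hypothetical constants, not from the test): with a fixed VF every operand of the guard is a constant, so it constant-folds directly.

define i1 @fixed_vf_guard() {
  %rem = urem i64 1024, 8           ; folds to 0
  %cmp = icmp eq i64 %rem, 0        ; folds to true
  ret i1 %cmp
}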

llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll
4

Hmm, I know this is not introduced by your patch, but I don't think we should have target-specific tests in the top-level LoopVectorize directory.

Allen added inline comments. · Jul 6 2023, 1:51 AM
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

I may not have fully caught your idea: are you saying that the current optimization should be handled in a combine pass (InstCombine/DAGCombiner) instead?

Allen updated this revision to Diff 537629. · Jul 6 2023, 2:06 AM

Fix Transforms/LoopVectorize/RISCV/short-trip-count.ll

Allen added inline comments. · Jul 7 2023, 12:59 AM
llvm/test/Transforms/LoopVectorize/X86/constant-fold.ll
30

Yes, the remainder loop body will be deleted because the trip count is a multiple of the VF, so it will never be executed.

llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll
4

Thanks.
Do you have any plans to move them? Or where would you prefer me to put them?

david-arm added inline comments. · Jul 7 2023, 5:00 AM
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

Well, I'm just trying to understand what this patch is trying to achieve, that's all. I'm not against it, because it does clean up the IR generated by the vectoriser. However, I'm not sure I expect you to see many real performance gains from doing this, because we should delete the scalar tail during codegen.

It might also be worth investigating whether or not InstCombine already optimises the urem calculation, similar to what this patch (https://reviews.llvm.org/D129609) did in codegen.
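
For concreteness, a small sketch of the scalable version of that urem (hypothetical constants; assuming vscale_range(1,16) and a power-of-two vscale, as the AArch64 backend reports via isVScaleKnownToBeAPowerOfTwo): 1024 is then always a multiple of vscale x 4, so the whole guard could fold to true.

define i1 @scalable_vf_guard() vscale_range(1,16) {
  %vscale = call i64 @llvm.vscale.i64()
  %vf = shl i64 %vscale, 2          ; VF = vscale x 4, at most 64 here
  %rem = urem i64 1024, %vf
  %cmp = icmp eq i64 %rem, 0
  ret i1 %cmp
}

declare i64 @llvm.vscale.i64()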

Allen added inline comments. · Jul 8 2023, 1:41 AM
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

Yes, I think it's going to have benefits besides code size in the following scenario.

  1. When the loop is a nested loop and the trip count of the inner loop is a power of 2, eliminating the tail block reduces branch jumps.

BTW, the AArch64 backend already supports isVScaleKnownToBeAPowerOfTwo as of patch https://reviews.llvm.org/D146199.

david-arm added inline comments. · Jul 10 2023, 1:19 AM
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

Sure, you're absolutely right about the benefit of deleting the tail! I'm just trying to say that we should be deleting the tail already as part of codegen. If you look at the final assembly output for the loop, I hope that there is no tail, because the DAGCombiner + MachineDCE should have killed it off.

So this patch is then about trying to delete the tail *even earlier* in IR. It's a nice cleanup, but I was thinking it would be nice if InstCombine could do this for us because that would be even more powerful.

There are already function attributes for specifying the minimum and maximum vscale vector lengths through vscale_range. I think it should be expanded to include whether vscale is known to be a power of two. That would allow InstCombine and value tracking to properly reason about it in other situations too, as well as removing the dead tails in cases like this.
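
For reference, a sketch of what exists today (hypothetical function): the attribute only carries the [min, max] bounds for vscale.

define void @f() vscale_range(1,16) {
  ret void
}
; Whether vscale is a power of two is currently only exposed through TTI
; (isVScaleKnownToBeAPowerOfTwo), which InstCombine cannot query; encoding it
; alongside vscale_range as suggested above would be an extension.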

That doesn't mean this patch is not a good idea too, as we already know the tail is unnecessary in the vectorizer and removing it earlier will have benefits.

v01dXYZ added inline comments.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

Do you know where to start in order to understand how the following two passes could possibly simplify the comparison result?

  • InstCombine
  • DAGCombiner

They would need to propagate the fact that vscale is a power of two within a given interval in order to simplify urem operations, similar to what is done by interval partition or scalar evolution. Do you think those two passes are powerful enough to do this kind of analysis?

v01dXYZ added inline comments. · Jul 12 2023, 2:42 AM
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

*edit: I meant range analysis, not interval partition.

Allen planned changes to this revision. · Jul 12 2023, 3:59 AM
Allen added inline comments.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
34

The compare is inserted at LoopVectorize.cpp:3255.
Whether vscale is a power of two is discussed in https://reviews.llvm.org/D154953.

Allen abandoned this revision. · Jul 17 2023, 5:36 AM