In D146199 we check whether the loop trip count is known to be a power of 2 to determine whether the tail loop can be eliminated.
However, the remainder loop of a masked scalable loop can also be removed
if we know the mask is always going to be true for every vector iteration.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5180
This looks like a regression to me. If we know the mask is always going to be true for every vector iteration, then we should be using unpredicated loops instead, even if the target supports tail-folding. That's because creating and maintaining the loop predicate is always going to be a bit more expensive than maintaining a simple integer induction variable. I believe gcc's code in https://github.com/llvm/llvm-project/issues/63616 is worse than clang's, although perhaps it's worth testing this on AArch64 hardware to confirm?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5180
Thanks for your idea. I tested their performance, and it doesn't look significantly different on an AArch64 target (with a scalable vector length of 256).
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
I'm guessing that InstCombine does not determine this is guaranteed to always be true? However, I thought that someone did work in the DAGCombiner that will replace this with br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]] when vscale is known to be a power of 2? Are you hoping to benefit from eliminating the scalar tail in IR because it helps us to make better decisions later in the pipeline? I can imagine it's beneficial for LTO, where the scalar tail could prevent inlining. If I remember correctly, one of the problems with folding away the icmp in InstCombine is that it doesn't have access to the TTI interface, so we cannot query the target.
llvm/test/Transforms/LoopVectorize/X86/constant-fold.ll:30
For all the fixed-length vector tests this icmp will get replaced with "i1 true" by InstCombine, so the scalar tail should get automatically deleted.
llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll:4
Hmm, I know this is not introduced by your patch, but I don't think we should have tests with target specifics in the top-level LoopVectorize directory.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
I may not have caught your idea. Are you saying that the current optimization needs to be handled in InstCombine?
llvm/test/Transforms/LoopVectorize/X86/constant-fold.ll:30
Yes, the remainder loop body will be deleted because the trip count is a multiple of VF, so it will never be executed.
llvm/test/Transforms/LoopVectorize/dont-fold-tail-for-divisible-TC.ll:4
Thanks.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
Well, I'm just trying to understand what this patch is trying to achieve, that's all. I'm not against it, because it does clean up the IR generated by the vectoriser. However, I'm not sure I expect you to see many real performance gains from doing this, because we should delete the scalar tail during codegen. It might also be worth investigating whether or not InstCombine already optimises the urem calculation, similar to what this patch (https://reviews.llvm.org/D129609) did in codegen.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
Yes, I think it's going to have benefits beyond code size in the following scenario.
BTW, the AArch64 backend already supports isVScaleKnownToBeAPowerOfTwo as of https://reviews.llvm.org/D146199.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
Sure, you're absolutely right about the benefit of deleting the tail! I'm just trying to say that we should be deleting the tail already as part of codegen. If you look at the final assembly output for the loop, I hope that there is no tail, because the DAGCombiner + MachineDCE should have killed it off. So this patch is then about trying to delete the tail *even earlier* in IR. It's a nice cleanup, but I was thinking it would be nice if InstCombine could do this for us, because that would be even more powerful.
There are already function attributes for specifying the minimum and maximum vscale vector lengths through vscale_range. I think this should be expanded to include whether vscale is known to be a power of two. That would allow InstCombine and value tracking to properly reason about it in other situations too, as well as to remove the dead tails in cases like this.
That doesn't mean this patch isn't a good idea too; we already know the tail is unnecessary in the vectorizer, and removing it earlier will have benefits.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
Do you know where to start in order to understand how the two passes mentioned above could possibly simplify the comparison result?
They would need to propagate the fact that vscale is a power of two within a given interval in order to simplify urem operations, similar to what is done by interval partition or scalar evolution. Do you think those two passes are powerful enough to do this kind of analysis?
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
*edit: I meant range analysis, not interval partition.
llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
The compare is inserted in LoopVectorize.cpp:3255.