For SVE, when scalarising the PHI instruction the whole vector part is
generated as opposed to creating instructions for each lane for fixed-
width vectors. However, in some cases the lane values may be needed
later (e.g for a load instruction) so we still need to calculate
these values to avoid extractelement being called on the vector part.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
1,800 ms | x64 debian > libarcher.races::task-dependency.c | |
830 ms | x64 debian > libomp.lock::omp_init_lock.c |
Event Timeline
Just a minor nit on the commit message, this patch is not really specific to AArch64 SVE but rather to scalable vectors.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 | Hi @RosieSumpter, I think it's worth elaborating a little bit more on the 'generate better code' in the comment. [(too) long explanation here] From what I understand, the code is better because the extractelement instruction that is otherwise generated (for scalar uses of this vector) may not always be folded away if the stepvector has multiple uses, leading to a redundant move (in case of element 0 for the vector-element-0 -> gpr move) or possibly expensive extractelement instructions (to extract a fixed-width lane from a scalable vector) for element > 0. In the former case, the value for element 0 is freely available because it is the start value of the stepvector. Can you maybe capture some of that in the comment? (albeit more succinctly) |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 | I had exactly the same questions as Sander. The main question I think is indeed why this is better, which it's not (that) obvious from the test changes. Thus, I was wondering, does this deserve adding some CodeGen tests? |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 | IMO an improved description should be sufficient. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 | Ok, but to be more explicit: this shows that the IR -> asm step isn't tested, is it? Why would we not test this? I think it would help too with explaining why this is better. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 | For now I have just updated the comment (hopefully it makes sense - I have tried to add some detail but keep it concise, but am happy to change it!) Also happy to add a codegen test if it's decided that it's necessary. |
Thanks @RosieSumpter that explains it well I think.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 |
Correct, it normally doesn't happen for individual IR passes that the resulting asm for a particular target is tested for a change to the transformation, right? IMO that should be avoided whenever possible, because it defeats the purpose of a unit test. In this case it should be sufficient to test explicitly that the extractelement does not exist after loop-vectorize. That happens with the extra CHECK-NOT line in sve-widen-phi.ll. The fmov is the code-generated equivalent to the extract element of element 0, so would there be any value in also code-generating this for AArch64 to show that the fmov is removed? Perhaps @RosieSumpter can add a similar comment to clarify the CHECK-NOT line to clarify this a bit more in the test itself? | |
4771 | nit: s/is not folded/is not folded away/ | |
4771–4772 | nit: s/we still need to calculate/we still calculate/ |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 | I was proposing an additional test, thus I was not proposing that a codegen test should replace an IR test, which would indeed be a bad idea. This change is/was talking about codegen improvements. The IR changes here don't make that very obvious IMO. So if I was doing this work, I would add additional codegen tests with the before/after IR as input, and check its codegen to make sure these improvements are there. Not sure if I am missing something or should be discussing adding tests here. Don't have strong opinions either, so will leave finishing the review up to you. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4769–4771 | Thanks, I understood what you meant about adding an additional test. In this case it's more about InstCombine not folding it away because the vector has multiple uses than it is about the target not being able to do something special with it, so from my perspective adding a codegen test for this would be a bit out of place. |
Hi all, apologies for my last comment on this patch - a mistake with email auto-complete led to an internal email being forwarded to Phabricator.
Hi @RosieSumpter, I think it's worth elaborating a little bit more on the 'generate better code' in the comment.
[(too) long explanation here]
From what I understand, the code is better because the extractelement instruction that is otherwise generated (for scalar uses of this vector) may not always be folded away if the stepvector has multiple uses, leading to a redundant move (in case of element 0 for the vector-element-0 -> gpr move) or possibly expensive extractelement instructions (to extract a fixed-width lane from a scalable vector) for element > 0.
In the former case, the value for element 0 is freely available because it is the start value of the stepvector.
In the latter case, there will be a cost regardless. Either the additional add/gep generated below to offset from the start value of the stepvector, or the extract from the stepvector itself. It's just expected that the scalar code will be cheaper.
Can you maybe capture some of that in the comment? (albeit more succinctly)