If the scalars we gather are extractelement instructions that actually form
a shuffle, it might be profitable to vectorize them even if the tree is tiny.
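To illustrate what "gathered extracts that are just a shuffle" can mean, here is a minimal standalone C++17 sketch. It is not the SLPVectorizer code: `ExtractRef` and `asSingleSourceShuffle` are hypothetical names made up for this example, and it only models the simplest case, where every gathered scalar is extracted from the same source vector.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical model of one gathered scalar: an extractelement that reads
// lane `Index` of the vector identified by `SourceId`.
struct ExtractRef {
  int SourceId;
  int Index;
};

// Returns a shuffle mask if all gathered scalars come from one source vector
// and every lane index is in range; otherwise returns std::nullopt.
// Assumption: only the single-source case is modelled here.
std::optional<std::vector<int>>
asSingleSourceShuffle(const std::vector<ExtractRef> &Gathered,
                      int SourceWidth) {
  assert(SourceWidth > 0 && "source vector must have at least one lane");
  if (Gathered.empty())
    return std::nullopt;
  std::vector<int> Mask;
  for (const ExtractRef &E : Gathered) {
    if (E.SourceId != Gathered.front().SourceId)
      return std::nullopt; // mixes sources, not a single-source shuffle
    if (E.Index < 0 || E.Index >= SourceWidth)
      return std::nullopt; // lane out of range
    Mask.push_back(E.Index);
  }
  return Mask;
}
```

Under this model, gathering lanes 1 and 2 of one `<4 x float>` value yields the mask `{1, 2}`, so the gather could be expressed as a single shuffle rather than separate extracts and inserts.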
Repository: rG LLVM Github Monorepo
Event Timeline
llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions-inseltpoison.ll | ||
---|---|---|
43 | Why do many of these libm vectorizations result in one v2f32 call and 2 scalar f32 calls? I'd expect either 2 x v2f32 or a v4f32. |
llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions-inseltpoison.ll | ||
---|---|---|
43 | Cost model. The cost of 4x calls is too high (Call cost 18 (58-40) for %1 = tail call fast float @llvm.sin.f32(float %vecext)), and the cost of 2x calls is also high (Call cost 6 (26-20) for %1 = tail call fast float @llvm.sin.f32(float %vecext)), but it is offset by the cost of the extractelements with indices 1-2, which is 5 (they are removed by the vectorizer), plus the compensation for the costs of the inserts. |
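The arithmetic in this reply can be spelled out with a small C++ sketch. It only reuses the numbers quoted above (58-40, 26-20, and 5). The function names are hypothetical, the insert compensation is left as a parameter because the thread does not state its value, and the rule that a negative net cost means "profitable" is an assumption of the sketch.

```cpp
#include <cassert>

// Cost delta in the quoted "Call cost 18 (58-40)" form: vector cost minus
// scalar cost. A positive delta means the vector form is more expensive.
int costDelta(int VectorCost, int ScalarCost) {
  return VectorCost - ScalarCost;
}

// Net cost of a candidate tree: the call delta, reduced by the cost of the
// extractelements that would be removed and by the compensation for inserts.
// InsertCompensation is hypothetical; its real value is not given above.
int netCost(int CallDelta, int RemovedExtractCost, int InsertCompensation) {
  assert(RemovedExtractCost >= 0 && InsertCompensation >= 0);
  return CallDelta - RemovedExtractCost - InsertCompensation;
}

// Assumption of this sketch: a tree is profitable when its net cost is
// negative.
bool isProfitable(int NetCost) { return NetCost < 0; }
```

With the quoted figures, the 2x call delta of 6 minus the extract cost of 5 leaves a net cost of 1, so under these assumptions the decision hinges entirely on the insert compensation; any compensation of 2 or more would make the net cost negative.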
llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions-inseltpoison.ll | ||
---|---|---|
43 | I find the logic here a bit difficult to follow. I can understand that extracting element 0 is basically free, so keeping the first scalar llvm.sin.f32 makes sense. Then we decide to make a vector call for elements 1 and 2, although I can't see where the extractelements are removed by the vectoriser; it still looks like we have 4 extractelements from the original <4 x float> vector. I tried out the patch and can see that with these changes we end up with 5 more lines of assembly in the generated code for this function, so it doesn't seem like a win, to be honest. Perhaps there is an issue with the AArch64 cost model for the math calls? |
llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll | ||
---|---|---|
21 | At first glance this looks worse, but I've tried out your patch and can see the generated code is the same, because the entire first sequence of inserts, sext and trunc gets folded away: the sext + trunc is basically a no-op. |
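The "sext + trunc is basically a no-op" point can be checked with a short C++ sketch. The 16-bit and 64-bit widths are assumptions chosen for illustration and may not match the types in ext-trunc.ll; `sextThenTrunc` and `checkAllValues` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend a 16-bit value to 64 bits, then truncate back to 16 bits.
// The widths are assumed for illustration only.
int16_t sextThenTrunc(int16_t X) {
  int64_t Wide = static_cast<int64_t>(X); // models sext i16 -> i64
  return static_cast<int16_t>(Wide);      // models trunc i64 -> i16
}

// Exhaustively checks that the round trip returns the input unchanged.
void checkAllValues() {
  for (int V = INT16_MIN; V <= INT16_MAX; ++V) {
    int16_t X = static_cast<int16_t>(V);
    assert(sextThenTrunc(X) == X);
  }
}
```

Because truncating back to the original width discards exactly the bits that the sign extension added, the pair can be folded away without changing any value.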
llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll | ||
---|---|---|
21 | Yeah, llvm-mca gives throughput 13.5 without vectorization and 15.5 with the vectorized call (the difference is smaller for newer processors). This looks like another example of a known problem with overly optimistic user cost compensation. It should go away once we land the proper implementation of insertelement instruction vectorization, but I'll try to prepare a temporary patch to improve the situation in the meantime. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4037–4038 | Does the comment need updating here to reflect the change? |
Do not mark extractelement/insertelement instructions as candidates for demotion if they are zext/sext operands.