This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Try to vectorize tiny trees with shuffled gathers of extractelements.
ClosedPublic

Authored by ABataev on Apr 28 2021, 7:32 AM.

Details

Summary

If we gather extractelement instructions and they actually form a shuffle, it
might be profitable to vectorize them even if the tree is tiny.
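The core idea of the patch can be sketched as follows. This is an illustrative Python model only, not the SLPVectorizer data structures: a gather of extractelements can be treated as a single shufflevector when the scalars come from at most two source vectors, and the function name and representation here are hypothetical.

```python
def as_shuffle_mask(extracts, vec_width):
    """Check whether gathered extractelements form a shuffle.

    `extracts` is a list of (source_vector, lane) pairs, one per gathered
    scalar. If at most two distinct source vectors are involved, the gather
    can be modeled as one shufflevector and may be worth vectorizing even
    for a tiny SLP tree. Returns the combined shuffle mask, or None if the
    scalars cannot be expressed as a two-source shuffle.
    """
    sources = []
    mask = []
    for src, lane in extracts:
        if src not in sources:
            if len(sources) == 2:
                return None  # more than two source vectors: not a shuffle
            sources.append(src)
        # Lanes of the second source are numbered after the first, as in
        # LLVM's shufflevector mask convention.
        mask.append(sources.index(src) * vec_width + lane)
    return mask
```

For example, extracting lanes 0 and 2 of vector `a` and lanes 1 and 3 of vector `b` (both of width 4) yields the mask `[0, 5, 2, 7]`, while scalars drawn from three different vectors are rejected.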

Diff Detail

Event Timeline

ABataev created this revision.Apr 28 2021, 7:32 AM
ABataev requested review of this revision.Apr 28 2021, 7:32 AM
Herald added a project: Restricted Project.Apr 28 2021, 7:32 AM
RKSimon added inline comments.Apr 28 2021, 9:47 AM
llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions-inseltpoison.ll
43

Why do many of these libm vectorizations result in a v2f32 call plus 2 x f32 scalar calls? I'd expect either 2 x v2f32 or a single v4f32.

ABataev added inline comments.Apr 28 2021, 10:05 AM
llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions-inseltpoison.ll
43

Cost model. The cost of 4x calls is too high (call cost 18 (58-40) for %1 = tail call fast float @llvm.sin.f32(float %vecext)), and the cost of 2x calls is still high (call cost 6 (26-20) for the same call), but the cost of the extractelements with indices 1-2 is only 5 (they are removed by the vectorizer), which compensates for the cost of the inserts.
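The trade-off described in the comment above can be checked with the quoted numbers directly. The breakdown below is an assumption about how the deltas combine; the figures themselves are taken verbatim from the comment:

```python
# Vector-minus-scalar cost deltas quoted by the cost model:
cost_4x = 58 - 40    # vectorizing all four llvm.sin.f32 calls: delta +18
cost_2x = 26 - 20    # vectorizing only two of the calls: delta +6
extract_savings = 5  # extractelements at indices 1-2, removed by the vectorizer

# The 2x variant is chosen because the removed extracts (plus the
# compensated insert costs, not quantified in the comment) nearly
# cancel the remaining call-cost delta, while the 4x delta does not
# come close to being covered.
net_2x = cost_2x - extract_savings   # +1 before insert compensation
net_4x = cost_4x - extract_savings   # +13, clearly unprofitable
```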

Adding some people who might be able to advise on the AArch64 libm costs.

david-arm added inline comments.Apr 29 2021, 1:23 AM
llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions-inseltpoison.ll
43

I guess it is a bit difficult to follow the logic here. I can understand that extracting element 0 is basically free, so keeping the first scalar llvm.sin.f32 call makes sense, I suppose. Then we decide to make a vector call for elements 1 + 2, although I can't see where they are removed by the vectoriser; it still looks like we have 4 extractelements from the original <4 x float> vector.

I did try out the patch though and I can see with these changes we end up with 5 more lines of assembly in the generated code for this function, so it doesn't seem like a win to be honest. Perhaps there is an issue with the AArch64 cost model for the math calls?

llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll
21

At first glance this looks worse, but I've tried out your patch and can see the generated code is the same, because the entire first sequence of inserts, sext and trunc gets folded away: the sext + trunc pair is basically a no-op.
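The no-op mentioned above is easy to demonstrate: sign-extending a value to a wider type and then truncating back to the original width is the identity, so the backend can fold the whole sequence away. A small Python model (the 16-to-32-bit widths are illustrative; the test in ext-trunc.ll uses its own types):

```python
def sext(x, from_bits, to_bits):
    """Sign-extend an unsigned from_bits-wide value to to_bits bits."""
    sign = 1 << (from_bits - 1)
    x = (x ^ sign) - sign               # reinterpret the value as signed
    return x & ((1 << to_bits) - 1)     # wrap into to_bits bits

def trunc(x, to_bits):
    """Truncate to the low to_bits bits."""
    return x & ((1 << to_bits) - 1)

# trunc(sext(x)) back to the original width is the identity for every
# value, including negative (high-bit-set) ones:
for x in (0, 1, 0x7FFF, 0x8000, 0xFFFF):
    assert trunc(sext(x, 16, 32), 16) == x
```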

ABataev added inline comments.Apr 29 2021, 5:33 AM
llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll
21

Yeah, llvm-mca reports a throughput of 13.5 for the non-vectorized code and 15.5 with the vectorized call (the difference is smaller for newer processors). It looks like another example of a known problem with overly optimistic user cost compensation. This should go away once we land the proper implementation of insertelement instruction vectorization, but I'll try to prepare a temporary patch to improve the situation in the meantime.

Matt added a subscriber: Matt.Apr 29 2021, 12:50 PM
david-arm added inline comments.May 18 2021, 3:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4037–4038

Does the comment need updating here to reflect the change?

ABataev updated this revision to Diff 346236.May 18 2021, 11:56 AM

Rebase + address the comment

This revision is now accepted and ready to land.May 19 2021, 12:35 AM
ABataev updated this revision to Diff 346467.May 19 2021, 8:07 AM

Do not mark extractelement/insertelement instructions for demotion if they are zext/sext operands.

RKSimon accepted this revision.May 20 2021, 2:20 AM

LGTM