Given a shuffle with 4 elements of size 16 or 32 bits, we can use the costs directly from the PerfectShuffle tables to get a slightly more accurate cost for the resulting shuffle.
| llvm/lib/Target/AArch64/AArch64PerfectShuffle.h | |
|---|---|
| 6590 | There is a comment in the summary of D123379 that might help explain perfect shuffles. The quick version is that they only support 4 entry shuffles, because otherwise the tables we store would just be too large. |
| 6611–6617 | I'm not sure exactly what else to say, other than this is how perfect shuffle tables work :) |
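To make the table-size argument concrete, here is a minimal sketch of how a 4-entry mask can be turned into a table index. It assumes the usual perfect-shuffle encoding: each mask element is a lane 0..7 of the two concatenated 4-lane inputs, undef is encoded as 8, and the four values are read as base-9 digits. The names `perfectShuffleIndex` and `tableEntries` are hypothetical, chosen for illustration, and are not taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: encode a 4-element shuffle mask as a base-9 index.
// Assumption: lanes 0..7 index the two concatenated 4-lane inputs, and a
// negative mask element means "undef", which is encoded as the digit 8.
inline unsigned perfectShuffleIndex(const std::array<int, 4> &Mask) {
  unsigned Index = 0;
  for (int M : Mask) {
    assert(M < 8 && "mask element out of range for two 4-lane inputs");
    unsigned Digit = M < 0 ? 8u : static_cast<unsigned>(M);
    Index = Index * 9 + Digit; // most significant digit first
  }
  return Index;
}

// Hypothetical helper: number of entries a table would need for a
// two-input shuffle of NumElts lanes, with (2 * NumElts + 1) choices per
// lane (every input lane, plus undef).
inline std::uint64_t tableEntries(unsigned NumElts) {
  std::uint64_t Entries = 1;
  for (unsigned I = 0; I != NumElts; ++I)
    Entries *= 2 * NumElts + 1;
  return Entries;
}
```

Under these assumptions a 4-entry table needs 9^4 = 6561 entries, while an 8-entry equivalent would need 17^8, roughly 7 billion, which is why the tables stop at 4 entries.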
Just a heads up, I'm seeing a few 4-8% regressions on different AArch64 CPUs with this change for a few benchmarks. I still need to isolate the binary changes.
Did you ever manage to come up with a reproducer? I hope this new cost model is generally more accurate, but you know cost modelling... The codegen might be off, or there might be any number of second-order effects going wrong. Let me know.
The issue was that after this patch some code got vectorized when it wasn't profitable, but it looked like a general SLP issue. Previously it just didn't get vectorized because some ridiculously high costs were used for some shuffles. It looks like D115750 fixed the underlying issue, and the code is back to not getting vectorized and the original performance is restored :)
Why do we have to limit this to 4x16 or 4x32 shuffles?