This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Use PerfectShuffle costs in AArch64TTIImpl::getShuffleCost
Closed · Public

Authored by dmgreen on Apr 8 2022, 10:17 AM.

Details

Summary

Given a shuffle with 4 elements of size 16 or 32 bits, we can use the costs directly from the PerfectShuffle tables to get a slightly more accurate cost for the resulting shuffle.
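
As a rough illustration of what a table lookup for a 4-element mask involves, here is a minimal sketch of the indexing step. It assumes the usual perfect-shuffle layout in which each lane is one base-9 digit (0-7 for an element of either input, 8 for undef). The name `perfectShuffleIndex` is hypothetical and not part of the patch, and the cost decoding from the table entry is deliberately left out.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper (not an LLVM API): map a 4-lane shuffle mask to a row
// number in a perfect-shuffle-style table. Lanes 0-3 select from the first
// input, 4-7 from the second, and a negative lane means "undef".
inline unsigned perfectShuffleIndex(const std::array<int, 4> &Mask) {
  unsigned Index = 0;
  for (int Lane : Mask) {
    assert(Lane < 8 && "a 4-lane, two-input mask only has elements 0..7");
    // Assumption: undef lanes are encoded as the extra digit 8.
    unsigned Digit = Lane < 0 ? 8u : static_cast<unsigned>(Lane);
    // Accumulate one base-9 digit per lane, first lane most significant.
    Index = Index * 9 + Digit;
  }
  return Index;
}
```

With this encoding the identity mask `{0, 1, 2, 3}` maps to index 102 and an all-undef mask maps to 6560, the last row of a 9^4 = 6561 entry table.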

Event Timeline

dmgreen created this revision. Apr 8 2022, 10:17 AM
Herald added a project: Restricted Project. Apr 8 2022, 10:17 AM
dmgreen requested review of this revision. Apr 8 2022, 10:17 AM
samtebbs added inline comments. Apr 13 2022, 3:15 AM
llvm/lib/Target/AArch64/AArch64PerfectShuffle.h
6590

Why do we have to limit this to 4x16 or 4x32 shuffles?

6602–6609

A comment about what is being done here would be beneficial.

6611–6617

A bit more elaboration here would be nice too.

dmgreen updated this revision to Diff 423573. Apr 19 2022, 3:11 AM
dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64PerfectShuffle.h
6590

There is a comment in the summary of D123379 that might help explain perfect shuffles. The quick version is that they only support 4-entry shuffles, because otherwise the tables we store would just be too large.
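
To put a number on "too large", here is a back-of-the-envelope sketch. It assumes one table entry per possible mask of a two-input shuffle, where each lane can pick any of the 2*N source elements or be undef. The function name `numShuffleMasks` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: count the distinct masks of a two-input shuffle with
// `Lanes` output lanes. Each lane has 2*Lanes source choices plus undef.
constexpr std::uint64_t numShuffleMasks(unsigned Lanes) {
  std::uint64_t Choices = 2ull * Lanes + 1;
  std::uint64_t Count = 1;
  for (unsigned I = 0; I < Lanes; ++I)
    Count *= Choices;
  return Count;
}

// 4 lanes: 9^4 entries, small enough to store as a static table.
static_assert(numShuffleMasks(4) == 6561, "9^4");
// 8 lanes: 17^8 entries, several billion, which is impractical to store.
static_assert(numShuffleMasks(8) == 6975757441ull, "17^8");
```

Under that assumption, going from 4 to 8 lanes grows the table by a factor of roughly a million.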

6611–6617

I'm not sure exactly what else to say, other than this is how perfect shuffle tables work :)

samtebbs accepted this revision. Apr 20 2022, 6:37 AM

LGTM!

llvm/lib/Target/AArch64/AArch64PerfectShuffle.h
6590

Ah thanks for the link, that explains it.

6611–6617

The comment in the post you linked did help a lot! cheers

This revision is now accepted and ready to land. Apr 20 2022, 6:37 AM
This revision was landed with ongoing or failed builds. Apr 27 2022, 4:09 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn. May 2 2022, 9:03 AM

Just a heads up, I'm seeing a few 4-8% regressions on different AArch64 CPUs with this change for a few benchmarks. I still need to isolate the binary changes.

Did you ever manage to come up with a reproducer? I'd hope this new cost model is generally more accurate, but you know cost modelling... The codegen might be off or there might be any number of second-order effects going wrong. Let me know.

fhahn added a comment. May 13 2022, 8:14 AM

Did you ever manage to come up with a reproducer? I'd hope this new cost model is generally more accurate, but you know cost modelling... The codegen might be off or there might be any number of second-order effects going wrong. Let me know.

The issue was that after this patch some code got vectorized when it wasn't profitable, but it looked like a general SLP issue. Previously it just didn't get vectorized because some ridiculously high costs were used for some shuffles. It looks like D115750 fixed the underlying issue, so the code is back to not getting vectorized and the original performance is restored :)