This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Cost all perfect shuffles entries as cost 1
ClosedPublic

Authored by dmgreen on Apr 8 2022, 4:14 AM.

Details

Summary

A brief introduction to perfect shuffles - AArch64 NEON has a number of shuffle operations - dups, zips, exts, movs etc that can in some way shuffle around the lanes of a vector. Given a shuffle of size 4 with 2 inputs, some shuffle masks can be easily codegen'd to a single instruction. A <0,0,1,1> mask for example is a zip LHS, LHS. This is great, but some masks are not so simple, like a <0,0,1,2>. It turns out we can generate that from zip LHS, <0,2,0,2>, and then generate <0,2,0,2> from uzp LHS, LHS, producing the result in 2 instructions.

It is not obvious from a given mask how to get there though. So we have a simple program (PerfectShuffle.cpp in the util folder) that can scan through all combinations of 4-element vectors and generate the "perfect" combination of results needed for each shuffle mask (for some definition of perfect). This is run offline to generate a table that is queried for generating shuffle instructions. (Because the table could get quite big, it is limited to 4 element vectors).

In the perfect shuffle tables zip, unz and trn shuffles were being cost as 2, which is higher than needed and skews the perfect shuffle tables to create inefficient combinations. This sets them to 1 and regenerates the tables for them. The codegen will usually be better and the costs should be more precise (but it can get less second-order re-use of values from multiple shuffles, these cases should be fixed up in subsequent patches.

Diff Detail

Event Timeline

dmgreen created this revision.Apr 8 2022, 4:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 8 2022, 4:14 AM
dmgreen requested review of this revision.Apr 8 2022, 4:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 8 2022, 4:14 AM
dmgreen edited the summary of this revision. (Show Details)Apr 8 2022, 6:24 AM

Thanks for the introduction to perfect shuffles! :-)

The codegen will usually be better and the costs should be more precise (but it can get less second-order re-use of values from multiple shuffles, these cases should be fixed up in subsequent patches.

The approach looks sensible, but can you elaborate why codegen is better? This isn't that obvious to me, and by quickly eyeballing it I might even spotted some regressions.

The codegen will usually be better and the costs should be more precise (but it can get less second-order re-use of values from multiple shuffles, these cases should be fixed up in subsequent patches.

The approach looks sensible, but can you elaborate why codegen is better? This isn't that obvious to me, and by quickly eyeballing it I might even spotted some regressions.

Thanks for taking a look. I think what I was meaning, was a combination of:

  • If you take all these patches that are linked together (plus one more for i64 lane moves, will upload soon), the sizes of the assemblies are better (or the same) for all/most the tests, I believe. So the ones that do get worse here get better elsewhere. I'm not sure that's true for every shuffle, but it seems to be an improvement in general.
  • Getting the costs correct is good if we start using these for the costs of shuffles (as in D123409).
  • For normal/simple shuffles, this should be an improvement because we will be using less instructions (in general). It won't chose to pick 2 ext as opposed to a trn, for example.
  • The problems can come where there are multiple shuffles together. A <16 x i32> shuffle is broken down, for example. Or the disguised_dup case stores the results of 2 shuffles. In these cases sometimes the two perfect shuffles happen to re-use nodes between them. (For example the 1,2,0,0 shuffle in disguised_dup happened to re-use the dup 0,0,0,0 from the other shuffle). Those second-order effects are hard to predict.
  • Those big tests might not be so useful. I have not committed them into tree yet, and I'm not sure if they are very useful to be added. I'll see.

I might try and do something about trying to get it to reuse more nodes, but it's not something where the solution is obvious. Anything you chose might make one case better and another worse.

SjoerdMeijer accepted this revision.Apr 14 2022, 1:37 AM

Thanks for explaining! Makes sense, LGTM.

This revision is now accepted and ready to land.Apr 14 2022, 1:37 AM
This revision was landed with ongoing or failed builds.Apr 19 2022, 4:05 AM
This revision was automatically updated to reflect the committed changes.