This is an archive of the discontinued LLVM Phabricator instance.

[CostModel][AArch64] Add some initial costs for SK_Select and SK_PermuteSingleSrc
ClosedPublic

Authored by RKSimon on Jun 14 2018, 8:16 AM.

Details

Summary

AArch64 was only setting costs for SK_Transpose, which meant that many of the simpler shuffles (e.g. SK_Select and SK_PermuteSingleSrc for larger vector elements) was being severely overestimated by the default shuffle expansion.

This patch adds costs to help improve SLP performance and avoid a regression in reductions in an upcoming patch that will allow SLP to recognise a lot more SK_Select shuffle patterns.

I'm not very knowledgable about AArch64 shuffle lowering so I've kept the extra costs to a minimum - someone who knows this code can add extra costs which should improve vectorization a lot more.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jun 14 2018, 8:16 AM
evandro added inline comments.Jun 14 2018, 9:29 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
978

I'm not sure that the costs of 64 and 128 bits long vectors should be different.

993

Ditto.

RKSimon added inline comments.Jun 14 2018, 9:59 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
978

I'm going off the codegen in the shuffle-select.ll tests below where 2X32 selects are a move and 4X32 selects are a rev+trn.

If you have better cost estimates please suggest them - I don't know much about the aarch64 microarch and I'm really just wanting to fix regressions in D48174 caused by the existing (dubious) shuffle costs.

993

I wasn't sure what the best form shuffle test were for these cases - do you know what the likely worst offending shuffle masks will be? If you can give me some hints I'd be happy to create a shuffle-single-src.ll costs test file to at least cover these types.

evandro added inline comments.Jun 14 2018, 12:28 PM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
978

That's an artifact due to the mask values being different for both tests. Were they, say, <0, 2> in sel.v2i32() and <0, 2, 4, 6> in sel.v4i32(), both would result in a single instruction performing the shuffle.

RKSimon added inline comments.Jun 14 2018, 1:15 PM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
978

But this is for SK_Select style patterns:

2x32 - <0,3> or <2,1> - this will always be 1 mov AFAICT
4x32 - <0,1,2,7>, <0,1,6,3>, <0,5,2,7>, etc. - I couldn't cause this to be anything more than 2 instructions, sometimes independent othertimes not.

That was the reasoning behind the costs - I'm a litttle more unsure about what would is the maximum for SK_PermuteSingleSrc but I reckoned it would be 2 instructions as well.

Can anyone recommend the best shuffle costs to put here for the 32/64 cases?

evandro added inline comments.Jun 19 2018, 10:39 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
979

Independently of the costs, this lookup table could be simplified to a test on the number of vector elements.

994

Ditto.

RKSimon added inline comments.Jun 19 2018, 11:15 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
979

That's trivial to do - the tables look easier to grok, but I'm happy to make the change.

RKSimon updated this revision to Diff 151956.Jun 19 2018, 11:29 AM

Removed shuffle cost tables

RKSimon updated this revision to Diff 152087.Jun 20 2018, 7:44 AM

Rebased now that D48174 has landed - shows the full vectorisation of the reduction_v4i32 test (prior to D48174 it only vectorised the second half).

Next time, when one patch causes a regression that another patch fixes, please, make the one a parent of the other.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
982

Perhaps you should assert instead of testing here.

991

Ditto.

evandro added inline comments.Jun 20 2018, 8:33 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
980

Please, comment on what you are trying to achieve here.

989

And here too.

RKSimon updated this revision to Diff 152113.Jun 20 2018, 10:22 AM

I'm trying a different approach which should work better as additional costs are added for other types or shuffle kinds - this will be beneficial to aarch64 as SLP etc. starts using the ShuffleKind matches more directly (see D48236).

I've gone back to using the costs tables as they tend to be easier to understand at a glance and the number of entries/calls aren't really large enough to be a perf issue. I've also changed the key to use the ShuffleKind, which matches what we do on X86, again for clarity.

It looks reasonable to me. If no one else objects in a couple of days or so, it LGTM.

efriedma added inline comments.
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
958

I'm pretty sure the general case here is not rev+trn, but I guess it should be two instructions in general (it's always at most two vector-vector element insertions).

965

For a single source with two elements, a shuffle is either an identity shuffle, a splat, or a reversal; I guess those are all in fact 1 instruction.

966

egrep -c "[0-3],[0-3],[0-3],[0-3].*Cost [3-4]" lib/Target/AArch64/AArch64PerfectShuffle.h says there are 182 permutes which cost more than 2. But I guess this is a reasonable estimate not knowing anything else.

RKSimon added inline comments.Jun 21 2018, 4:46 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
966

Out of interest, are those perfect shuffle costs based off throughput or number of instructions?

RKSimon updated this revision to Diff 152250.Jun 21 2018, 4:51 AM

Updated SK_Select comment to make it clear that the shuffle could be a rev+trn, or something else....

RKSimon updated this revision to Diff 152251.Jun 21 2018, 4:55 AM

Use worst case PermuteSingleSrc costs from perfect shuffle tool

efriedma accepted this revision.Jun 21 2018, 12:51 PM

LGTM

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
966

All the shuffles it uses have the same throughput on an A57, so there isn't really any difference.

This revision is now accepted and ready to land.Jun 21 2018, 12:51 PM
This revision was automatically updated to reflect the committed changes.