
[X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets

Authored by RKSimon on Feb 12 2021, 7:28 AM.



Until AVX512 we don't have any vector truncation instructions, and always lower using shuffles instead.

combineVectorTruncation performs this earlier than lowering, as that makes it easier to exploit any bits known to be sign/zero-extended in the part being truncated away, which lets PACKSS/PACKUS perform the shuffle.
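To illustrate why the sign/zero-extended bits matter, here is a minimal scalar sketch of the saturation behaviour of a PACKSSDW/PACKUSDW-style pack on a single 32-bit element. It is a simplified model, not LLVM code, and all function names are hypothetical. A saturating pack only equals a plain truncation when the upper bits are already a sign extension (signed pack) or already zero (unsigned pack).

```python
def sat_signed16(x: int) -> int:
    """Clamp a signed 32-bit value to the signed 16-bit range (PACKSSDW-style)."""
    return max(-32768, min(32767, x))


def sat_unsigned16(x: int) -> int:
    """Clamp a signed 32-bit value to the unsigned 16-bit range (PACKUSDW-style)."""
    return max(0, min(65535, x))


def trunc16(x: int) -> int:
    """Plain truncation: keep the low 16 bits as an unsigned bit pattern."""
    return x & 0xFFFF


def packss_matches_trunc(x: int) -> bool:
    """True when the signed saturating pack gives the same bits as truncation."""
    return (sat_signed16(x) & 0xFFFF) == trunc16(x)


def packus_matches_trunc(x: int) -> bool:
    """True when the unsigned saturating pack gives the same bits as truncation."""
    return sat_unsigned16(x) == trunc16(x)


# -5 is sign-extended from 16 bits, so the signed pack is a valid truncation.
print(packss_matches_trunc(-5))
# 70000 does not fit in 16 signed bits, so the signed pack saturates instead.
print(packss_matches_trunc(70000))
# 40000 has zero upper bits, so the unsigned pack is a valid truncation.
print(packus_matches_trunc(40000))
```

In this model the combine is only safe when the relevant predicate holds for every element, which is why knowing the extended bits ahead of lowering is useful.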

We currently don't attempt to use combineVectorTruncation on AVX2 targets, as in the past 256-bit PACKSS/PACKUS tended to cause 128-bit lane shuffle regressions. These should now all be resolved by combineHorizOpWithShuffle, and in all cases we now reduce the amount of cross-lane shuffling and variable shuffle mask usage.
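The 128-bit lane issue comes from the fact that 256-bit pack instructions operate on each 128-bit lane independently, so the packed result is interleaved per lane and needs a 64-bit-granularity permute to restore element order. The sketch below is a simplified element-order model, not LLVM code. The function names are hypothetical, and the inputs are kept small so that saturation plays no role.

```python
def pack_lane(a4, b4):
    """One 128-bit lane of a dword-to-word pack: 4 elements from a, then 4 from b."""
    return list(a4) + list(b4)


def pack256(a, b):
    """Model a 256-bit pack of two 8-element inputs; each 128-bit lane is packed separately."""
    return pack_lane(a[0:4], b[0:4]) + pack_lane(a[4:8], b[4:8])


def permute_qwords(v, order):
    """Reorder the four 64-bit chunks (4 x i16 each) of a 16-element vector."""
    chunks = [v[i:i + 4] for i in range(0, 16, 4)]
    return [x for idx in order for x in chunks[idx]]


a = list(range(0, 8))    # elements 0..7 of the wide source
b = list(range(8, 16))   # elements 8..15 of the wide source

packed = pack256(a, b)
print(packed)            # lane-interleaved element order

# Swapping the two middle 64-bit chunks restores sequential order.
fixed = permute_qwords(packed, [0, 2, 1, 3])
print(fixed)
```

The single fix-up permute in this model stands in for the cross-lane shuffle that the pack-based truncation still needs on 256-bit vectors.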


Event Timeline

RKSimon created this revision. Feb 12 2021, 7:28 AM
RKSimon requested review of this revision. Feb 12 2021, 7:28 AM
Herald added a project: Restricted Project. Feb 12 2021, 7:28 AM
RKSimon added inline comments. Feb 12 2021, 7:29 AM

This will be fixed by D96413

RKSimon updated this revision to Diff 323397. Feb 12 2021, 10:48 AM
RKSimon edited the summary of this revision.

Rebase after D96413

xbolva00 added inline comments.


RKSimon added inline comments. Feb 23 2021, 11:08 AM

We remove lane crossing shuffles, a pshufb (so no constant pool mask load) and a domain crossing shufps. Some AVX2 targets won't care but others will (e.g. znver1 will love losing the lane shuffles).

pengfei added inline comments. Feb 23 2021, 5:38 PM

So does this mean some targets get worse and some get better?

craig.topper added inline comments. Feb 23 2021, 6:05 PM

Aren't most lane crossing shuffles on Intel 3 cycles?

RKSimon added inline comments. Feb 24 2021, 2:21 AM

By 'won't care' I meant the diff shouldn't be a regression on any target, but some targets would benefit more than others - in particular by getting rid of the vperm2f128, which has gotten slower since Haswell on Intel targets (and faster since Zen2 on AMD targets).

pengfei added inline comments. Feb 24 2021, 7:14 AM

I compared the uops of vperm2f128: Haswell and later Intel targets, as well as AMD Zen2, have the same performance: Lat = 3, Uops = 1. Zen1 has a big gap, since Lat = 4, Uops = 8.

pengfei accepted this revision. Mar 24 2021, 6:10 PM

LGTM. Thanks for improving it :)

This revision is now accepted and ready to land. Mar 24 2021, 6:10 PM

LGTM. Thanks for improving it :)

That's what we're here to do. Cheers!

This revision was landed with ongoing or failed builds. Mar 25 2021, 3:35 AM
This revision was automatically updated to reflect the committed changes.