This is an archive of the discontinued LLVM Phabricator instance.

[X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets
ClosedPublic

Authored by RKSimon on Feb 12 2021, 7:28 AM.

Details

Summary

Until AVX512 we don't have any vector truncation instructions, and always lower using shuffles instead.

combineVectorTruncation performs this earlier than lowering, as that makes it easier to exploit any sign/zero-extended bits among the bits being truncated away and to use PACKSS/PACKUS to perform the shuffle.

We currently don't attempt to use combineVectorTruncation on AVX2 targets as in the past 256-bit PACKSS/PACKUS tended to cause 128-bit lane shuffle regressions - but these should now be all resolved with combineHorizOpWithShuffle and in all cases we now reduce the amount of cross-lane shuffling and variable shuffle mask usage.
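As a rough illustration of the idea in the summary, here is a scalar C++ sketch of truncating 16 x i32 to 16 x i16 with a 256-bit PACKUSDW. It is not the LLVM implementation: the helper names (satU16, packusdw256, permq, truncViaPackus) are hypothetical, and the explicit mask merely stands in for the case where the upper bits are already known to be zero. It shows that the pack works independently on each 128-bit lane, so a single 64-bit element permute is enough to restore element order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V8i32 = std::array<uint32_t, 8>;   // models one 256-bit register of dwords
using V16i16 = std::array<uint16_t, 16>; // models one 256-bit register of words

// Unsigned saturation of a signed 32-bit value to 16 bits (per-element PACKUSDW behaviour).
uint16_t satU16(int32_t v) {
  if (v < 0) return 0;
  if (v > 0xFFFF) return 0xFFFF;
  return static_cast<uint16_t>(v);
}

// Scalar model of 256-bit VPACKUSDW: each 128-bit lane packs 4 dwords of 'a'
// followed by 4 dwords of 'b' taken from the same lane.
V16i16 packusdw256(const V8i32 &a, const V8i32 &b) {
  V16i16 r{};
  for (int lane = 0; lane < 2; ++lane) {
    for (int i = 0; i < 4; ++i) {
      r[lane * 8 + i] = satU16(static_cast<int32_t>(a[lane * 4 + i]));
      r[lane * 8 + 4 + i] = satU16(static_cast<int32_t>(b[lane * 4 + i]));
    }
  }
  return r;
}

// Scalar model of VPERMQ: moves whole 64-bit quadwords (4 words each).
V16i16 permq(const V16i16 &v, const std::array<int, 4> &idx) {
  V16i16 r{};
  for (int q = 0; q < 4; ++q) {
    assert(idx[q] >= 0 && idx[q] < 4);
    for (int i = 0; i < 4; ++i)
      r[q * 4 + i] = v[idx[q] * 4 + i];
  }
  return r;
}

// Truncate 16 x i32 (two registers) to 16 x i16.
// Assumption: clearing the upper 16 bits first means the saturation in
// PACKUSDW can never fire, so the pack behaves as a plain truncation.
V16i16 truncViaPackus(V8i32 lo, V8i32 hi) {
  for (auto &x : lo) x &= 0xFFFFu;
  for (auto &x : hi) x &= 0xFFFFu;
  // The pack yields [lo0-3, hi0-3, lo4-7, hi4-7]; quadword order {0,2,1,3}
  // restores [lo0-7, hi0-7].
  return permq(packusdw256(lo, hi), {0, 2, 1, 3});
}
```

When the source elements are already known to be zero-extended (or sign-extended, for the PACKSS equivalent), the masking step is unnecessary, which is the situation the summary describes as easier to spot before lowering.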

Diff Detail

Event Timeline

RKSimon created this revision. Feb 12 2021, 7:28 AM
RKSimon requested review of this revision. Feb 12 2021, 7:28 AM
Herald added a project: Restricted Project. Feb 12 2021, 7:28 AM
RKSimon added inline comments. Feb 12 2021, 7:29 AM
llvm/test/CodeGen/X86/psubus.ll
1526

This will be fixed by D96413

RKSimon updated this revision to Diff 323397. Feb 12 2021, 10:48 AM
RKSimon edited the summary of this revision.

Rebase after D96413

xbolva00 added inline comments.
llvm/test/CodeGen/X86/vector-reduce-and-bool.ll
561

Worse?

RKSimon added inline comments. Feb 23 2021, 11:08 AM
llvm/test/CodeGen/X86/vector-reduce-and-bool.ll
561

We remove lane crossing shuffles, a pshufb (so no constant pool mask load) and a domain crossing shufps. Some AVX2 targets won't care but others will (e.g. znver1 will love losing the lane shuffles).

pengfei added inline comments. Feb 23 2021, 5:38 PM
llvm/test/CodeGen/X86/vector-reduce-and-bool.ll
561

So it means some targets get worse and some get better?

craig.topper added inline comments. Feb 23 2021, 6:05 PM
llvm/test/CodeGen/X86/vector-reduce-and-bool.ll
561

Aren't most lane crossing shuffles on Intel 3 cycles?

RKSimon added inline comments. Feb 24 2021, 2:21 AM
llvm/test/CodeGen/X86/vector-reduce-and-bool.ll
561

By 'won't care' I meant the diff shouldn't be a regression on any target but some targets would benefit more than others - in particular by getting rid of the vperm2f128, which has gotten slower since Haswell on Intel targets (and faster since Zen2 on AMD targets).

pengfei added inline comments. Feb 24 2021, 7:14 AM
llvm/test/CodeGen/X86/vector-reduce-and-bool.ll
561

I compared the uops of vperm2f128: Haswell and later Intel targets, as well as AMD Zen2, have the same performance: Lat = 3, Uops = 1. Zen1 has a big gap, since Lat = 4, Uops = 8.

pengfei accepted this revision. Mar 24 2021, 6:10 PM

LGTM. Thanks for improving it :)

This revision is now accepted and ready to land. Mar 24 2021, 6:10 PM

LGTM. Thanks for improving it :)

That's what we're here to do. Cheers!

This revision was landed with ongoing or failed builds. Mar 25 2021, 3:35 AM
This revision was automatically updated to reflect the committed changes.