This is an archive of the discontinued LLVM Phabricator instance.

[X86] Make lowerShuffleAsLanePermuteAndPermute use sublanes on AVX2
Closed, Public

Authored by TellowKrinkle on Aug 24 2020, 12:07 AM.

Details

Summary

Extends lowerShuffleAsLanePermuteAndPermute to search for opportunities to use vpermq (64-bit cross-lane shuffle) and vpermd (32-bit cross-lane shuffle) to get elements into the correct lane, in addition to the 128-bit full-lane permutes it previously searched for.
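As a rough illustration of the search, here is a minimal standalone sketch (illustrative only: the function name matches the patch's getSublanePermute, but the body is a simplification, since the real code additionally allows a follow-up in-lane permute to reorder elements within each 128-bit lane after the cross-lane step):

```cpp
#include <cstdio>
#include <optional>
#include <vector>

// For a given sublane count, look for a cross-lane sublane permute: each
// destination sublane must draw all of its defined elements from a single
// source sublane. Returns the per-sublane source indices, or nullopt on
// conflict.
std::optional<std::vector<int>> getSublanePermute(const std::vector<int> &Mask,
                                                  int NumSublanes) {
  int NumElts = (int)Mask.size();
  int EltsPerSublane = NumElts / NumSublanes;
  std::vector<int> SublaneMask(NumSublanes, -1);
  for (int i = 0; i < NumElts; ++i) {
    int M = Mask[i];
    if (M < 0)
      continue; // undef element: no constraint
    int SrcSublane = M / EltsPerSublane;
    int DstSublane = i / EltsPerSublane;
    if (SublaneMask[DstSublane] < 0)
      SublaneMask[DstSublane] = SrcSublane;
    else if (SublaneMask[DstSublane] != SrcSublane)
      return std::nullopt; // needs two sources: try a finer granularity
  }
  return SublaneMask;
}

int main() {
  // v8i32 swap of the two 128-bit halves: representable at every granularity.
  std::vector<int> Swap = {4, 5, 6, 7, 0, 1, 2, 3};
  // v8i32 mask mixing elements 0 and 7 into one quad-word pair: only the
  // 32-bit (vpermd) granularity can source each destination sublane from a
  // single place.
  std::vector<int> Mixed = {0, 7, 2, 3, 4, 5, 6, 1};
  for (int NumSublanes : {2, 4, 8}) // 128-, 64-, 32-bit sublanes of a YMM
    std::printf("%d sublanes: swap=%s mixed=%s\n", NumSublanes,
                getSublanePermute(Swap, NumSublanes) ? "ok" : "conflict",
                getSublanePermute(Mixed, NumSublanes) ? "ok" : "conflict");
  return 0;
}
```

In the final lowering the coarsest granularity that works is taken first, since vpermd is potentially more expensive than vpermq and needs an extra register for its mask (as discussed in the review below).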

This is especially helpful for cross-lane byte shuffles, where the alternative tends to be "vpshufb both lanes separately and blend them with a vpblendvb", which is very expensive, especially on Haswell, where vpblendvb uses the same execution port as all the shuffles.

Addresses Bug 47262

Diff Detail

Event Timeline

TellowKrinkle created this revision. Aug 24 2020, 12:07 AM
Herald added a project: Restricted Project. Aug 24 2020, 12:07 AM
TellowKrinkle requested review of this revision. Aug 24 2020, 12:07 AM

While most of the test output changes seemed positive, a few stood out to me as being worse:

  • shuffle_v16i16_00_16_01_17_02_18_03_27_08_24_09_25_10_26_11_27 previously got lowered to a blend of two vunpcklwds, which then got optimized into a single vunpcklwd. Since the new codegen no longer outputs an unpcklwd, that optimization no longer applies.
  • The AVX512VL output for shuffle_v16i16_uu_uu_uu_uu_04_05_06_11_uu_uu_uu_uu_12_13_14_11 now loses track of the fact that several of its outputs are undef. The output from lowerShuffleAsLanePermuteAndPermute is a <4,5,6,7,8,9,10,11,12,13,14,15,8,9,10,11> shuffle followed by a <u,u,u,u,0,1,2,7,u,u,u,u,8,9,10,15> shuffle, which get combined back together in a later pass (see the mask-composition sketch after this list). Not sure if this is important.
  • The output of the pass is often asymmetric, and therefore doesn't map onto fixed shuffles, so almost everything ends up using vpshufb. On targets with fast-variable-shuffle this is fine, since it usually saves a few instructions, but on targets without it the results are mixed: many byte shuffles already required vpshufb anyway, so this can reduce the number of them or at least the number of other instructions, but many word shuffles go from needing no variable shuffles to requiring them (e.g. shuffle_v16i16_00_01_00_01_02_03_02_11_08_09_08_09_10_11_10_11).
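On the second point, the combine can be reproduced by composing the two masks by hand. A small standalone sketch (not LLVM code; the masks are the ones listed above):

```cpp
#include <cstdio>
#include <vector>

// Compose two single-input shuffle masks: Outer indexes into the result of
// Inner, so the combined mask is Inner[Outer[i]] (with -1 = undef preserved).
std::vector<int> composeMasks(const std::vector<int> &Inner,
                              const std::vector<int> &Outer) {
  std::vector<int> Combined;
  for (int M : Outer)
    Combined.push_back(M < 0 ? -1 : Inner[M]);
  return Combined;
}

int main() {
  // The two masks described above (-1 = undef):
  std::vector<int> Inner = {4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 8, 9, 10, 11};
  std::vector<int> Outer = {-1, -1, -1, -1, 0, 1, 2, 7, -1, -1, -1, -1, 8, 9, 10, 15};
  // Prints the original test mask <u,u,u,u,4,5,6,11,u,u,u,u,12,13,14,11>.
  // Note that Inner itself has no undef elements, which is where the undef
  // information can get dropped once a later pass re-combines the pair.
  for (int M : composeMasks(Inner, Outer))
    std::printf("%d ", M);
  std::printf("\n");
  return 0;
}
```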

I'm on PTO this week but will try to take a look soon.

Please can you pre-commit the PR47262 test case with trunk's current codegen and rebase so the patch shows the diff.

nit: typo lowerShuffleAsLanePermuteAndShuffle -> lowerShuffleAsLanePermuteAndPermute

> Please can you pre-commit the PR47262 test case with trunk's current codegen and rebase so the patch shows the diff.

I don't have commit access, but here's a patch for the PR47262 test case with trunk's current codegen; what should I do with it?

RKSimon added a subscriber: spatel.

@craig.topper @spatel Please can somebody add the new test case to trunk - I'm away from my dev machine for the next few days.

rG21a008bbba7

@TellowKrinkle - you'll need to update/rebase now and regenerate the CHECK lines for that test file. Then we'll see the test diff/improvement for the example from the bug report.

TellowKrinkle retitled this revision from [X86] Make lowerShuffleAsLanePermuteAndShuffle use sublanes on AVX2 to [X86] Make lowerShuffleAsLanePermuteAndPermute use sublanes on AVX2.
TellowKrinkle edited the summary of this revision. (Show Details)

Rebased to latest master

RKSimon added inline comments. Aug 26 2020, 2:21 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
15540

Maybe pull getSublanePermute inside lowerShuffleAsLanePermuteAndPermute as a lambda?

15571

What happens if you try the default (AVX1/binary shuffle) variant first and then the AVX2/AVX2-FAST unary shuffles afterward?

15580

Use the existing narrowShuffleMask helper to do this?
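For context, the suggested helper scales a shuffle mask down to narrower elements. Roughly the following (a sketch of the semantics inferred from the name and its use here, not the actual LLVM helper's signature):

```cpp
#include <cstdio>
#include <vector>

// Scale a shuffle mask to elements Scale times narrower: each wide index M
// expands into the Scale consecutive narrow indices M*Scale .. M*Scale+Scale-1.
std::vector<int> narrowMask(const std::vector<int> &WideMask, int Scale) {
  std::vector<int> NarrowMask;
  for (int M : WideMask)
    for (int i = 0; i < Scale; ++i)
      NarrowMask.push_back(M < 0 ? -1 : M * Scale + i);
  return NarrowMask;
}

int main() {
  // A 128-bit lane swap <1,0> of a YMM register becomes the vpermq mask
  // <2,3,0,1> at 64-bit granularity.
  for (int M : narrowMask({1, 0}, 2))
    std::printf("%d ", M); // prints: 2 3 0 1
  std::printf("\n");
  return 0;
}
```

Incidentally, this is also why each finer granularity subsumes the coarser ones, as noted further down: any mask a coarse sublane permute can handle is still handled after narrowing.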

Move getSublanePermute to a lambda and use narrowShuffleMask instead of manually doing its job

TellowKrinkle added inline comments. Aug 26 2020, 5:08 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
15571

Do you mean check 128-bit sublanes on both sides of the if?

It undoes the cases where a vpermq ymm0 = ymm0[2,3,2,3] was getting turned into a ymm0 = ymm0[3,3,3,3], or a ymm0 = ymm0[0,1,0,1] into a vpbroadcastq, but that's about it.

AFAIK the 64-bit sublane version handles everything the 128-bit sublane version does (and the 32-bit sublane version handles everything the 64-bit sublane version does); it's just that vpermd is potentially more expensive than vpermq and needs an extra register for the mask.

TellowKrinkle marked 2 inline comments as done. Aug 26 2020, 5:09 PM
RKSimon added inline comments. Aug 27 2020, 8:29 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
15629

Please can you try pulling this into the lambda as well? And refactor to try the 128-bit lanes first, then 64-/32-bit sublanes for the AVX2/fast-AVX2 cases? Sorry for asking you to experiment like this, but I'm away from a dev machine for the next week.

Always check lane permute first, move return value generation into lambda

Please can you rebase? I'm not sure if D66004 will have affected this

> Please can you rebase? I'm not sure if D66004 will have affected this

rG888049b97a74 may affect your patch as well (probably just preventing a clean merge)

Rebase to latest master

RKSimon accepted this revision. Sep 3 2020, 6:20 AM

LGTM - we should be able to clear up the remaining issues with some tweaks to the order in which we attempt to lower v16i16 shuffles.

This revision is now accepted and ready to land. Sep 3 2020, 6:20 AM

I don't have commit access, so someone else will have to merge it.