This is an archive of the discontinued LLVM Phabricator instance.

[x86] try harder to form 256-bit unpck*
ClosedPublic

Authored by spatel on Jan 12 2020, 11:10 AM.

Details

Summary

This is another part of a problem noted in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024

The AVX2 code may use awkward 256-bit shuffles vs. the AVX code that gets split into the expected 128-bit unpack instructions. We have to be selective in matching the types where we try to do this though. Otherwise, we can end up with more instructions (in the case of v8x32/v4x64).

Diff Detail

Event Timeline

spatel created this revision.Jan 12 2020, 11:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2020, 11:10 AM
RKSimon added inline comments.Jan 13 2020, 8:24 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
16239

Add a comment and move the VPERMD comment.

16305

Add a comment.

This generates a shuffle sequence, its probably better to place this after all the lowerShuffle* calls that create single (non-variable) instructions.

16410

Add a comment.

This generates a shuffle sequence, its probably better to place this after all the lowerShuffle* calls that create single (non-variable) instructions.

spatel updated this revision to Diff 237733.Jan 13 2020, 10:55 AM
spatel marked 3 inline comments as done.

Patch updated:

  1. Added code comments to better explain the strategy.
  2. Moved calls lower and within predicate blocks (expect V2 to be undef) where this makes sense to try.
lebedev.ri added inline comments.
llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll
1721

This looks like some demandedelts deficiency?

RKSimon added inline comments.Jan 13 2020, 11:51 AM
llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll
1721

D66004 might catch it, else it might be due to the constant mask already being lowered - we don't do much to simplify constant vectors already lowered to the constant pool.

spatel marked an inline comment as done.Jan 13 2020, 1:00 PM
spatel added inline comments.
llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll
1721

I think we've already lowered to constant pool loads.

This patch creates the unpack as expected, but then a later combine does:

Combining: t23: v8i32 = X86ISD::UNPCKL t22, t22
Creating new node: t60: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<1>, Constant:i32<1>, Constant:i32<2>, Constant:i32<2>, Constant:i32<3>, Constant:i32<3>
Creating new node: t61: v8i32 = X86ISD::VPERMV t60, t4

And then we lower the build_vector:

    t65: v8i32,ch = load<(load 32 from constant-pool)> t0, t67, undef:i64
    t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
  t61: v8i32 = X86ISD::VPERMV t65, t4
   t49: v8i32,ch = load<(load 32 from constant-pool)> t0, t51, undef:i64
    t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
  t43: v8i32 = X86ISD::VPERMV t49, t2
t16: v8i32 = X86ISD::BLENDI t61, t43, TargetConstant:i8<17>

Before we reach the BLENDI. Doesn't seem like there'd be much chance of improvement this late.

RKSimon accepted this revision.Jan 17 2020, 5:38 AM

LGTM - technically this is a limited version of a "lowerShuffleAsLanePermuteAndUNPCK" pattern that we might want to investigate in the future, but its a more controllable initial implementation.

This revision is now accepted and ready to land.Jan 17 2020, 5:38 AM
This revision was automatically updated to reflect the committed changes.