This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Generate vector_shuffle nodes for undersized result vector sizes
ClosedPublic

Authored by mkuper on Aug 25 2016, 5:24 PM.

Details

Summary

Today, we can generate a vector_shuffle from an IR shuffle where the size of the result is exactly the sum of the sizes of the input vectors. If the output vector is smaller - e.g. a <12 x i8> being formed by a shuffle with two <8 x i8> inputs, we emit a sequence of extracts and inserts.

Instead, we can form a larger vector_shuffle, and then extract a subvector of the right size - e.g. shuffle the two <8 x i8>-s into a <16 x i8> and extract a <12 x i8>.

This solves PR29025.

This has a dependency on D23893 - forming a vector instead of a series of inserts/extracts pessimizes a test, because of a mess created by legalization, and we need D23893 for clean-up.

Diff Detail

Event Timeline

mkuper updated this revision to Diff 69305.Aug 25 2016, 5:24 PM
mkuper retitled this revision from to [SelectionDAG] Generate vector_shuffle nodes for undersized result vector sizes.
mkuper updated this object.
mkuper added reviewers: spatel, RKSimon, craig.topper.
mkuper added a subscriber: llvm-commits.
RKSimon edited edge metadata.Aug 26 2016, 3:15 AM

Would it be possible to test some other vector sizes? vf32 / v3f64 turn up a lot in vector code for example.

Would it be possible to test some other vector sizes? vf32 / v3f64 turn up a lot in vector code for example.

Sure, will add more tests.
Thanks!

Also, this has a really bad (but easy to fix) bug in it, that causes the shuffle sequence to be completely wrong...
I'm going to commit a test-case that checks the code we currently generate, for different types, and then put up a patch that shows the differences.

mkuper updated this revision to Diff 69452.Aug 26 2016, 5:46 PM
mkuper edited edge metadata.

Fixed bug, and added test cases.
And, fortunately (thanks again, Simon!) while adding the tests I discovered that it regresses AVX2 codegen in some cases (even though it improves AVX codegen for the same tests). So this can't really be committed as is.

The difference comes down to which pattern shuffle lowering likes more:

    t2: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg0
    t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg1
  t8: v8i32 = concat_vectors t2, t4
t10: v8i32 = vector_shuffle<0,6,3,6,1,7,4,u> t8, undef:v8i32

Or

    t2: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg0
  t8: v8i32 = concat_vectors t2, undef:v4i32
    t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %vreg1
  t9: v8i32 = concat_vectors t4, undef:v4i32
t10: v8i32 = vector_shuffle<0,10,3,10,1,11,8,u> t8, t9

Before this patch, the builder created a sequence of inserts and extracts that eventually got combined into the first version - which is preferred by AVX2, because it can efficiently lower it with VPERM. With this patch, we directly create the second version, which AVX handles better, but AVX2 chokes on. I could modify the patch to directly create the first version instead, which would solve the original problem in PR29025 - but then we won't get the AVX gains for i32 vectors.
So we'll need both a builder change (that generates one of the two patterns above) and either a combine that picks the better version based on profitability, or lowering improvements. I'm still rather fuzzy on the details, though.

mkuper updated this revision to Diff 69729.Aug 30 2016, 11:07 AM

This version no longer has the AVX2 regressions, but I don't really know if it's the right way to approach it.
Any thoughts?

spatel edited edge metadata.Aug 30 2016, 4:20 PM

This version no longer has the AVX2 regressions, but I don't really know if it's the right way to approach it.
Any thoughts?

This seems fine to me, but I'm not sure what the alternatives are, so I'll let someone else comment on that.

I've just pointed out some nits in the inline comments.

lib/Target/X86/X86ISelLowering.cpp
26342–26343

Hard to parse that sentence. Is this accurate/better?
"We are looking for a shuffle where both sources are concatenated with undef and have a width that is half of the output's width."

26344

this a -> this as a

26344–26348

This function is specifically for AVX2, so I think the check that is currently outside of the function:

if (Subtarget.hasInt256())

should be inserted at the start in here. Also, it should be "hasAVX2()" rather than "hasInt256()" because this is distinguishing AVX2-ness, not 256-bit-ness.

Thanks, Sanjay!

Just to be clear - the other two alternatives I see are:

  1. Construct (vector_shuffle <mask> (concat_vectors t1, t2), undef) in the SelectionDAG, and have a target-specific combine in the other direction.
  2. Handle this directly in the shuffle lowering code, instead of a dag combine.
lib/Target/X86/X86ISelLowering.cpp
26342–26343

Yes, thanks!

26344–26348

Right, should have been hasAVX2(), thanks.

Regarding where the check should be - I'm not sure what's better, TBH. I'll move it into the function as you suggest, if anyone objects, let me know.

Thanks for working on this - we don't do much to improve shuffles of concat/inserted subvectors at all yet.

lib/Target/X86/X86ISelLowering.cpp
26349

You need an early out for !isa<ShuffleVectorSDNode>(N) as I think we can get here from target shuffle nodes as well as shufflevector nodes.

mkuper added inline comments.Aug 31 2016, 2:54 PM
lib/Target/X86/X86ISelLowering.cpp
26349

Right, thanks!

mkuper updated this revision to Diff 69908.Aug 31 2016, 2:54 PM
mkuper edited edge metadata.

Added early exit for target shuffles.

RKSimon added inline comments.Sep 1 2016, 3:42 AM
lib/Target/X86/X86ISelLowering.cpp
26358

Couldn't f32/f64 use this?

mkuper added inline comments.Sep 1 2016, 10:51 AM
lib/Target/X86/X86ISelLowering.cpp
26358

Right, I just wanted to exclude non-32/64-bit types, I'll add f32/f64 and appropriate tests. Thanks!

mkuper updated this revision to Diff 70031.Sep 1 2016, 11:14 AM

Added f32/f64.

RKSimon accepted this revision.Sep 1 2016, 1:25 PM
RKSimon edited edge metadata.

LGTM - with one minor.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
3090

Possibly remove this, initialize as MappedOps(PaddedMaskNumElts, -1) and then just insert MappedOps[i] = Idx in the readjustment loop.

This revision is now accepted and ready to land.Sep 1 2016, 1:25 PM
mkuper closed this revision.Sep 1 2016, 2:41 PM

Closed by rL280418.