
[X86] Improve AVX1 shuffle lowering for v8f32 shuffles where the low half comes from V1 and the high half comes from V2 and the halves do the same operation
ClosedPublic

Authored by craig.topper on Jan 5 2018, 5:05 PM.

Details

Reviewers
RKSimon
spatel
Summary

To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we create a repeated lane shuffle of those new sources to produce the final result.

This fixes PR35833
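The transform described above can be sketched with a scalar model. This is illustrative only: `shuffle` and `perm2f128` are simplified stand-ins for an in-lane shuffle and the vperm2f128 instruction, and the masks below are examples of the pattern, not taken from the patch.

```cpp
#include <array>
#include <cassert>

using V8F = std::array<float, 8>;

// Reference two-input shuffle: mask indices 0-7 select from V1, 8-15 from V2.
static V8F shuffle(const V8F &V1, const V8F &V2, std::array<int, 8> M) {
  V8F R{};
  for (int i = 0; i < 8; ++i)
    R[i] = M[i] < 8 ? V1[M[i]] : V2[M[i] - 8];
  return R;
}

// vperm2f128-style lane select: Lo and Hi each pick one 128-bit (4-float)
// lane from {V1.lo, V1.hi, V2.lo, V2.hi}, numbered 0-3.
static V8F perm2f128(const V8F &V1, const V8F &V2, int Lo, int Hi) {
  auto lane = [&](int Sel, int i) { return (Sel < 2 ? V1 : V2)[(Sel & 1) * 4 + i]; };
  V8F R{};
  for (int i = 0; i < 4; ++i) {
    R[i] = lane(Lo, i);
    R[i + 4] = lane(Hi, i);
  }
  return R;
}
```

With NewV1 = perm2f128(V1, V2, 0, 2) (low halves of both sources) and NewV2 = perm2f128(V1, V2, 1, 3) (high halves), a mask like {0, 2, 4, 6, 8, 10, 12, 14} becomes the repeated-lane mask {0, 2, 8, 10} in each 128-bit lane.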

Diff Detail

Event Timeline

craig.topper created this revision.Jan 5 2018, 5:05 PM

Currently you call lowerVectorShuffleSplitLowHigh() from lowerV8F32VectorShuffle() only.

In order to cover v4f64, v4i64, v8i32 types,
can you add a call to lowerVectorShuffleSplitLowHigh() to other 256 bit vector lowering functions?
lowerV4F64VectorShuffle()
lowerV4I64VectorShuffle()
lowerV8I32VectorShuffle()

Can you add a test for "128-bit lane swapped" shuffle masks: [8, 10, 12, 14, 0, 2, 4, 6] and [9, 11, 13, 15, 1, 3, 5, 7]?

Add support for v4f64 as well.

Without AVX2, integer types are cast to FP so this supports v8i32/v4i64 without AVX2 as is.

What is the point of doing the adds/fadds in the new tests? It looks like it's just to test multiple shuffles per test, or is there a domain aspect to it that I'm missing?

The adds/fadds are just there because that's what was in the bug report. And I wanted to make sure we were able to reuse the vperm2f128 shuffles that create NewV1 and NewV2.

Hi Craig, I placed a simple test on the Bugzilla; your fix seems to be working fine.
The only thing is that there are a couple (or more) of shuffle masks which look like they could be subject to the same optimization, but the generated code for them remains unoptimized:
IN0: |0|1| | |4|5| | |
IN1: |8|9| | | | |E|F|
MASK011: 0,1,4,5,8,9,14,15
IN0: |0|1| | | | |6|7|
IN1: |8|9| | |C|D| | |
MASK012: 0,1,6,7,8,9,12,13

Can you show what code you expect for those cases and I'll see what I can do?

Four instructions perhaps:

vextractf128, vextractf128, vshufps, vblendps?

That would be better than the current six instructions.
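One way to see why masks like MASK011 miss the new lowering: after the lane-merge step, the per-lane masks don't repeat. Below is my own scalar model of that check, not the patch's code; the helper rewrites one output lane's mask as an in-lane mask given which 128-bit source lane feeds each new input.

```cpp
#include <array>
#include <cassert>

// Rewrite one output lane's mask (4 indices into the 16-wide concat of the
// two sources) as an in-lane mask, given which 128-bit source lane feeds
// NewV1 (SrcA) and which feeds NewV2 (SrcB) for that output lane.
// NewV1 elements map to 0-3, NewV2 elements to 8-11.
static std::array<int, 4> inLaneMask(std::array<int, 4> M, int SrcA, int SrcB) {
  std::array<int, 4> R{};
  for (int i = 0; i < 4; ++i) {
    int SrcLane = M[i] / 4, Elt = M[i] % 4;
    assert(SrcLane == SrcA || SrcLane == SrcB);
    R[i] = SrcLane == SrcA ? Elt : Elt + 8;
  }
  return R;
}
```

For MASK011 the two lanes produce different in-lane masks no matter which source lane is assigned to NewV1, so a single repeated-lane shuffle can't finish the job and the lowering falls back to the longer sequence.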

> The adds/fadds are just there because that's what was in the bug report. And I wanted to make sure we were able to reuse the vperm2f128 shuffles that create NewV1 and NewV2.

OK - please can you commit the tests to trunk so the patch shows the codegen diff?

craig.topper edited the summary of this revision. (Show Details)

Rebase now that tests have been committed.

Actually rebase the patch.

It looks like there is a lot of crossover with lowerVectorShuffleByMerging128BitLanes?

lib/Target/X86/X86ISelLowering.cpp
11362

Shouldn't RepeatMask be just LaneSize wide?

11392

Do we gain anything by relaxing this and keeping PermuteMask[i] as UNDEF if the original Mask[i] was UNDEF?

Yeah, there may be some crossover with lowerVectorShuffleByMerging128BitLanes. I'll see if I can generalize lowerVectorShuffleByMerging128BitLanes more.

lib/Target/X86/X86ISelLowering.cpp
11362

Yes it should.

11392

Not sure. I was trying to create a repeated lane shuffle, so it's based on both lanes. If it's undef in both lanes, it will be undef here.

@craig.topper Are you still looking at this?

Replace the contents of lowerVectorShuffleByMerging128BitLanes. Now we try to form two lane-correcting shuffles and a repeated shuffle mask.

The code goes through two passes to determine the repeated mask. First it looks for any lanes that need two source lanes and builds a repeating mask from those if any exist. Then it tries to fit the single-source lanes to that mask, duplicating the same source lane on both inputs if necessary to match the repeated mask.
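The two passes can be illustrated with a small scalar sketch. This is my own model of the idea, not the patch's code, and the mask below is a made-up example: lane 0 needs two source lanes (so it fixes the repeated mask), while lane 1 uses a single source lane (so it fits the mask by duplicating that source lane on both new inputs).

```cpp
#include <array>
#include <cassert>

using V8F = std::array<float, 8>;

// Reference two-input shuffle: mask indices 0-7 select from V1, 8-15 from V2.
static V8F shuffle(const V8F &V1, const V8F &V2, std::array<int, 8> M) {
  V8F R{};
  for (int i = 0; i < 8; ++i)
    R[i] = M[i] < 8 ? V1[M[i]] : V2[M[i] - 8];
  return R;
}

// Lane-correcting shuffle (vperm2f128-like): Lo and Hi each pick one 128-bit
// (4-float) lane from {V1.lo, V1.hi, V2.lo, V2.hi}, numbered 0-3.
static V8F perm2f128(const V8F &V1, const V8F &V2, int Lo, int Hi) {
  auto lane = [&](int Sel, int i) { return (Sel < 2 ? V1 : V2)[(Sel & 1) * 4 + i]; };
  V8F R{};
  for (int i = 0; i < 4; ++i) {
    R[i] = lane(Lo, i);
    R[i + 4] = lane(Hi, i);
  }
  return R;
}
```

Two lane-correcting shuffles plus one repeated-lane shuffle reproduce the original mask, even though lane 1 only ever reads one source lane.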

This gets the case in PR36933 now too. I haven't added the test for it to the patch yet.

It's regressing some of the avx512 cases now though, because we don't seem to be able to combine a two-source permi2q with a perm2f128 feeding it.

I forgot to mention: I tried to write the code in a way that should be somewhat straightforward to extend to 512 bits with 128-bit lanes.

Looks like X86ISD::SHUF128 isn't recognized as a target shuffle. That explains some of the regressions here. I'll submit a separate patch for that and we can rebase this one.

Thanks @craig.topper - please can you update the diff with context when you rebase? It can wait until the X86ISD::SHUF128 fix

Rebase and full context

RKSimon added inline comments.Jul 20 2018, 6:13 AM
lib/Target/X86/X86ISelLowering.cpp
13305

There is an is128BitLaneRepeatedShuffleMask(VT, Mask) version that you can use instead.

Some minor comments, and I'm a bit concerned about the regression in prefer-avx256-mask-shuffle.ll; not sure why it regresses so much.

lib/Target/X86/X86ISelLowering.cpp
13297

This comment + FIXME needs updating

13336–13341

This seems easier to grok (at least to me):

int Src;
if (Srcs[0] < 0 || Srcs[0] == LaneSrc)
  Src = 0;
else if (Srcs[1] < 0 || Srcs[1] == LaneSrc)
  Src = 1;
else
  return SDValue();

Srcs[Src] = LaneSrc;
InLaneMask[i] = (M % LaneSize) + Src * Size;
13349

for (int i = 0, e = M1.size(); i != e; ++i)

13357

for (int i = 0, e = MergedMask.size(); i != e; ++i) {

test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
210

This is the only notable regression - any idea why it breaks so badly?

Address review comments. Will investigate the test regression

craig.topper added inline comments.Jul 31 2018, 11:49 AM
test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
210

It looks like we go through lowerVectorShuffleAsLanePermuteAndBlend, which makes the unary shuffle non-unary. Then we go through lowerVectorShuffleByMerging128BitLanes, which creates a repeated mask. But we still weren't able to handle this repeated mask cleanly, so we end up shuffling and blending.

craig.topper added inline comments.Jul 31 2018, 11:54 AM
test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
210

Why can't shuffle combining merge the two vblendds with the vpermq to create two new vpermqs? Is it because the vpermq is used twice or the fact that vblendd is v8i32 and vpermq is v4i64? Or something else?

RKSimon added inline comments.Jul 31 2018, 1:35 PM
test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
210

Target shuffle combining mainly combines to single ops only; there might be scope to improve this if lower1BitVectorShuffle avoids mask vector extension to 512 bits when prefer-256 is enabled.

craig.topper added inline comments.Jul 31 2018, 2:00 PM
test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
210

The blends are on separate dependency chains, but they each use the same vpermq. I think each blend could independently be replaced with a vpermq, removing the shared vpermq.

RKSimon added inline comments.Aug 14 2018, 12:58 PM
test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
210

This seems to be the only remaining issue - are we better off committing now or finding a fix first?

I'm fine with committing it if you are.

RKSimon accepted this revision.Aug 15 2018, 2:05 PM

LGTM - I'll continue investigating how to combine VSELECT with target shuffles to fix that prefer-avx256-mask-shuffle.ll regression.

This revision is now accepted and ready to land.Aug 15 2018, 2:05 PM
RKSimon closed this revision.Aug 16 2018, 6:00 AM

Committed at rL339818 (typo in commit message prevented phab from catching it)