To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result.
This fixes PR35833
Paths
| Differential D41794
[X86] Improve AVX1 shuffle lowering for v8f32 shuffles where the low half comes from V1 and the high half comes from V2 and the halves do the same operation ClosedPublic Authored by craig.topper on Jan 5 2018, 5:05 PM.
Details Summary To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result. This fixes PR35833
Diff Detail
Event TimelineComment Actions currently you call lowerVectorShuffleSplitLowHigh() form lowerV8F32VectorShuffle() only. In order to cover v4f64, v4i64, v8i32 types, Comment Actions can you add a test for "128 bit line swapped" shuffle masks: [8, 10, 12, 14, 0, 2, 4, 6,], [9, 11, 13, 15, 1, 3, 5, 7 ] Comment Actions Add support for v4f64 as well. Without AVX2, integer types are cast to FP so this supports v8i32/v4i64 without AVX2 as is. Comment Actions What is the point of doing the adds/fadds in the new tests? It looks like its just to test multiple shuffle per test or is there a domain aspect to it that I'm missing? Comment Actions The adds/fadds are just there cause that's what was in the bug report. And I wanted to make sure we were able to reuse the vperm2f128 shuffles that create NewV1 and NewV2. Comment Actions Hi Craig, I placed simple test on the Bugzilla, your fix seems to be working fine. Comment Actions Four instructions perhaps: vextractf128, vextractf128, vshufps, vblendps? It is better than current six instructions. Comment Actions
OK - please can you commit the tests to trunk so the patch shows the codegen diff? Comment Actions bYeah there may be some crossover with lowerVectorShuffleByMerging128BitLanes. I'll see if I can generalize lowerVectorShuffleByMerging128BitLanes more.
Comment Actions Replace the contents of lowerVectorShuffleByMerging128BitLanes. Now we try to form 2 lane correcting shuffles and a repeated shuffle mask. The code goes through 2 passes to determine the repeated mask. First if looks for any lanes that need 2 source lanes and build a repeating mask from that if any exist. Then it will try to fit one source lanes to that mask duplicating the same source lane on both inputs if necessary to match the repeated mask. This gets the case in PR36933 now too. I haven't added the test for it to the patch yet. It's regressing some of the avx512 cases now though because we don't seem to be able to combine 2 source permi2q with a perm2f128 feeding it. Comment Actions I forget to mention, I tried to write the code in a way that it should be somewhat straightforward to extend to 512 bits with 128 bit lanes. Comment Actions Looks like X86ISD::SHUF128 isn't recognized as a target shuffle. That explains some of the regressions here. I'll submit a separate patch for that and we can rebase this one. Comment Actions Thanks @craig.topper - please can you update the diff with context when you rebase? It can wait until the X86ISD::SHUF128 fix
Comment Actions Some minors and I'm a bit concerned about the regression in prefer-avx256-mask-shuffle.ll - not sure why it regresses so much.
Comment Actions LGTM - I'll continue investigating how to combine VSELECT with target shuffles to fix that prefer-avx256-mask-shuffle.ll regression. This revision is now accepted and ready to land.Aug 15 2018, 2:05 PM
Revision Contents
Diff 158332 lib/Target/X86/X86ISelLowering.cpp
test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
test/CodeGen/X86/vector-shuffle-256-v16.ll
test/CodeGen/X86/vector-shuffle-256-v32.ll
test/CodeGen/X86/vector-shuffle-256-v4.ll
test/CodeGen/X86/vector-shuffle-256-v8.ll
|
Should't RepeatMask be just LaneSize wide?