To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result.
This fixes PR35833
Differential D41794
[X86] Improve AVX1 shuffle lowering for v8f32 shuffles where the low half comes from V1 and the high half comes from V2 and the halves do the same operation craig.topper on Jan 5 2018, 5:05 PM. Authored by
Details To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result. This fixes PR35833
Diff Detail
Event TimelineComment Actions currently you call lowerVectorShuffleSplitLowHigh() form lowerV8F32VectorShuffle() only. In order to cover v4f64, v4i64, v8i32 types, Comment Actions can you add a test for "128 bit line swapped" shuffle masks: [8, 10, 12, 14, 0, 2, 4, 6,], [9, 11, 13, 15, 1, 3, 5, 7 ] Comment Actions Add support for v4f64 as well. Without AVX2, integer types are cast to FP so this supports v8i32/v4i64 without AVX2 as is. Comment Actions What is the point of doing the adds/fadds in the new tests? It looks like its just to test multiple shuffle per test or is there a domain aspect to it that I'm missing? Comment Actions The adds/fadds are just there cause that's what was in the bug report. And I wanted to make sure we were able to reuse the vperm2f128 shuffles that create NewV1 and NewV2. Comment Actions Hi Craig, I placed simple test on the Bugzilla, your fix seems to be working fine. Comment Actions Four instructions perhaps: vextractf128, vextractf128, vshufps, vblendps? It is better than current six instructions. Comment Actions bYeah there may be some crossover with lowerVectorShuffleByMerging128BitLanes. I'll see if I can generalize lowerVectorShuffleByMerging128BitLanes more.
Comment Actions Replace the contents of lowerVectorShuffleByMerging128BitLanes. Now we try to form 2 lane correcting shuffles and a repeated shuffle mask. The code goes through 2 passes to determine the repeated mask. First if looks for any lanes that need 2 source lanes and build a repeating mask from that if any exist. Then it will try to fit one source lanes to that mask duplicating the same source lane on both inputs if necessary to match the repeated mask. This gets the case in PR36933 now too. I haven't added the test for it to the patch yet. It's regressing some of the avx512 cases now though because we don't seem to be able to combine 2 source permi2q with a perm2f128 feeding it. Comment Actions I forget to mention, I tried to write the code in a way that it should be somewhat straightforward to extend to 512 bits with 128 bit lanes. Comment Actions Looks like X86ISD::SHUF128 isn't recognized as a target shuffle. That explains some of the regressions here. I'll submit a separate patch for that and we can rebase this one. Comment Actions Thanks @craig.topper - please can you update the diff with context when you rebase? It can wait until the X86ISD::SHUF128 fix
Comment Actions Some minors and I'm a bit concerned about the regression in prefer-avx256-mask-shuffle.ll - not sure why it regresses so much.
Comment Actions LGTM - I'll continue investigating how to combine VSELECT with target shuffles to fix that prefer-avx256-mask-shuffle.ll regression. |
Should't RepeatMask be just LaneSize wide?