This is an archive of the discontinued LLVM Phabricator instance.

X86 AVX2: Prefer one VPERMV over ShuffleAsRepeatedMaskAndLanePermute
AbandonedPublic

Authored by zvi on Dec 5 2017, 3:21 PM.

Details

Summary

Reverse the order of precedence in lowering v8i32 shuffles by
prefering one permute by vector of indices over two shuffleis with
immediate masks. The benefit is reduction of shuffle port pressure.

Event Timeline

zvi created this revision.Dec 5 2017, 3:21 PM
spatel edited edge metadata.

Is it correct that AVX1-only targets are never affected?

The implicit assumption is that the load of the mask could be hoisted far enough ahead that load latency doesn't get in the way, right? But then we're also increasing register pressure. Ideally, we'd do this later when we have some way to calculate the machine-specific trade-offs. Eg, Ryzen probably doesn't win with this transform because it needs 3 uops to do the vpermps. Do you have perf wins on benchmarks?

ddibyend edited edge metadata.Dec 7 2017, 1:29 AM

vpermps/vpermpd/vpermd have high cost in Ryzen. I see that this patch creates cases where a vperm* is introduced where one did not exist earlier. That may cause slowdowns in Ryzen.

RKSimon edited edge metadata.Dec 7 2017, 2:02 AM

This looks like another case for a scheduler driven shuffle combine, possibly in the MC like D40602, although I'm still not sure if that is late enough to properly account for the increase in register pressure from the shuffle mask load.

zvi added a comment.Dec 7 2017, 6:00 AM

@spatel, this patch is for lowerV8I32VectorShuffle() which won't be called for AVX1-only targets. Would be nice if we could somehow get AVX covered as well, if profitable.
I did not observe any speedups with this patch, but FWIW IACA reports that (for Intel processors, of course) the throughput can be higher even if the load is not hoisted.
What triggered this patch was a case i discovered while working on deprecation of llvm.x86.avx2.permd and llvm.x86.avx2.permps. After trashing these intrinsics that case ends up with:

define <8 x i32> @shuffle_test_vpermd(<8 x i32> %a0) {
   %1 = shufflevector <8 x i32> %a0, <8 x i32> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
   ret <8 x i32> %1
 }

So without this patch we will be generating regressed code for Intel, but faster code for AMD, according to what was posted in the above comments.

@RKSimon, I'm not too familiar with the MachineCombiner. Are there already any shuffle cases that are handled or was that wishful thinking? :)

In D40865#948072, @zvi wrote:

@RKSimon, I'm not too familiar with the MachineCombiner. Are there already any shuffle cases that are handled or was that wishful thinking? :)

Yes, it keeps being proposed but it's a big job, part of the idea behind D40602 was to show how it'd work in principle for a much simpler case (double shifts) than shuffles. The idea would be to perform more aggressive combining to variable shuffles (PSHUFB/VPERMPS etc.) in the MC, so we'd still keep to the '3 shuffles limit' for variable mask folding in DAG lowering as that works better for AMD Jaguar/Bulldozer/Zen and older Intel cores, and then the MC driven by the scheduler models tries again later on. But there's still concerns that there will be plenty of regressions due to register pressure, load latency etc. and whether the code really is port5 bound....

A second (temporary?) option mentioned in D38318 was to add a feature flag for more recent intel cores that reduced the 'AllowVariableMask depth limit' in combineX86ShuffleChain to 2.

zvi added a comment.Dec 11 2017, 11:25 PM
In D40865#948072, @zvi wrote:

@RKSimon, I'm not too familiar with the MachineCombiner. Are there already any shuffle cases that are handled or was that wishful thinking? :)

Yes, it keeps being proposed but it's a big job, part of the idea behind D40602 was to show how it'd work in principle for a much simpler case (double shifts) than shuffles. The idea would be to perform more aggressive combining to variable shuffles (PSHUFB/VPERMPS etc.) in the MC, so we'd still keep to the '3 shuffles limit' for variable mask folding in DAG lowering as that works better for AMD Jaguar/Bulldozer/Zen and older Intel cores, and then the MC driven by the scheduler models tries again later on. But there's still concerns that there will be plenty of regressions due to register pressure, load latency etc. and whether the code really is port5 bound....

A second (temporary?) option mentioned in D38318 was to add a feature flag for more recent intel cores that reduced the 'AllowVariableMask depth limit' in combineX86ShuffleChain to 2.

Thanks for the explanation, Simon. I will update the patch with the feature you proposed as a temporary solution untill the MachineCombiner is ready to handle this case.

@zvi Are you happy with my proposal in D41323?

zvi added a comment.Dec 18 2017, 6:29 AM

@zvi Are you happy with my proposal in D41323?

Yes, thanks. Will rebase this patch after D41323 is landed.

@zvi OK to abandon this now?

@zvi Abandon this? I think the fast-variable-shuffle attribute has this covered.

zvi abandoned this revision.Jan 13 2018, 9:54 AM