This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Add cpu feature for aggressive combining to variable shuffles
ClosedPublic

Authored by RKSimon on Dec 16 2017, 8:27 AM.

Details

Summary

As mentioned in D38318 and D40865, modern Intel processors prefer to combine multiple shuffles to a variable shuffle mask (PSHUFB/VPERMPS etc.) instead of having multiple stage 'fixed' shuffles which put more pressure on Port 5 (at the expense of extra shuffle mask loads).

As discussed, this patch provides a FeatureFastVariableShuffle target flag for Haswell+ CPUs that prefers combining 2 or more fixed shuffles to a single variable shuffle (default is 3 shuffles).

If everybody is happy with this approach I will refactor some of the vector-shuffle-* tests to run with -fast-variable-shuffle enabled to compare shuffles.

The long term aim is to drive more of this from schedule data (probably via the MC) but we're not close to being ready for that yet.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Dec 16 2017, 8:27 AM
spatel accepted this revision.Dec 17 2017, 8:27 AM

LGTM.

The only question is whether we prefer a 'Fast' feature or a 'Slow' feature to distinguish these cases. I don't have a preference, so you might want to wait for more feedback before committing.

This revision is now accepted and ready to land.Dec 17 2017, 8:27 AM
zvi added a comment.Dec 18 2017, 6:26 AM

LGTM. Thanks

zvi requested changes to this revision.Dec 18 2017, 6:33 AM

On second thought, i think we need to update the tests with a -mattr=+fast-variable-shuffle configuration, right?

This revision now requires changes to proceed.Dec 18 2017, 6:33 AM
In D41323#958329, @zvi wrote:

On second thought, i think we need to update the tests with a -mattr=+fast-variable-shuffle configuration, right?

Yup, that's what I'm working on now.

RKSimon updated this revision to Diff 127366.Dec 18 2017, 8:15 AM

Added slow/fast passes for AVX2/AVX512 shuffle lowering tests - did this for v8i16/v16i8 and all 256-bit types as these have useful tests to show the diffs. I can add them for the others if you wish but these will mostly not cause any codegen diffs.

zvi added a comment.Dec 18 2017, 9:58 AM

Here's the full list of tests that are affected by setting AllowVariableMask for Depth=2. I think that we should have the full list covered with the new configuration.
I would be happy to assist with the work involved.

CodeGen/X86/avx-intrinsics-fast-isel.ll
CodeGen/X86/avx-splat.ll
CodeGen/X86/avx2-conversions.ll
CodeGen/X86/avx2-vector-shifts.ll
CodeGen/X86/avx512-shuffles/broadcast-vector-int.ll
CodeGen/X86/avx512-shuffles/partial_permute.ll
CodeGen/X86/avx512-trunc.ll
CodeGen/X86/bitcast-and-setcc-512.ll
CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
CodeGen/X86/bitcast-int-to-vector-bool.ll
CodeGen/X86/cast-vsel.ll
CodeGen/X86/combine-or.ll
CodeGen/X86/combine-shl.ll
CodeGen/X86/combine-sra.ll
CodeGen/X86/combine-srl.ll
CodeGen/X86/insertelement-duplicates.ll
CodeGen/X86/insertelement-zero.ll
CodeGen/X86/known-signbits-vector.ll
CodeGen/X86/merge-consecutive-loads-128.ll
CodeGen/X86/oddshuffles.ll
CodeGen/X86/psubus.ll
CodeGen/X86/reduce-trunc-shl.ll
CodeGen/X86/select.ll
CodeGen/X86/shuffle-of-splat-multiuses.ll
CodeGen/X86/shuffle-strided-with-offset-128.ll
CodeGen/X86/shuffle-strided-with-offset-256.ll
CodeGen/X86/shuffle-strided-with-offset-512.ll
CodeGen/X86/shuffle-vs-trunc-128.ll
CodeGen/X86/shuffle-vs-trunc-256.ll
CodeGen/X86/shuffle-vs-trunc-512.ll
CodeGen/X86/sse41.ll
CodeGen/X86/swizzle-2.ll
CodeGen/X86/trunc-ext-ld-st.ll
CodeGen/X86/vec_cast2.ll
CodeGen/X86/vec_insert-5.ll
CodeGen/X86/vec_insert-mmx.ll
CodeGen/X86/vec_set-3.ll
CodeGen/X86/vec_trunc_sext.ll
CodeGen/X86/vector-compare-results.ll
CodeGen/X86/vector-half-conversions.ll
CodeGen/X86/vector-rotate-128.ll
CodeGen/X86/vector-rotate-256.ll
CodeGen/X86/vector-sext.ll
CodeGen/X86/vector-shuffle-128-v16.ll
CodeGen/X86/vector-shuffle-128-v4.ll
CodeGen/X86/vector-shuffle-128-v8.ll
In D41323#958558, @zvi wrote:

Here's the full list of tests that are affected by setting AllowVariableMask for Depth=2. I think that we should have the full list covered with the new configuration.
I would be happy to assist with the work involved.

Sure - as long as we're testing with the fast and slow cases for them all. Are you happy with this patch with its tests changes as it is and you just update the remaining tests as followups?

zvi accepted this revision.Dec 18 2017, 2:25 PM
In D41323#958558, @zvi wrote:

Here's the full list of tests that are affected by setting AllowVariableMask for Depth=2. I think that we should have the full list covered with the new configuration.
I would be happy to assist with the work involved.

Sure - as long as we're testing with the fast and slow cases for them all. Are you happy with this patch with its tests changes as it is and you just update the remaining tests as followups?

Ok, let's do that.

This revision is now accepted and ready to land.Dec 18 2017, 2:25 PM
This revision was automatically updated to reflect the committed changes.
zvi added a comment.Dec 20 2017, 2:12 AM

Since all known processors with AVX512 will prefer this new feature turned on, can we make AVX512 imply Fast-var-shuffles?

In D41323#960589, @zvi wrote:

Since all known processors with AVX512 will prefer this new feature turned on, can we make AVX512 imply Fast-var-shuffles?

I've created D41436 - the main issue is whether KNL prefers variable shuffles the same as SkylakeServer

zvi added a comment.Dec 20 2017, 2:51 PM

I've created D41436 - the main issue is whether KNL prefers variable shuffles the same as SkylakeServer

The Intel Optimization Reference Manual is the only source of information i have for KNL. It says that of the two core's VPUs, one can execute shuffles. It says that single-source shuffles have a reciprocal throughput of 1 cycle, dual-source shuffles have a reciprocal throughput of 2 cycles. So i don't have any information indicating that KNL differs from SKX in preference of variable-shuffles.