This is an archive of the discontinued LLVM Phabricator instance.

[X86][AVX] Generalized matching for target shuffle combines
Closed, Public

Authored by RKSimon on Apr 17 2016, 8:10 AM.

Details

Summary

This patch is a first step towards a more extensible method of matching combined target shuffle masks.

Initially this just pulls out the existing basic mask matches and adds support for some 256/512-bit equivalents. Future patterns will require a number of features to be added, but I wanted to keep this patch simple.

I hope we can avoid duplication between shuffle lowering and combining and share more complex pattern match functions in future commits (permutes + insertps are my next targets).
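As a rough illustration of the kind of helper this framework pulls out (the name, signature and exact patterns below are assumptions for illustration, not the code in the diff), the matcher takes a combined mask plus a domain hint and reports a single target shuffle opcode and type:

  // Hedged sketch only: maps a combined shuffle mask onto one target shuffle.
  // Assumes the usual LLVM X86ISelLowering context (MVT, ArrayRef, X86ISD).
  static bool matchUnaryShuffleSketch(MVT MaskVT, ArrayRef<int> Mask,
                                      bool FloatDomain, unsigned &Shuffle,
                                      MVT &ShuffleVT) {
    unsigned NumElts = Mask.size();

    // v2f64 {0,0} -> MOVDDUP (splat the low double).
    if (FloatDomain && MaskVT.is128BitVector() && NumElts == 2 &&
        Mask[0] == 0 && Mask[1] == 0) {
      Shuffle = X86ISD::MOVDDUP;
      ShuffleVT = MVT::v2f64;
      return true;
    }

    // v4f32 {0,0,2,2} -> MOVSLDUP (duplicate the even lanes).
    if (FloatDomain && MaskVT.is128BitVector() && NumElts == 4 &&
        Mask[0] == 0 && Mask[1] == 0 && Mask[2] == 2 && Mask[3] == 2) {
      Shuffle = X86ISD::MOVSLDUP;
      ShuffleVT = MVT::v4f32;
      return true;
    }

    // 256/512-bit equivalents and future patterns (permutes, broadcasts,
    // insertps) would be added as further cases here.
    return false;
  }

The caller then builds the matched node and bitcasts the result back to the original type (see the element-count discussion later in the review).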

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 54006. Apr 17 2016, 8:10 AM
RKSimon retitled this revision from to [X86][AVX] Generalized matching for target shuffle combines.
RKSimon updated this object.
RKSimon added reviewers: chandlerc, delena, andreadb, spatel.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
RKSimon updated this revision to Diff 56151. May 4 2016, 7:53 AM
RKSimon updated this object.
delena added inline comments. May 8 2016, 10:41 PM
lib/Target/X86/X86ISelLowering.cpp
24007 ↗(On Diff #56151)

Why is the float/int distinction so important? I think the one-cycle penalty for switching domains only existed on old (SSE) processors.

test/CodeGen/X86/vector-shuffle-combining-avx512bw.ll
25 ↗(On Diff #56151)

What happens when the mask is not -1?

RKSimon added inline comments. May 9 2016, 3:03 AM
lib/Target/X86/X86ISelLowering.cpp
24007 ↗(On Diff #56151)

Intel SB/IV, HW/BW + SK still exhibit it (although according to Agner the number of cases is diminishing with each iteration), and the AMD Jaguar / Bulldozer families still have this as well.

http://www.agner.org/optimize/microarchitecture.pdf

In the longer term this may turn into a case where we should move as much of the shuffle combining as possible to the MC pass (as discussed on PR26183) to allow shuffle combines based on the target machine scheduler, but nobody seems keen to take on that job.

test/CodeGen/X86/vector-shuffle-combining-avx512bw.ll
25 ↗(On Diff #56151)

I'll add tests to check what is going on.

RKSimon updated this revision to Diff 56558. May 9 2016, 5:28 AM

Added AVX512 mask tests - the masked operations are still there.

Annoyingly the shuffle comments don't refer to the mask k-register - is there an accepted convention that we could use to make this clearer?

delena edited edge metadata. May 9 2016, 6:05 AM

Added AVX512 mask tests - the masked operations are still there.

Annoyingly the shuffle comments don't refer to the mask k-register - is there an accepted convention that we could use to make this clearer?

We should add the mask register to the printed shuffle comment. Not in this patch, of course.

lib/Target/X86/X86ISelLowering.cpp
24020 ↗(On Diff #56558)

I still think that when you replace a shuffle that loads its indices with a MOVDDUP, the MOVDDUP is better.

24067 ↗(On Diff #56558)

Can you keep the original VT here?

24105 ↗(On Diff #56558)

The unpack operation is not in the FP domain. Again, even if you switch domains, the penalty on old processors is very low.

You may still see a slowdown on some benchmarks, since when switching from load+shuffle to a plain shuffle the work ends up on a different port. It is just FYI.

In D19198#424562, @delena wrote:
We should add the mask register to the printed shuffle comment. Not in this patch, of course.

I'll add a TODO comment to the relevant files.

lib/Target/X86/X86ISelLowering.cpp
24020 ↗(On Diff #56558)

The intention is that those will be dealt with by a permute instruction (PSHUFD for the integer domain, VPERMILPS for the float domain) in a future patch.

This patch is just about trying to set up a 'shuffle matching' framework - later patches will then add support for additional matches (permute / broadcast being early targets). I only added the 256/512-bit versions as it was a minor extension to what's already there for 128-bit.
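For reference, a permute match of that kind could collapse a unary 4-element mask into the imm8 used by PSHUFD (integer domain) or VPERMILPS (float domain). This is only a sketch of the intended future pattern, and the helper name is made up here:

  // Hedged sketch: derive the PSHUFD/VPERMILPI immediate from a unary v4 mask.
  static bool matchPermuteImmSketch(ArrayRef<int> Mask, bool FloatDomain,
                                    unsigned &Shuffle, unsigned &Imm) {
    if (Mask.size() != 4)
      return false;
    Imm = 0;
    for (unsigned i = 0; i != 4; ++i) {
      int M = Mask[i];
      if (M < 0)
        M = i;        // Treat undef lanes as an identity selection.
      if (M > 3)
        return false; // References the second input - not a unary permute.
      Imm |= unsigned(M) << (i * 2);
    }
    Shuffle = FloatDomain ? X86ISD::VPERMILPI : X86ISD::PSHUFD;
    return true;
  }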

24067 ↗(On Diff #56558)

Annoyingly no, but between bitcast combines and target shuffle combines it doesn't seem to cause any problems.

Please see my comments above about UNPCK and FloatDomain.
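To illustrate the VT point (a hedged sketch, not the code in the diff - the function and parameter names are assumptions): after matching at a different legal type, the combine builds the node in that type and bitcasts the result back to the root type so existing uses remain valid.

  // Hedged sketch of combining to MOVDDUP at v2f64 and restoring the root VT.
  static SDValue combineToMovddupSketch(SelectionDAG &DAG, const SDLoc &DL,
                                        SDValue Input, MVT RootVT) {
    SDValue V = DAG.getBitcast(MVT::v2f64, Input);        // reinterpret input
    V = DAG.getNode(X86ISD::MOVDDUP, DL, MVT::v2f64, V);  // splat low double
    return DAG.getBitcast(RootVT, V);                     // restore root type
  }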

lib/Target/X86/X86ISelLowering.cpp
24067 ↗(On Diff #56558)

The type may vary from v16f32 to v16i32, or from v8i64 to v8f64, but the number of vector elements should remain the same. Otherwise we can't put a mask on this operation.

RKSimon updated this revision to Diff 56697. May 10 2016, 5:53 AM
RKSimon edited edge metadata.

Elena - I've added a test demonstrating the MOVDDUP combine affecting a masked v16f32 shuffle. As you said this causes the shuffle and the mask move to be split apart. When is this beneficial? Is there a minimum combine depth that you think we should include for combines of EVEX shuffles where the number of vector elements changes? I can add such a depth check if you think it worthwhile.

Regarding the UNPACK cases, just adding support for integer-domain combines at this stage would cause some existing PSHUFD unary permutes to be replaced with unfoldable UNPACKs with repeated inputs.

As I said previously, this patch is about splitting the matching of shuffles from the combines themselves to reduce code duplication with shuffle lowering. Although I have added 256/512-bit support for some existing shuffles, I intend to add support for new shuffles later on - PSHUFD/VPERMILPS permutes are one of the first cases that will be added (along with BROADCAST).

If you prefer I can change the order of this work - reduce this patch to just pulling out the existing shuffle matching code, and then have future patches add support for matching permutes (and broadcast / insertps / etc.) and 256/512-bit vectors?

Elena - I've added a test demonstrating the MOVDDUP combine affecting a masked v16f32 shuffle. As you said this causes the shuffle and the mask move to be split apart. When is this beneficial? Is there a minimum combine depth that you think we should include for combines of EVEX shuffles where the number of vector elements changes? I can add such a depth check if you think it worthwhile.

I'm not sure that it is beneficial at depth 1, because the sequence "vmovdqa32 + vpermt2ps" allows the vmovdqa32 to be moved up, outside the loop for example, while "vmovddup + vmovaps" are two dependent instructions. Depth 2 and more would need to be measured; I don't know how to estimate it.
I talked with our people about this optimization, and I'd suggest refraining from optimizing masked shuffles on AVX-512 at this stage.

As for non-masked instructions, an additional issue may be load folding. Could you please check what happens with load folding when you change the VT?

if (SrcVT.is256BitVector()) {

The target may be skylake-avx512 (HasVLX), where we still have to check for masks.

If you prefer I can change the order of this work - reduce this patch to just pulling out the existing shuffle matching code, and then have future patches add support for matching permutes (and broadcast / insertps / etc.) and 256/512-bit vectors?

The review will be easier, of course.

RKSimon updated this revision to Diff 56746. May 10 2016, 10:24 AM

Disabled target shuffle combines on 512-bit vectors, or when VLX is enabled, if the combined vector element size differs from that of the root type of the combine, as this would prevent writemasks from being reused. I couldn't find an easy way to detect whether a writemask was actually used, so this restriction may turn out to be too harsh.
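A minimal sketch of the restriction described above (the names and placement are assumptions; the real check sits inside the recursive shuffle combine): on targets where a writemask could be attached, only allow the combined shuffle when its element count matches the root.

  // Hedged sketch: reject combines that would break AVX-512 writemask reuse.
  static bool allowShuffleCombineSketch(MVT RootVT, MVT ShuffleVT,
                                        const X86Subtarget &Subtarget) {
    bool MaskableTarget = Subtarget.hasAVX512() &&
                          (RootVT.is512BitVector() || Subtarget.hasVLX());
    if (!MaskableTarget)
      return true;
    return RootVT.getVectorNumElements() == ShuffleVT.getVectorNumElements();
  }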

Elena - are you happy with this, or would you prefer that I reduce this patch to just splitting out the original code (NFC)?

Ping? Elena, I don't know if you saw it, but I added a check to only combine on AVX512/AVX512VL if the number of vector elements doesn't change.

delena accepted this revision. May 18 2016, 11:32 AM
delena edited edge metadata.

Yes, thank you.

This revision is now accepted and ready to land. May 18 2016, 11:32 AM
This revision was automatically updated to reflect the committed changes.