This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add general shuffle combining support
ClosedPublic

Authored by RKSimon on Aug 9 2019, 4:13 AM.

Details

Summary

This patch uses partial DemandedElts masks to further simplify target shuffle chains and finally starts making target shuffle combining part of SimplifyDemandedBits/SimplifyDemandedVectorElts.

We already manage this for Depth == 0 cases, where combineX86ShuffleChain would early-out if the shuffle combined to the same op, but the patch generalizes this by manipulating the depth handling of combineX86ShufflesRecursively - calling with a new Depth = 0 and reducing the maximum shuffle combine depth accordingly.

Diff Detail

Event Timeline

RKSimon created this revision.Aug 9 2019, 4:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 9 2019, 4:13 AM

Can this patch solve bad codegen for 'f5'?

https://godbolt.org/z/3YpVg-

Can this patch solve bad codegen for 'f5'?

https://godbolt.org/z/3YpVg-

I don't think any amount of shuffle combining is going to recover that. I think we need to look at lowerShuffleAsLanePermuteAndRepeatedMask

Can this patch solve bad codegen for 'f5'?

https://godbolt.org/z/3YpVg-

I don't think any amount of shuffle combining is going to recover that. I think we need to look at lowerShuffleAsLanePermuteAndRepeatedMask

Nevermind, that won't fix it. The two lanes have different controls for the shufpd in gcc's code. I think we need a new strategy.

They improved f1 and f2 cases to save one instruction wrt LLVM’s codegen

https://gcc.gnu.org/ml/gcc-patches/2019-08/msg01952.html

RKSimon planned changes to this revision.Oct 12 2019, 8:18 AM

WIP - PR27854 and PR43024 need to be finished first.

RKSimon updated this revision to Diff 227691.Nov 4 2019, 4:32 AM

rebase - still work to do to improve unpck vs insertps selection

RKSimon planned changes to this revision.Nov 4 2019, 5:09 AM
RKSimon updated this revision to Diff 239376.Jan 21 2020, 10:39 AM

rebase (still looking at this)

Status of this patch?

Status of this patch?

I'm still looking at this - it both affects and is affected by so much code its a yak shaving nightmare to handle it all.

The big remaining issue is the loss of INSERTPS for some BUILDVECTOR patterns, which means we lose load folding on SSE41+ targets.

Additionally, we need to stop creating nodes on the fly inside combineX86ShufflesRecursively (see PR45974) as this screws up hasOneUse checks which are often vital in SimplifyDemandedBits/SimplifyDemandedVectorElts.

RKSimon updated this revision to Diff 272307.Jun 21 2020, 6:28 AM

rebased - the variable shuffle mask simplification has helped but theres still some work to do - next is to fix the movhlps(x,movss) vs insertps regression

RKSimon updated this revision to Diff 272316.Jun 21 2020, 10:47 AM

fix movhlps(x,scalar_to_vector(y)) -> insertps handling

xbolva00 added inline comments.Jun 21 2020, 12:34 PM
llvm/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll
5637 ↗(On Diff #272316)

Regression

llvm/test/CodeGen/X86/buildvec-insertvec.ll
408

Extra instructio (vs sse2)

llvm/test/CodeGen/X86/masked_expandload.ll
1302

Regression

llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
559–565

Worse

648–654

Ouch

cheers @xbolva00 for what its worth, this is still a WIP, and SimplifyDemanded improvements do have a tendency to expose existing limitations (as well as find some awesome improvements.....)

llvm/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll
5637 ↗(On Diff #272316)

The original vzext_movl (which isel would consume into the movd) has been replaced with a zero_extend_vector_inreg.

llvm/test/CodeGen/X86/buildvec-insertvec.ll
408

The original vzext_movl (which isel would consume into the movd) has been replaced with a zero_extend_vector_inreg.

llvm/test/CodeGen/X86/masked_expandload.ll
1302

We've managed to change the order of load combines and we prematurely end up with different element sizes which we then can't combine together later. This is an existing problem, we've just exposed it more.

llvm/test/CodeGen/X86/shuffle-vs-trunc-512.ll
559–565

still looking at these regressions - but we've always been very weak at shuffle combining different src/dst vector widths.

RKSimon updated this revision to Diff 275153.Jul 2 2020, 9:53 AM
RKSimon edited the summary of this revision. (Show Details)

Fixed the vpmovzx regressions

RKSimon updated this revision to Diff 286284.Aug 18 2020, 8:18 AM
RKSimon retitled this revision from [WIP][X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add general shuffle combining support to [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add general shuffle combining support.
RKSimon edited the summary of this revision. (Show Details)

All regressions should now be covered.

Amazing work Simon!

ping - I think this is ready for commital now, I delayed pinging as I knew I'd be away on holiday this week and a patch like this is likely to cause a few unexpected regressions that will need addressing.

This revision is now accepted and ready to land.Aug 28 2020, 3:32 PM
This revision was landed with ongoing or failed builds.Sep 2 2020, 1:25 AM
This revision was automatically updated to reflect the committed changes.