This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] try to form a better extractelement
ClosedPublic

Authored by spatel on Mar 23 2020, 9:29 AM.

Details

Summary

Extracting to the same index that we are going to insert back into allows forming select ("blend") shuffles and enables further transforms.

Admittedly, this is a quick-fix for a more general problem that I'm hoping to solve by adding transforms for patterns that start with an insertelement.
But this might resolve some regressions known to be caused by the extract-extract transform (although I have not gotten more details on those yet).

In the motivating case from PR34724:
https://bugs.llvm.org/show_bug.cgi?id=34724

The combination of subsequent instcombine and codegen transforms gets us this improvement:

vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
vhaddps	%xmm1, %xmm1, %xmm4
vmovshdup	%xmm1, %xmm3    ## xmm3 = xmm1[1,1,3,3]
vaddps	%xmm0, %xmm2, %xmm0
vaddps	%xmm1, %xmm3, %xmm1
vshufps	$200, %xmm4, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm4[0,3]
vinsertps	$177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]

-->

vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
vhaddps	%xmm1, %xmm1, %xmm1
vaddps	%xmm0, %xmm2, %xmm0
vshufps	$200, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm1[0,3]

Diff Detail

Event Timeline

spatel created this revision.Mar 23 2020, 9:29 AM

While this looks good in principle, i'm starting to have second thoughts about the general design here..

Since this doesn't affect the calculated cost, i wonder if it would be better to go more generic,
and instead implement some kind of lane reordering?

While this looks good in principle, i'm starting to have second thoughts about the general design here..

Since this doesn't affect the calculated cost, i wonder if it would be better to go more generic,
and instead implement some kind of lane reordering?

Hmm...I'm not seeing what a more generic solution would look like. Do you have an example/suggestion?
If we implement the larger match that starts from an insertelement, I'm imagining this becomes something like InstCombine's canEvaluateZExtd() or AggressiveInstCombine's TruncInstCombine. So we have:
insertelement (some string of scalar ops that are seeded by extractelement and/or constants) --> series of vector ops with extract/insert removed
We could build that up in steps, so start with the simplest case of: insert undef, (binop (extract X), C) --> vector binop X, vecC

RKSimon accepted this revision.Apr 2 2020, 2:15 PM

LGTM

This revision is now accepted and ready to land.Apr 2 2020, 2:15 PM
lebedev.ri accepted this revision.Apr 2 2020, 2:59 PM

LG for now.

While this looks good in principle, i'm starting to have second thoughts about the general design here..

Since this doesn't affect the calculated cost, i wonder if it would be better to go more generic,
and instead implement some kind of lane reordering?

Hmm...I'm not seeing what a more generic solution would look like. Do you have an example/suggestion?
If we implement the larger match that starts from an insertelement, I'm imagining this becomes something like InstCombine's canEvaluateZExtd() or AggressiveInstCombine's TruncInstCombine. So we have:
insertelement (some string of scalar ops that are seeded by extractelement and/or constants) --> series of vector ops with extract/insert removed
We could build that up in steps, so start with the simplest case of: insert undef, (binop (extract X), C) --> vector binop X, vecC

Yes, that is the general idea, if we have insert-of-one-use-extract
(or, all uses are inserts into same lane?), try to make extraction
to be from the same lane. Though there is a second potential fold here:
start from extractelement, and try to minimize it's cost by moving it into lower lanes.

It does sound involved, i'm not sure if it will be worth it in the end.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 3 2020, 11:20 AM