This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] fold extract-extract-op with different extraction indexes
ClosedPublic

Authored by spatel on Mar 5 2020, 8:56 AM.

Details

Summary

opcode (extelt V0, Ext0), (extelt V1, Ext1) --> extelt (opcode (splat V0, Ext0), V1), Ext1

The first part of this patch generalizes the cost calculation to accept different extraction indexes. The second part creates a shuffle+extract before feeding into the existing code to create a vector op+extract.
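For example (an illustrative sketch - the <4 x float> type, fadd opcode, and extraction indexes are hypothetical, not taken from the regression tests):

; before
%e0 = extractelement <4 x float> %x, i32 1
%e1 = extractelement <4 x float> %y, i32 3
%r = fadd float %e0, %e1

; after: splat element 1 of %x, do the op as a vector op, extract lane 3
%xs = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
%v = fadd <4 x float> %xs, %y
%r = extractelement <4 x float> %v, i32 3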

The patch conservatively uses "TargetTransformInfo::SK_PermuteSingleSrc" rather than "TargetTransformInfo::SK_Broadcast" (splat specifically from element 0) because we do not have a more general "SK_Splat" currently. That does not affect any of the current regression tests, but we might be able to find some cost model target specialization where that comes into play.
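In IR terms, the distinction is roughly the following (illustrative <4 x float> example):

; SK_Broadcast: splat of element 0
%b = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> zeroinitializer
; splat of a non-zero element: no dedicated shuffle kind yet, so this patch conservatively costs it as SK_PermuteSingleSrc
%s = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 2, i32 2, i32 2, i32 2>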

I suspect that we can expose some missing x86 horizontal op codegen with this transform, so I'm speculatively adding a debug flag to disable the binop variant of this transform to allow easier testing.

The test changes show that we're sensitive to cost model diffs (as we should be), which means that patches like D74976 should have better coverage.

Diff Detail

Event Timeline

spatel created this revision. Mar 5 2020, 8:56 AM
Herald added a project: Restricted Project. Mar 5 2020, 8:56 AM

getShuffleCost already takes an Index value (default = 0) that I meant to use for this kind of thing (it's currently just used for subvector insertion/extraction), so either 'tweaking' SK_Broadcast to use Index or replacing it with a SK_Splat mode would be relatively painless.

Taking a step back, we have

; 0. scalar
%e0 = extractelement %x, C0
%e1 = extractelement %y, C1
%r = op %e0, %e1

Let's focus on the C0 != C1 problem; I see three general approaches there:

  1. move either one lane to another lane
    1. move lane C0 to lane C1
    2. move lane C1 to lane C0
  2. move both lanes into lane 0

In the current solution, why do we only entertain the idea of replacing the costlier extract?
Don't we want to check all three variants? (There's some overlap if one of C0, C1 is already lane 0.)
What if the other extract is from lane 0? We'd then be able to use TTI::SK_Broadcast.
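For illustration, variant 1.2 (move lane C1 to lane C0) could look roughly like this, assuming <4 x float> operands, fadd as the op, C0 = 1, C1 = 3:

; 1.2. move lane C1 of %y to lane C0, do the op as a vector op, extract lane C0
%ys = shufflevector <4 x float> %y, <4 x float> undef, <4 x i32> <i32 undef, i32 3, i32 undef, i32 undef>
%v = fadd <4 x float> %x, %ys
%r = extractelement <4 x float> %v, i32 1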

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
120

s/1/one/, else it's ambiguous with 'first'

Taking a step back, we have

; 0. scalar
%e0 = extractelement %x, C0
%e1 = extractelement %y, C1
%r = op %e0, %e1

Let's focus on the C0 != C1 problem; I see three general approaches there:

  1. move either one lane to another lane
    1. move lane C0 to lane C1
    2. move lane C1 to lane C0
  2. move both lanes into lane 0

In the current solution, why do we only entertain the idea of replacing the costlier extract?
Don't we want to check all three variants? (There's some overlap if one of C0, C1 is already lane 0.)
What if the other extract is from lane 0? We'd then be able to use TTI::SK_Broadcast.

Correct - I thought about reversing the operands if it would allow the shuffle to be a broadcast, but didn't try to implement it. My guess is that won't change the costs for the examples shown here (since they don't change even if we incorrectly specify that the shuffle in the current patch is a broadcast).

Similarly, the option where we shuffle both operands to element 0 has the potential to be the winner, but I'm skeptical that it would be beneficial on most x86 examples. So I agree that if we want the optimal sequence, we need to model all of those possibilities - by not modeling all of them, we may be missing an optimization. We could also argue that we're still in IR, so we should select one of those forms as semi-canonical and let the backend handle the other transforms.
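For reference, the 'shuffle both operands to element 0' form would be roughly (same hypothetical <4 x float>/fadd example, C0 = 1, C1 = 3):

; 2. move both lanes into lane 0, do the op as a vector op, extract lane 0
%xs = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%ys = shufflevector <4 x float> %y, <4 x float> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
%v = fadd <4 x float> %xs, %ys
%r = extractelement <4 x float> %v, i32 0

That is two shuffles instead of one, but the extract from lane 0 is typically the cheapest, which is why it could win in some cases.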

OK if I add a TODO comment for now?

lebedev.ri added inline comments. Mar 6 2020, 3:19 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
119

So essentially we are assuming that this is the optimal cost model.
Can we assert that?

int ExpensiveExtractCost = std::max(Extract0Cost, Extract1Cost);
(void)ExpensiveExtractCost;

// FIXME: introduce TargetTransformInfo::SK_Splat.
assert(TTI.getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy) +
           CheapExtractCost <=
       TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy) +
           ExpensiveExtractCost);
lebedev.ri accepted this revision. Mar 6 2020, 4:12 AM
lebedev.ri marked an inline comment as done.

OK, after looking at this more, I think any further improvements (less pessimistic cost computations)
will look really awkward without first abstracting away the splat shuffle cost,
so it would be okay to add FIXMEs for now.

@RKSimon thoughts?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
89–90

// FIXME: evaluate whether that always results in lowest cost

119

Err, that won't really do it I guess, so let's not do that.

This revision is now accepted and ready to land. Mar 6 2020, 4:12 AM
RKSimon accepted this revision. Mar 6 2020, 6:10 AM

LGTM - You might want to add some test cases with 256/512-bit vectors (with indices coming from the same/different 128-bit subvectors) as well to get a broader range of extraction costs.

I'm not certain a SK_Splat will make much difference - shuffle masks with a single non-undef index would in most cases be a lot cheaper even than that, especially on early SSE targets. So maybe we need to either consider another shuffle enum type or just provide the ability to compute costs from raw shuffle masks?
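For reference, a mask with a single non-undef index means something like this (hypothetical example):

; only lane 0 of the result is defined
%s = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>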

spatel added a comment. Mar 6 2020, 8:13 AM

LGTM - You might want to add some test cases with 256/512-bit vectors (with indices coming from the same/different 128-bit subvectors) as well to get a broader range of extraction costs.

Yes, will add some larger vector tests.

I'm not certain a SK_Splat will make much difference - shuffle masks with a single non-undef index would in most cases be a lot cheaper even than that, especially on early SSE targets. So maybe we need to either consider another shuffle enum type or just provide the ability to compute costs from raw shuffle masks?

There's no clear answer on how far we should take the modeling. We've said in the past that we shouldn't try that hard -- because this is IR, we can never know exactly how it will be lowered for every target.
The idea of trying different options and picking a winner sounds like VPlan to me (I haven't kept up with its current state, so I need to review that).

This revision was automatically updated to reflect the committed changes.
spatel marked 3 inline comments as done.
jgorbe added a subscriber: jgorbe. Mar 16 2020, 6:02 PM

Hi,

We're seeing some large performance regressions on Eigen (http://eigen.tuxfamily.org/) running on Haswell and Skylake machines with this patch. Could you please revert it?

Hi,

We're seeing some large performance regressions on Eigen (http://eigen.tuxfamily.org/) running on Haswell and Skylake machines with this patch.

Those are being compiled with the correct -march= for the machine that actually runs benchmarks, right?
Can you be more specific? Reproduction steps (which benchmark specifically), perhaps some regressed snippets?

Could you please revert it?

Could you please revert it?

I'd prefer more specific reproduction steps before we make changes here. Ideally, please file a bug report with a reduced example.
Also, as noted in the description/commit message, I expected that this patch could cause perf problems, so there are debug options to help narrow those down:

{-mllvm } -disable-binop-extract-shuffle / -disable-vector-combine

If we can use those (or even add more specific flags) to temporarily disable functionality while investigating, that would be more efficient than reverting.

Thanks for the note about the flags. We have tried both of them, and either (or both) of them being present causes the regression to go away, so that unblocks us while we work on producing a proper test case.

Thanks for the note about the flags. We have tried both of them, and either (or both) of them being present causes the regression to go away, so that unblocks us while we work on producing a proper test case.

Sounds good. Note that "-disable-vector-combine" is the big hammer; it effectively disables this entire pass.
If the regressions only appeared with this patch and go away with "-disable-binop-extract-shuffle", the most likely source of the problem is either the cost model or a bad interaction between this pass and the SLP vectorizer.