This is an archive of the discontinued LLVM Phabricator instance.

[X86][LV][TTI][Costmodel] LoopVectorizer: don't use `TTI::isLegalMaskedGather()` hook, introduce `TTI::shouldUseMaskedGatherForVectorization()`
AbandonedPublic

Authored by lebedev.ri on Oct 6 2021, 4:56 AM.

Details

Summary

On X86, gather/scatter story is sad. Native support appeared only in AVX2,
but even then, only in Skylake and newer their performance is not abysmal.
Even in Zen3 it's rather bad. So X86 says that masked gather/scatter
are not legal (except for +avx512 || +fast-gather),
and ScalarizeMaskedMemIntrin pass expands them.

But at the same time, we can model the cost of the expanded form
of gather/scatter, via X86TTIImpl::getGatherScatterOpCost(),
and most often it's better than the LV's "scalarization" cost,
but since we say the gather is illegal, LV does not even query it's cost.

I think this is not optimal. I propose to add a new TTI hook,
shouldUseMaskedGatherForVectorization(), which defaults to isLegalMaskedGather(),
but is overrided on X86 to unconditionally return true iff no variable mask is needed
(i.e. the gather/scatter sequence will not require branching).

If this makes sense i can follow up with SLP patch.

Diff Detail

Event Timeline

lebedev.ri created this revision.Oct 6 2021, 4:56 AM
lebedev.ri requested review of this revision.Oct 6 2021, 4:56 AM
lebedev.ri retitled this revision from [X86][LV][TTi][Costmodel] LoopVectorizer: don't use `TTI::isLegalMaskedGather()` hook, introduce `TTI::shouldUseMaskedGatherForVectorization()` to [X86][LV][TTI][Costmodel] LoopVectorizer: don't use `TTI::isLegalMaskedGather()` hook, introduce `TTI::shouldUseMaskedGatherForVectorization()`.Oct 6 2021, 5:04 AM
lebedev.ri planned changes to this revision.Oct 6 2021, 5:24 AM
lebedev.ri added a reviewer: delena.

Hm, actually, i don't think X86TTIImpl::getGSScalarCost() is right, let's fix that first.

lebedev.ri updated this revision to Diff 377563.Oct 6 2021, 8:50 AM

Rebased ontop of D111222.

Matt added a subscriber: Matt.Oct 6 2021, 9:12 AM

Was "never" supposed to be a different word in "only in Skylake and never their performance is not abysmal."?

lebedev.ri edited the summary of this revision. (Show Details)Oct 6 2021, 9:16 AM

Was "never" supposed to be a different word in "only in Skylake and never their performance is not abysmal."?

Right, "newer".

RKSimon added inline comments.Oct 6 2021, 12:19 PM
llvm/test/Transforms/LoopVectorize/X86/gather-cost.ll
2

pre-commit the regeneration?

17–18

Does this pass with this still here?

llvm/test/Transforms/LoopVectorize/X86/parallel-loops.ll
2

pre-commit?

lebedev.ri updated this revision to Diff 377664.Oct 6 2021, 1:25 PM
lebedev.ri marked 3 inline comments as done.

Rebased over precommitted check line regeneration.

How good is the gather/scatter scalarization at splitting vectorized geps? Doing pointer math on the vector unit (probably using a single splatted base pointer) and then having to extract every pointer is going to be pretty poor compared to scalarizing and using LEA op on x86.

How good is the gather/scatter scalarization at splitting vectorized geps? Doing pointer math on the vector unit (probably using a single splatted base pointer) and then having to extract every pointer is going to be pretty poor compared to scalarizing and using LEA op on x86.

Hmm it's not: https://godbolt.org/z/eYzP4ob3G
I guess we might want to:

  1. teach vectorcombine to split vector geps iff all their uses are extracts
  2. schedule additional run of vectorcombine after the scalarizer

But that doesn't affect cost model though, i believe.

How good is the gather/scatter scalarization at splitting vectorized geps? Doing pointer math on the vector unit (probably using a single splatted base pointer) and then having to extract every pointer is going to be pretty poor compared to scalarizing and using LEA op on x86.

Hmm it's not: https://godbolt.org/z/eYzP4ob3G
I guess we might want to:

  1. teach vectorcombine to split vector geps iff all their uses are extracts
  2. schedule additional run of vectorcombine after the scalarizer

But that doesn't affect cost model though, i believe.

Ok, D111363 should deal with most of it,
but i'm not sure it's a hard blocker here.

lebedev.ri planned changes to this revision.Oct 8 2021, 1:00 PM

I think we won't need this after all, i've found the real problem - D111460.