Page MenuHomePhabricator

[LV][X86] Sink `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()` further into TTI, disable for X86/AVX2+
Needs ReviewPublic

Authored by lebedev.ri on Nov 30 2021, 1:11 AM.

Details

Summary

D43208 extracted useEmulatedMaskMemRefHack() from legality into cost model.
What it essentially does is prevents scalarized vectorization of masked memory operations:

// TODO: Cost model for emulated masked load/store is completely
// broken. This hack guides the cost model to use an artificially
// high enough value to practically disable vectorization with such
// operations, except where previously deployed legality hack allowed
// using very low cost values. This is to avoid regressions coming simply
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.

While i don't really understand about what specifically is completely broken
was talking about, i believe that at least on X86 with AVX2-or-later,
this is no longer true. (or at least, i would like to know what is still broken).
So i would like to follow suit after D111460, and like wise disable that hack for AVX2+.

This adds a similarly-named TTI hook, which is default-true,
but returns false for X86-with-AVX2.

Diff Detail

Unit TestsFailed

TimeTest
80 msx64 debian > LLVM.Bindings/Go::go.test
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llvm-go go=/usr/bin/go test llvm.org/llvm/bindings/go/llvm

Event Timeline

lebedev.ri created this revision.Nov 30 2021, 1:11 AM
lebedev.ri requested review of this revision.Nov 30 2021, 1:11 AM

Ok, so i'm looking at those optsize.ll/tripcount.ll tests, and i'm not sure what exactly they are testing.
They don't specify the triple/attributes, so what costs do they expect to get?

I think what was happening is that they relied on the fact that opt doesn't default to -march=native
and for no machine there is a support for masked gather/scatter/load/store for those types (i8) in baseline ISA,
so useEmulatedMaskMemRefHack() hack would always happen and the instructions(and thus the vectorization cost)
would be bogusly high, and that would prevent them from vectorizing?

I'm not really sure what to do about them. Is my analysis wrong?
Can perhaps someone familiar with those tests comment?

tripcount.ll was added in D32451 and modified in D42946;
cc @twoh, @tejohnson, @davidxl, @mtrofin, @yamauchi

lebedev.ri updated this revision to Diff 391129.Dec 1 2021, 1:45 PM
lebedev.ri edited the summary of this revision. (Show Details)

Hide the problem by defaulting TargetTransformInfoImplBase::useEmulatedMaskMemRefHack() to true.
If no triple is specifically requested, that is the TTI that is used.
This highlights that those tests are somewhat of a lie,
and raises questions about the implementation of those features.

Ok, so i'm looking at those optsize.ll/tripcount.ll tests, and i'm not sure what exactly they are testing.
They don't specify the triple/attributes, so what costs do they expect to get?

I think what was happening is that they relied on the fact that opt doesn't default to -march=native
and for no machine there is a support for masked gather/scatter/load/store for those types (i8) in baseline ISA,
so useEmulatedMaskMemRefHack() hack would always happen and the instructions(and thus the vectorization cost)
would be bogusly high, and that would prevent them from vectorizing?

I'm not really sure what to do about them. Is my analysis wrong?
Can perhaps someone familiar with those tests comment?

tripcount.ll was added in D32451 and modified in D42946;
cc @twoh, @tejohnson, @davidxl, @mtrofin, @yamauchi

IIRC, re D42946, the fix in that patch was fundamentally about trip counts not being computed correctly, and then the regression tests are variations of the pre-existing cases. I think you are correct, they all rely on there being a large cost to vectorization that makes it profitable only in certain cases (and the way the trip count is computed changes that)

I think the test should intentionally specify a triple, where the cost is high, and maybe we need replace the instructions with some with a high cost?

(I'd wait for others to chime in though, it's been a while since D42946 and that was basically my brief encounter with vectorization)

Make tests that broke X86-specific.
Ping.

Hm, don't clobber the existing tests though.

As per @fhahn's "someone more familiar with X86 cost modeling should review the claim the cost model is fixed :)"

@RKSimon given all the costmodel fixes, i claim that nowadays the relevant parts of
the X86 costmodel (at least as of AVX2+) are correct, and the hack is no longer needed.
(D111460 already landed under the same pretense.)

Do you agree with my assessment?

As per @fhahn's "someone more familiar with X86 cost modeling should review the claim the cost model is fixed :)"

Sorry - I haven't been following this ticket much at all. Where did @fhahn says this and in what context? I can't see that here or D43208

As per @fhahn's "someone more familiar with X86 cost modeling should review the claim the cost model is fixed :)"

Sorry - I haven't been following this ticket much at all. Where did @fhahn says this and in what context? I can't see that here or D43208

Sorry, that was an IRC disscussion.