This is an archive of the discontinued LLVM Phabricator instance.

[LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`
Abandoned · Public

Authored by lebedev.ri on Nov 30 2021, 1:11 AM.

Details

Summary

D43208 extracted useEmulatedMaskMemRefHack() from legality into cost model.
What it essentially does is prevent the scalarized vectorization of masked memory operations:

// TODO: Cost model for emulated masked load/store is completely
// broken. This hack guides the cost model to use an artificially
// high enough value to practically disable vectorization with such
// operations, except where previously deployed legality hack allowed
// using very low cost values. This is to avoid regressions coming simply
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.

While I don't really understand what specifically the comment claims is completely broken,
I believe that at least on X86 with AVX2-or-later
this is no longer true (or at least, I would like to know what is still broken).
So I would like to follow suit after D111460 and likewise disable that hack for AVX2+.

But since this was added for X86 specifically, let's just instead completely remove this hack.
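To make the summary concrete, a hypothetical loop of the shape this hack affects might look like the following (this example is mine, not from the review): the store executes only for some elements, so a vectorized version needs a masked store, and targets without native masked-store support must emulate the mask with per-lane branches.

```cpp
#include <cstdint>

// Hypothetical example: the store to a[i] is guarded by a per-element
// condition, so vectorizing this loop requires a masked store (or a
// scalarized, branch-per-lane emulation of one).
void clamp_negatives(int32_t *a, int n) {
  for (int i = 0; i < n; ++i)
    if (a[i] < 0)   // per-element condition => masked memory operation
      a[i] = 0;
}
```

With the hack in place, the cost model assigned such emulated masked operations an artificially huge cost, effectively forbidding vectorization of loops like this on targets without native support.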

Diff Detail

Event Timeline

lebedev.ri created this revision.Nov 30 2021, 1:11 AM
lebedev.ri requested review of this revision.Nov 30 2021, 1:11 AM

Ok, so I'm looking at those optsize.ll/tripcount.ll tests, and I'm not sure what exactly they are testing.
They don't specify the triple/attributes, so what costs do they expect to get?

I think what was happening is that they relied on the fact that opt doesn't default to -march=native,
and no machine supports masked gather/scatter/load/store for those types (i8) in its baseline ISA,
so the useEmulatedMaskMemRefHack() hack would always kick in and the instruction costs (and thus the vectorization cost)
would be bogusly high, and that would prevent them from vectorizing?

I'm not really sure what to do about them. Is my analysis wrong?
Could someone familiar with those tests perhaps comment?

tripcount.ll was added in D32451 and modified in D42946;
cc @twoh, @tejohnson, @davidxl, @mtrofin, @yamauchi

lebedev.ri updated this revision to Diff 391129.Dec 1 2021, 1:45 PM
lebedev.ri edited the summary of this revision. (Show Details)

Hide the problem by defaulting TargetTransformInfoImplBase::useEmulatedMaskMemRefHack() to true;
if no triple is specifically requested, that is the TTI that is used.
This highlights that those tests are somewhat of a lie,
and raises questions about the implementation of those features.

Ok, so I'm looking at those optsize.ll/tripcount.ll tests, and I'm not sure what exactly they are testing.
They don't specify the triple/attributes, so what costs do they expect to get?

I think what was happening is that they relied on the fact that opt doesn't default to -march=native,
and no machine supports masked gather/scatter/load/store for those types (i8) in its baseline ISA,
so the useEmulatedMaskMemRefHack() hack would always kick in and the instruction costs (and thus the vectorization cost)
would be bogusly high, and that would prevent them from vectorizing?

I'm not really sure what to do about them. Is my analysis wrong?
Could someone familiar with those tests perhaps comment?

tripcount.ll was added in D32451 and modified in D42946;
cc @twoh, @tejohnson, @davidxl, @mtrofin, @yamauchi

IIRC, re D42946: the fix in that patch was fundamentally about trip counts not being computed correctly, and the regression tests are variations of the pre-existing cases. I think you are correct; they all rely on there being a large cost to vectorization that makes it profitable only in certain cases (and the way the trip count is computed changes that).

I think the tests should intentionally specify a triple where the cost is high, and maybe we need to replace the instructions with some that have a high cost?

(I'd wait for others to chime in though; it's been a while since D42946, and that was basically my brief encounter with vectorization.)

Make the tests that broke X86-specific.
Ping.

Hm, don't clobber the existing tests though.

As per @fhahn's "someone more familiar with X86 cost modeling should review the claim the cost model is fixed :)"

@RKSimon given all the cost model fixes, I claim that nowadays the relevant parts of
the X86 cost model (at least as of AVX2+) are correct, and the hack is no longer needed.
(D111460 already landed under the same pretense.)

Do you agree with my assessment?

As per @fhahn's "someone more familiar with X86 cost modeling should review the claim the cost model is fixed :)"

Sorry - I haven't been following this ticket much at all. Where did @fhahn say this, and in what context? I can't see it here or in D43208.

As per @fhahn's "someone more familiar with X86 cost modeling should review the claim the cost model is fixed :)"

Sorry - I haven't been following this ticket much at all. Where did @fhahn say this, and in what context? I can't see it here or in D43208.

Sorry, that was an IRC discussion.

lebedev.ri updated this revision to Diff 405177.Feb 2 2022, 2:03 AM
lebedev.ri retitled this revision from [LV][X86] Sink `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()` further into TTI, disable for X86/AVX2+ to [LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`.
lebedev.ri edited the summary of this revision. (Show Details)

Post-branch ping.
I believe we can now proceed? :)

RKSimon accepted this revision.Feb 7 2022, 4:53 AM

LGTM - based on the offline discussion with the various stakeholders, this is the way to go, but please be ready to assist with any regression cases that do come up.

This revision is now accepted and ready to land.Feb 7 2022, 4:53 AM

LGTM - based on the offline discussion with the various stakeholders, this is the way to go, but please be ready to assist with any regression cases that do come up.

Thanks! I'm going to land this now; given that we just branched,
we have optimal headroom for dealing with any fallout here.

This revision was landed with ongoing or failed builds.Feb 7 2022, 5:09 AM
This revision was automatically updated to reflect the committed changes.

Even though the initial intention of the "hack" might have been to prevent predication on some X86 subtargets, it was added in a way that affected all targets, so it could expose cost model issues on various targets if suddenly removed. I wasn't aware of this and just noticed it... Have any performance runs been done on other targets (e.g. Power)?

I'm interested - what kind of performance measuring did you do for this patch, and what were the numbers it gave?

Hi,

Even though the initial intention of the "hack" might have been to prevent predication on some X86 subtargets, it was added in a way that affected all targets, so it could expose cost model issues on various targets if suddenly removed. I wasn't aware of this and just noticed it... Have any performance runs been done on other targets (e.g. Power)?

I'm interested - what kind of performance measuring did you do for this patch, and what were the numbers it gave?

This is a correctness fix. It would not be at all surprising to learn that
some other architecture has started to unintentionally rely on this
erroneous behavior instead of implementing a correct and precise cost model.

If you have identified one such place, please feel free to file a bug,
and optionally ask for a temporary revert until you've had a chance to fix said bug.
Though it would be best to just fix the uncovered issues; it's not like the branch is soon.

OK Cool.

This is a correctness fix. It would not be at all surprising to learn that
some other architecture has started to unintentionally rely on this
erroneous behavior instead of implementing a correct and precise cost model.

Sure. I just think that one of the architectures relying on this, at least in places, is X86. I was honestly surprised this didn't hurt performance in more places; I thought it was still pretty important. Hence my asking what kind of performance testing you had run. But it looks from these very tests that it's still needed for some things, and I got reports of a few more. The X86 cases did seem to be hidden more by other costs, like the cost of a vector mul being 6 :)

Could you point me to the bug report that you opened before temporarily reverting this? :)

OK Cool.

This is a correctness fix. It would not be at all surprising to learn that
some other architecture has started to unintentionally rely on this
erroneous behavior instead of implementing a correct and precise cost model.

Sure. I just think that one of the architectures relying on this, at least in places, is X86.

Could you please be more specific?

I was honestly surprised this didn't hurt performance in more places; I thought it was still pretty important. Hence my asking what kind of performance testing you had run.

But it looks from these very tests that it's still needed for some things

Could you please be more specific? :)

and I got reports of a few more. The X86 cases did seem to be hidden more by other costs, like the cost of a vector mul being 6 :)

lebedev.ri reopened this revision.Feb 9 2022, 12:20 PM
This revision is now accepted and ready to land.Feb 9 2022, 12:20 PM

Yeah sure, will do. But it will have to be tomorrow morning. It's been a very long day :(

I am still interested in what numbers you had for this change though?

bjope added a subscriber: bjope.Feb 10 2022, 12:30 AM

OK. I think the problems fit into four or maybe five different categories.

First up - there were some very large downstream regressions we had which certainly fit into the target-dependent bucket. We were allowing tail folding but no masked gather/scatter, which was causing some very bad codegen from scalarized loads/stores. That was simple enough to fix on Tuesday, though, by disallowing the tail folding in those cases. I would still hope that the cost model would handle it, but at least those problems are no more.

The next two most obvious from the tests here are with optsize and with low trip counts. The optsize tests in LoopVectorize/optsize.ll look both much larger and much slower to me, with all that branching - not something you would ideally want to do at -Os. The tests in LoopVectorize/tripcount.ll also look worse to me, and are similar to one of the reports I got about this patch. A loop with a very low trip count is often not worth vectorizing, especially if it is going to produce very inefficient predicated branching code.
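As a made-up illustration of the low-trip-count concern (this sketch is mine, not one of the actual tests): a loop like the following runs only a handful of iterations, so the fixed overhead of predicated vector code - mask setup, blends, extra branches - can easily outweigh what little work there is to parallelize.

```cpp
#include <cstdint>

// Hypothetical loop in the spirit of the tripcount.ll cases: a trip
// count of 4 plus a guarded store that would need mask emulation when
// vectorized, so the vector setup cost dominates the actual work.
int sum_small(const int32_t *a, int32_t *out) {
  int s = 0;
  for (int i = 0; i < 4; ++i) {  // very low trip count
    if (a[i] != 0)               // predicated store if vectorized
      out[i] = a[i] * 2;
    s += a[i];
  }
  return s;
}
```

With the hack removed, nothing in the generic cost model charges for the predicated branching such a loop produces, so the vectorizer can decide it is profitable even here.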

The code in LoopVectorize/AArch64/tail-fold-uniform-memops.ll also looks worse to me. It's difficult to see why a target with gather/scatter support should choose to use predicated scalarized loads/stores instead, but I haven't looked into the details. Perhaps that one fits into the target-independent cost model going wrong, but whatever it is, it sounds like something that should be fixed before we remove the hack.

Of the other cases I have, one may be similar to the low-trip-count cases. I can't really share the original, but it had a lot of other intrinsic code in it for producing matrix multiplies. There is hopefully a cut-down version here: https://godbolt.org/z/c311Y8j39, where it's difficult to see that the vectorized version will be better with all those broadcasts/blends/branches, even if it is doing more per iteration. Some of these required specific targets for the problem to come up - they could easily be hidden by other costs, like the cost of a VF2 mul being 6 under base x86. That one is under skylake; the original was AArch64, and the performance was apparently up to 160% worse, even with all the other code in the original.

The last case is more straightforward in terms of what is getting predicated, but I'm having trouble at the moment seeing why it isn't a problem for every target. The code has some predicated loads, like this: https://godbolt.org/z/E7PdYrn4T. The vectorization seems a lot worse with so many difficult-to-predict branches.
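From the recipe names in the VPlan (%dct, %bias, %mf, the two predicated loads of mf[i]) one can guess roughly what the source loop looks like; this is my reconstruction, not the actual reproducer. The key point is that both arms of the branch load mf[i], so whichever path a lane takes, there is a load under a per-element condition that must be emulated.

```cpp
#include <cstdint>

// Guessed reconstruction of a quantization-style loop matching the
// VPlan structure: the load of mf[i] sits under a per-element condition
// on dct[i], in both the taken and not-taken arms.
int quantize(int16_t *dct, const uint16_t *bias, const uint16_t *mf) {
  int nz = 0;
  for (int i = 0; i < 16; ++i) {
    if (dct[i] > 0)
      dct[i] = (int16_t)(((bias[i] + dct[i]) * mf[i]) >> 16);   // predicated load of mf[i]
    else
      dct[i] = (int16_t)(-(((bias[i] - dct[i]) * mf[i]) >> 16)); // predicated load of mf[i]
    nz |= dct[i];
  }
  return nz;
}
```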

In that case the vplan it is executing looks like this:

VPlan 'Initial VPlan for VF={2,4},UF>=1' {
Live-in vp<%0> = vector-trip-count

<x1> vector loop: {
  for.body:
    EMIT vp<%1> = CANONICAL-INDUCTION
    WIDEN-INDUCTION %indvars.iv = phi 0, %indvars.iv.next
    WIDEN-REDUCTION-PHI ir<%nz.055> = phi ir<0>, ir<%or>
    CLONE ir<%arrayidx> = getelementptr ir<%dct>, ir<%indvars.iv>
    WIDEN ir<%0> = load ir<%arrayidx>
    WIDEN ir<%conv> = sext ir<%0>
    WIDEN ir<%cmp1> = icmp ir<%0>, ir<0>
    CLONE ir<%arrayidx4> = getelementptr ir<%bias>, ir<%indvars.iv>
    WIDEN ir<%1> = load ir<%arrayidx4>
    WIDEN ir<%conv5> = zext ir<%1>
  Successor(s): if.else

  if.else:
    WIDEN ir<%sub> = sub ir<%conv5>, ir<%conv>
    EMIT vp<%12> = not ir<%cmp1>
  Successor(s): pred.load

  <xVFxUF> pred.load: {
    pred.load.entry:
      BRANCH-ON-MASK vp<%12>
    Successor(s): pred.load.if, pred.load.continue
    CondBit: vp<%12> (if.else)

    pred.load.if:
      REPLICATE ir<%arrayidx22> = getelementptr ir<%mf>, ir<%indvars.iv>
      REPLICATE ir<%4> = load ir<%arrayidx22> (S->V)
    Successor(s): pred.load.continue

    pred.load.continue:
      PHI-PREDICATED-INSTRUCTION vp<%15> = ir<%4>
    No successors
  }
  Successor(s): if.else.0

  if.else.0:
    WIDEN ir<%conv23> = zext vp<%15>
    WIDEN ir<%mul24> = mul ir<%sub>, ir<%conv23>
    WIDEN ir<%5> = lshr ir<%mul24>, ir<16>
    WIDEN ir<%6> = trunc ir<%5>
    WIDEN ir<%conv27> = sub ir<0>, ir<%6>
  Successor(s): if.then

  if.then:
    WIDEN ir<%add> = add ir<%conv5>, ir<%conv>
  Successor(s): pred.load

  <xVFxUF> pred.load: {
    pred.load.entry:
      BRANCH-ON-MASK ir<%cmp1>
    Successor(s): pred.load.if, pred.load.continue
    CondBit: ir<%cmp1>

    pred.load.if:
      REPLICATE ir<%arrayidx10> = getelementptr ir<%mf>, ir<%indvars.iv>
      REPLICATE ir<%2> = load ir<%arrayidx10> (S->V)
    Successor(s): pred.load.continue

    pred.load.continue:
      PHI-PREDICATED-INSTRUCTION vp<%24> = ir<%2>
    No successors
  }
  Successor(s): if.then.0

  if.then.0:
    WIDEN ir<%conv11> = zext vp<%24>
    WIDEN ir<%mul> = mul ir<%add>, ir<%conv11>
    WIDEN ir<%3> = lshr ir<%mul>, ir<16>
    WIDEN ir<%conv12> = trunc ir<%3>
  Successor(s): if.end

  if.end:
    BLEND %storemerge = ir<%conv27>/vp<%12> ir<%conv12>/ir<%cmp1>
    WIDEN store ir<%arrayidx>, ir<%storemerge>
    WIDEN ir<%conv32> = sext ir<%storemerge>
    WIDEN ir<%or> = or ir<%nz.055>, ir<%conv32>
    EMIT vp<%33> = VF * UF +(nuw)  vp<%1>
    EMIT branch-on-count  vp<%33> vp<%0>
  No successors
}

The costs look like:

LV: Found an estimated cost of 0 for VF 4 For instruction:   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %nz.055 = phi i32 [ 0, %entry ], [ %or, %if.end ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx = getelementptr inbounds i16, i16* %dct, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 4 For instruction:   %0 = load i16, i16* %arrayidx, align 2, !tbaa !3
LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv = sext i16 %0 to i32
LV: Found an estimated cost of 1 for VF 4 For instruction:   %cmp1 = icmp sgt i16 %0, 0
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx4 = getelementptr inbounds i16, i16* %bias, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 4 For instruction:   %1 = load i16, i16* %arrayidx4, align 2, !tbaa !3
LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv5 = zext i16 %1 to i32
LV: Found an estimated cost of 4 for VF 4 For instruction:   br i1 %cmp1, label %if.then, label %if.else
LV: Found an estimated cost of 1 for VF 4 For instruction:   %sub = sub nsw i32 %conv5, %conv
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx22 = getelementptr inbounds i16, i16* %mf, i64 %indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   %4 = load i16, i16* %arrayidx22, align 2, !tbaa !3
LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv23 = zext i16 %4 to i32
LV: Found an estimated cost of 2 for VF 4 For instruction:   %mul24 = mul nsw i32 %sub, %conv23
LV: Found an estimated cost of 1 for VF 4 For instruction:   %5 = lshr i32 %mul24, 16
LV: Found an estimated cost of 2 for VF 4 For instruction:   %6 = trunc i32 %5 to i16
LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv27 = sub i16 0, %6
LV: Found an estimated cost of 0 for VF 4 For instruction:   br label %if.end
LV: Found an estimated cost of 1 for VF 4 For instruction:   %add = add nsw i32 %conv5, %conv
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx10 = getelementptr inbounds i16, i16* %mf, i64 %indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   %2 = load i16, i16* %arrayidx10, align 2, !tbaa !3
LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv11 = zext i16 %2 to i32
LV: Found an estimated cost of 2 for VF 4 For instruction:   %mul = mul nsw i32 %add, %conv11
LV: Found an estimated cost of 1 for VF 4 For instruction:   %3 = lshr i32 %mul, 16
LV: Found an estimated cost of 2 for VF 4 For instruction:   %conv12 = trunc i32 %3 to i16
LV: Found an estimated cost of 0 for VF 4 For instruction:   br label %if.end
LV: Found an estimated cost of 1 for VF 4 For instruction:   %storemerge = phi i16 [ %conv27, %if.else ], [ %conv12, %if.then ]
LV: Found an estimated cost of 1 for VF 4 For instruction:   store i16 %storemerge, i16* %arrayidx, align 2, !tbaa !3
LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv32 = sext i16 %storemerge to i32
LV: Found an estimated cost of 1 for VF 4 For instruction:   %or = or i32 %nz.055, %conv32
LV: Found an estimated cost of 1 for VF 4 For instruction:   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond.not = icmp eq i64 %indvars.iv.next, 16
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !7
LV: Vector loop of width 4 costs: 9.
LV: Selecting VF: 4.

So the cost of all that extra branching is accounted for in either the br i1 %cmp1 or the costs of the loads? The cost of any branch is usually 0 in LLVM, but adding this many difficult-to-predict branches to an inner loop is going to cause problems, no matter how good the core is at predicting them. The cost of the predicated scalarized loads comes from getMemInstScalarizationCost, as far as I understand.

That is what I thought this "useEmulatedMaskMemRefHack" was protecting against - the fact that the costs via getMemInstScalarizationCost for predicated loads/stores weren't really good enough, and that the vplan for predication can end up quite different from the original code, while at present the costs are all just added up from the original instructions. I'm surprised this doesn't come up in more cases, to be honest. The cases we had where this was making things worse were not as widespread as I would have imagined, but the -Os vectorization and low-trip-count cases are pretty significant regressions.
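The mechanism being argued over can be sketched with a toy model (all constants here are invented for illustration; the real logic lives in LoopVectorizationCostModel, not in this code): the hack worked by assigning each emulated masked load/store an artificially enormous cost, so the vector cost could essentially never beat the scalar cost.

```cpp
// Toy sketch with made-up numbers showing how inflating the cost of
// emulated masked memory ops flips the profitability decision.
struct LoopCosts {
  unsigned scalarCost;      // estimated cost of the scalar loop body * VF
  unsigned vectorBaseCost;  // vector cost excluding masked memory ops
  unsigned maskedMemOps;    // emulated masked loads/stores in the body
};

unsigned vectorCost(const LoopCosts &L, bool useHack) {
  // 4 is a (made-up) "honest" per-op emulation cost; the hack replaces
  // it with a value so large that vectorization is practically disabled.
  unsigned perOp = useHack ? 3000000u : 4u;
  return L.vectorBaseCost + L.maskedMemOps * perOp;
}

bool shouldVectorize(const LoopCosts &L, bool useHack) {
  return vectorCost(L, useHack) < L.scalarCost;
}
```

The debate above is then: with the hack gone, is the "honest" per-op cost (plus the uncounted cost of the new branches) actually high enough on every target to stop unprofitable vectorization?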

The real best long-term fix for this (one that doesn't introduce other hacks) might be to properly implement a vplan-based cost model in the vectorizer, so that it is really adding up the costs of the things that will be produced by the vectorizer, not trying to guess them from the original instructions. I suspect we may need something to add a cost for all the predicated branching too, although I'm not sure what exactly (what is the cost of a branch? :) ). There will be a point where the benefit of vectorization makes some branching profitable, so it would be great to remove the hack if we can do so.

lebedev.ri abandoned this revision.Oct 18 2022, 5:49 PM