
[ARM] remove cost-kind predicate for cmp/sel costs

Authored by spatel on Nov 4 2020, 11:30 AM.



This is the cmp/sel sibling to D90692.
Again, the reasoning is: the throughput cost is the number of instructions/uops, so size/blended costs are identical except in special cases (for example, fdiv or other known-expensive machine instructions, or things like MVE that may require cracking into more than one uop).

We need to check for a valid (non-null) condition type parameter because SimplifyCFG may pass nullptr for it (without that check we crash multiple regression tests). I'm not sure if passing nullptr makes sense, but other code in the cost model does appear to check whether that param is set or not.

Diff Detail

Event Timeline

spatel created this revision.Nov 4 2020, 11:30 AM
spatel requested review of this revision.Nov 4 2020, 11:30 AM
samparker added inline comments.Nov 5 2020, 12:38 AM

I know this is a tiny change, but it's drawn my attention because the number is so high. Does this look right to you @dmgreen? I would have thought we were able to break this up more efficiently with our native support, or is this because we'd have to copy the GPR registers into FPRs to perform some final scalar maxes..?

spatel added inline comments.Nov 5 2020, 6:32 AM

Stepping through this, the cost is derived from the BasicTTIImpl calling back to the target for shuffle+cmp+sel:

// Assume the pairwise shuffles add a cost.
ShuffleCost +=
    (IsPairwise + 1) * thisT()->getShuffleCost(TTI::SK_ExtractSubvector,
                                               Ty, NumVecElts, SubTy);
MinMaxCost +=
    thisT()->getCmpSelInstrCost(CmpOpcode, SubTy, CondTy,
                                CmpInst::BAD_ICMP_PREDICATE, CostKind) +
    thisT()->getCmpSelInstrCost(Instruction::Select, SubTy, CondTy,
                                CmpInst::BAD_ICMP_PREDICATE, CostKind);

And this progresses for v16f32 as: 384 -> 480 -> 608 for the shuffles and 8 -> 12 -> 20 for the cmp/sel, so 608 + 20 = 628.
The shuffle cost seems to be expanded as a series of inserts/extracts based on the number of elements in the vector, so it explodes for wide vectors.

dmgreen accepted this revision.Nov 5 2020, 8:44 AM

This looks fine to me, if Sam doesn't disagree. The costs look better to me, and the MVE costs should be our problem not yours :)


Our scalarization overhead is deliberately set very high to protect us against regressions; I think it's actually O(n^2). That means wide vectors end up with very high costs, which was useful early on when we were getting some pretty ropy codegen. We have plans to set it to something more sensible, but every time I try it still gives regressions somewhere. We only have 8 registers, so sometimes these high costs are useful, especially in the vectorizer.

It sounds like we could do with some better SK_ExtractSubvector costs for extracting simple legal vectors, although it may be better to just have custom reduction costs in this specific case, as I don't think that code will model how we actually lower reductions. I will try and look into that at some point, but for the moment they are very deliberately very wrong :)

This revision is now accepted and ready to land.Nov 5 2020, 8:44 AM

and the MVE costs should be our problem not yours :)

Indeed! Thanks both for the explanations.

This revision was automatically updated to reflect the committed changes.