Page MenuHomePhabricator

[LoopVectorize][CostModel] Update cost model for fmuladd intrinsic

Authored by RosieSumpter on Oct 12 2021, 4:25 AM.



This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.

Depends on D111555

Diff Detail

Event Timeline

RosieSumpter created this revision.Oct 12 2021, 4:25 AM
RosieSumpter requested review of this revision.Oct 12 2021, 4:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 12 2021, 4:25 AM
fhahn added a subscriber: fhahn.Oct 12 2021, 4:27 AM
fhahn added inline comments.

Could this be handled in TTI.getArithmeticReductionCost? What about other potential users of fmuladd reductions, like the SLP vectorizer?

RosieSumpter added inline comments.Oct 12 2021, 6:01 AM

Hi Florian, I did discuss this option with @david-arm, but it would mean changing the interface of getArithmeticReductionCost (e.g. by adding an optional Instruction* argument) to be able to determine if it's a call to the fmuladd intrinsic. David also made the point that the fmul isn't actually part of the reduction cost, so perhaps it doesn't make sense to ask for an fmuladd reduction cost? If you would prefer it to be there though I'm happy to make the change.

For the SLP vectorizer, it doesn't handle the fmuladd at the moment (I've added an assert for this in D111555 for safety)

Created getFMulAddReductionCost. This means the code which calculates the total cost of the fmuladd is moved out of the vectorizer, but avoids changing the interface for getArithmeticReductionCost.

david-arm added inline comments.Oct 15 2021, 4:13 AM
398 ↗(On Diff #379740)

Hi @RosieSumpter, do you know why these CHECK lines have changed? It doesn't seem like your patch should affect these tests because these loops are forced to use a certain VF anyway.

RosieSumpter added inline comments.Oct 15 2021, 5:47 AM
398 ↗(On Diff #379740)

Hi @david-arm, good point. The reason the test has changed is because of adding FMulAdd as an allowed recurrence kind to AArch64TTIImpl::isLegalToVectorizeReduction. I now see that it probably makes more sense for this particular change to be in D111555, so I'll do that now.

  • Moved the addition of FMulAdd as a recurrence kind in AArch64TTIImpl::isLegalToVectorizeReduction to D111555
  • There are no longer changes to the scalable-strict-fadd.ll test
paulwalker-arm added inline comments.
2177–2185 ↗(On Diff #380003)

Do we need a new TTI interface for this? To my mind the costing side of TTI exists to cost real entities and in this instance the IR has no discrete concept for an FMA reduction.

Instead what we have is LoopVectorize pretending such a concept exists that is knows ahead of time it will simulate with separate fmul and ordered_fadd_reduce operations. For this reason I think it would be better to explicitly cost that exact idiom within the same code that is using it.

So I guess I'm saying that if you move these three lines into LoopVectorize.cpp the patch can be much smaller and you're not creating a new interface for something that doesn't really exist.

RosieSumpter added inline comments.Oct 15 2021, 9:04 AM
2177–2185 ↗(On Diff #380003)

Hi Paul, thanks for having a look at this. Initially I did put the cost calculation into the vectorizer, but there was some discussion (also a comment from @fhahn) about whether it would make more sense in TTI.getArithmeticReductionCost and, to avoid changing the interface for getArithmeticReductionCost, @david-arm and @sdesmalen suggested adding the new getFMulAddReductionCost. I am happy to change it back to LoopVectorize.cpp if that seems like the better option.

paulwalker-arm added a comment.EditedOct 15 2021, 9:13 AM

Thanks for the extra info @RosieSumpter. As I see it @fhahn was just asking a question of which I personally would have answered no, presenting the same rational as my previous comment. I guess we'll have to wait for others to respond.

david-arm added inline comments.Oct 18 2021, 1:17 AM
2177–2185 ↗(On Diff #380003)

For what it's worth, I personally think it makes a bit more sense to calculate the costs separately in the vectoriser because in my mind at least the fmul isn't part of the reduction, since we are always going to widen it into a normal vector operation. I suggested adding a new interface as a compromise, so we can avoid over-complicating (and avoid making this patch significantly larger) the existing getArithmeticReductionCost method. I wouldn't block the patch over this though!

  • Removed the new getFMulAddReductionCost TTI interface
  • Moved the cost calculation back to the vectorizer

Thanks for the comments @paulwalker-arm and @david-arm. I've moved the fmul cost calculation back to the vectorizer since this seems like the more favourable option.

paulwalker-arm added inline comments.Oct 18 2021, 9:23 AM

Perhaps I've misunderstood something? because this code looks more complicated (or perhaps just more verbose) than the previous patch. I guess I'm struggling why different versions of getArithmeticInstrCost are being called between the two. I just assumed you'd add something like:

   InstructionCost BaseCost = TTI.getArithmeticReductionCost(                        
       RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);         
+  // For llvm.fmuladd based reductions we must include the cost of the normal       
+  // vector fmul that will occur prior to the fadd reduction.                       
+  if (RdxDesc.getRecurrenceKind() == RecurKind::FMulAdd)                            
+    BaseCost += TTI.getArithmeticInstrCost(Instruction::FMul, VectorTy, CostKind);  

   // If we're using ordered reductions then we can just return the base cost
   // here, since getArithmeticReductionCost calculates the full ordered
   // reduction cost when FP reassociation is not allowed.
   if (useOrderedReductions(RdxDesc))                                                
     return BaseCost;
RosieSumpter marked an inline comment as done.
RosieSumpter added a reviewer: paulwalker-arm.
  • Make use of default parameters when calling getArithmeticInstrCost
  • Move addition of FMul cost

This cost calculation seems correct to me. For in-loop reductions it calculates the cost as a single fmul + fadd *reduction*. If this is not an in-loop reduction, or if fmuladd is not used in a reduction, it will follow the regular code-path to get the cost of this intrinsic (either as an FMA or separate fmul+fadd).

So, LGTM! (Please address the minor nits before you commit)


nit: redundant comment (i.e. the code says as much)


nit: unnecessary curly braces.


nit: redundant comment.

sdesmalen accepted this revision.Oct 29 2021, 7:34 AM
This revision is now accepted and ready to land.Oct 29 2021, 7:34 AM
paulwalker-arm accepted this revision.Nov 1 2021, 6:05 AM
This revision was landed with ongoing or failed builds.Wed, Nov 24, 1:00 AM
This revision was automatically updated to reflect the committed changes.