
Details

Reviewers

david-arm
kmclaughlin
sdesmalen
dmgreen
spatel
Ayal
fhahn
CarolineConcatto
paulwalker-arm

Commits

rGdf32a39dd0f6: [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic

Summary

This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.
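
As a rough illustration of the model described in this summary, the sketch below adds the cost of one vector fmul to the cost of an ordered fadd reduction. It is self-contained and does not use LLVM: the helper names (`toyFMulCost`, `toyOrderedFAddReductionCost`, `toyFMulAddReductionCost`) and the cost numbers are hypothetical stand-ins, not LLVM APIs or real AArch64 costs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the cost of one widened (vector) fmul.
// A flat cost of 1 is assumed purely for illustration.
std::int64_t toyFMulCost(unsigned VF) {
  (void)VF;
  return 1;
}

// Hypothetical stand-in for an ordered (in-order) fadd reduction.
// Assumes the cost grows linearly with the number of lanes.
std::int64_t toyOrderedFAddReductionCost(unsigned VF) {
  return 2 * static_cast<std::int64_t>(VF);
}

// Model from the summary: fmuladd in an ordered reduction is costed as
// a normal fmul plus an ordered fadd reduction.
std::int64_t toyFMulAddReductionCost(unsigned VF) {
  std::int64_t Cost = toyFMulCost(VF) + toyOrderedFAddReductionCost(VF);
  assert(Cost >= 0 && "costs should not be negative");
  return Cost;
}
```

With these made-up numbers the reduction term dominates as the vectorization factor grows, while the fmul contributes a constant.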

Depends on D111555

Diff Detail

Event Timeline

RosieSumpter created this revision. Oct 12 2021, 4:25 AM

Herald added a subscriber: hiraditya. Oct 12 2021, 4:25 AM

RosieSumpter requested review of this revision. Oct 12 2021, 4:25 AM

Herald added a project: Restricted Project. Oct 12 2021, 4:25 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B128324: Diff 378958. Oct 12 2021, 4:26 AM

fhahn added a subscriber: fhahn. Oct 12 2021, 4:27 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7276	Could this be handled in `TTI.getArithmeticReductionCost`? What about other potential users of fmuladd reductions, like the SLP vectorizer?

RosieSumpter added inline comments. Oct 12 2021, 6:01 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7276	Hi Florian, I did discuss this option with @david-arm, but it would mean changing the interface of `getArithmeticReductionCost` (e.g. by adding an optional `Instruction*` argument) to be able to determine if it's a call to the fmuladd intrinsic. David also made the point that the fmul isn't actually part of the reduction cost, so perhaps it doesn't make sense to ask for an fmuladd reduction cost? If you would prefer it to be there though I'm happy to make the change. For the SLP vectorizer, it doesn't handle the fmuladd at the moment (I've added an assert for this in D111555 for safety)

Created getFMulAddReductionCost. This means the code which calculates the total cost of the fmuladd is moved out of the vectorizer, but avoids changing the interface for getArithmeticReductionCost.

Harbormaster completed remote builds in B128883: Diff 379740. Oct 14 2021, 9:12 AM

RosieSumpter added a reviewer: fhahn. Oct 14 2021, 9:13 AM

RosieSumpter added a reviewer: CarolineConcatto. Oct 15 2021, 2:10 AM

david-arm added inline comments. Oct 15 2021, 4:13 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
398 ↗	(On Diff #379740)	Hi @RosieSumpter, do you know why these CHECK lines have changed? It doesn't seem like your patch should affect these tests because these loops are forced to use a certain VF anyway.

RosieSumpter added inline comments. Oct 15 2021, 5:47 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
398 ↗	(On Diff #379740)	Hi @david-arm, good point. The reason the test has changed is because of adding `FMulAdd` as an allowed recurrence kind to `AArch64TTIImpl::isLegalToVectorizeReduction`. I now see that it probably makes more sense for this particular change to be in D111555, so I'll do that now.

Moved the addition of FMulAdd as a recurrence kind in AArch64TTIImpl::isLegalToVectorizeReduction to D111555
There are no longer changes to the scalable-strict-fadd.ll test

Harbormaster completed remote builds in B129066: Diff 380003. Oct 15 2021, 7:51 AM

paulwalker-arm added a subscriber: paulwalker-arm. Oct 15 2021, 8:30 AM

paulwalker-arm added inline comments.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2177–2185	Do we need a new TTI interface for this? To my mind the costing side of TTI exists to cost real entities and in this instance the IR has no discrete concept for an FMA reduction. Instead what we have is LoopVectorize pretending such a concept exists that it knows ahead of time it will simulate with separate fmul and ordered_fadd_reduce operations. For this reason I think it would be better to explicitly cost that exact idiom within the same code that is using it. So I guess I'm saying that if you move these three lines into LoopVectorize.cpp the patch can be much smaller and you're not creating a new interface for something that doesn't really exist.

RosieSumpter added inline comments. Oct 15 2021, 9:04 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2177–2185	Hi Paul, thanks for having a look at this. Initially I did put the cost calculation into the vectorizer, but there was some discussion (also a comment from @fhahn) about whether it would make more sense in `TTI.getArithmeticReductionCost` and, to avoid changing the interface for `getArithmeticReductionCost`, @david-arm and @sdesmalen suggested adding the new `getFMulAddReductionCost`. I am happy to change it back to LoopVectorize.cpp if that seems like the better option.

Thanks for the extra info @RosieSumpter. As I see it, @fhahn was just asking a question, one I personally would have answered no to, with the same rationale as in my previous comment. I guess we'll have to wait for others to respond.

paulwalker-arm mentioned this in D111555: [LoopVectorize] Add vector reduction support for fmuladd intrinsic. Oct 15 2021, 10:19 AM

david-arm added inline comments. Oct 18 2021, 1:17 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2177–2185	For what it's worth, I personally think it makes a bit more sense to calculate the costs separately in the vectoriser because in my mind at least the fmul isn't part of the reduction, since we are always going to widen it into a normal vector operation. I suggested adding a new interface as a compromise, so we can avoid over-complicating (and avoid making this patch significantly larger) the existing getArithmeticReductionCost method. I wouldn't block the patch over this though!

Removed the new getFMulAddReductionCost TTI interface
Moved the cost calculation back to the vectorizer

Thanks for the comments @paulwalker-arm and @david-arm. I've moved the fmul cost calculation back to the vectorizer since this seems like the more favourable option.

Harbormaster completed remote builds in B129349: Diff 380404. Oct 18 2021, 8:23 AM

paulwalker-arm added inline comments. Oct 18 2021, 9:23 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

7267–7272

Perhaps I've misunderstood something, because this code looks more complicated (or perhaps just more verbose) than the previous patch. I guess I'm struggling to see why different versions of getArithmeticInstrCost are being called between the two. I just assumed you'd add something like:

   InstructionCost BaseCost = TTI.getArithmeticReductionCost(                        
       RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);         
                                                                                    
+  // For llvm.fmuladd based reductions we must include the cost of the normal       
+  // vector fmul that will occur prior to the fadd reduction.                       
+  if (RdxDesc.getRecurrenceKind() == RecurKind::FMulAdd)                            
+    BaseCost += TTI.getArithmeticInstrCost(Instruction::FMul, VectorTy, CostKind);  

   // If we're using ordered reductions then we can just return the base cost
   // here, since getArithmeticReductionCost calculates the full ordered
   // reduction cost when FP reassociation is not allowed.
   if (useOrderedReductions(RdxDesc))                                                
     return BaseCost;
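
To make the comparison concrete, here is a small standalone sketch of the two shapes discussed in this thread: branching to a dedicated helper versus computing the reduction cost first and then adding the fmul cost. Everything in it (`ToyRecurKind`, `toyReductionCostFor`, `toyVectorFMulCost`, and the numbers) is a hypothetical stand-in rather than LLVM code; it only shows that both shapes should produce the same total.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical recurrence kinds; only the two relevant here are modelled.
enum class ToyRecurKind { FAdd, FMulAdd };

// Made-up cost of an ordered fadd reduction over VF lanes.
std::int64_t toyReductionCostFor(unsigned VF) {
  return 2 * static_cast<std::int64_t>(VF);
}

// Made-up flat cost of one vector fmul.
std::int64_t toyVectorFMulCost(unsigned VF) {
  (void)VF;
  return 1;
}

// Shape 1: branch on the recurrence kind and use a dedicated
// "fmuladd reduction" computation.
std::int64_t baseCostViaHelper(ToyRecurKind Kind, unsigned VF) {
  if (Kind == ToyRecurKind::FMulAdd)
    return toyVectorFMulCost(VF) + toyReductionCostFor(VF);
  return toyReductionCostFor(VF);
}

// Shape 2: always compute the reduction cost, then add the fmul cost
// for fmuladd-based reductions.
std::int64_t baseCostInline(ToyRecurKind Kind, unsigned VF) {
  std::int64_t BaseCost = toyReductionCostFor(VF);
  if (Kind == ToyRecurKind::FMulAdd)
    BaseCost += toyVectorFMulCost(VF);
  assert(BaseCost >= 0 && "costs should not be negative");
  return BaseCost;
}
```

The second shape needs no new interface, which is the point being argued for in the review.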

Make use of default parameters when calling getArithmeticInstrCost
Move addition of FMul cost

Harbormaster completed remote builds in B130725: Diff 382346. Oct 26 2021, 9:11 AM

RosieSumpter edited the summary of this revision. Oct 28 2021, 1:35 AM

RosieSumpter added a parent revision: D111555: [LoopVectorize] Add vector reduction support for fmuladd intrinsic.

This cost calculation seems correct to me. For in-loop reductions it calculates the cost as a single fmul + fadd *reduction*. If this is not an in-loop reduction, or if fmuladd is not used in a reduction, it will follow the regular code-path to get the cost of this intrinsic (either as an FMA or separate fmul+fadd).

So, LGTM! (Please address the minor nits before you commit)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7938	nit: redundant comment (i.e. the code says as much)
7939	nit: unnecessary curly braces.
7940	nit: redundant comment.
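
The two code paths described in the approving comment can be sketched roughly as follows. This is a standalone toy model, not the LoopVectorize implementation: `ToyCall`, `toyReductionPatternCost`, `toyCallCost` and all cost values are hypothetical, and `std::optional` stands in for the optional return value of the real reduction-pattern query.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical description of a call being costed.
struct ToyCall {
  bool IsFMulAdd;          // is this a call to llvm.fmuladd?
  bool IsInLoopReduction;  // is it part of an in-loop reduction?
};

// Returns a cost only for in-loop reductions: one fmul (made-up cost 1)
// plus an fadd reduction (made-up cost 2 per lane).
std::optional<std::int64_t> toyReductionPatternCost(const ToyCall &C,
                                                    unsigned VF) {
  if (!C.IsInLoopReduction)
    return std::nullopt;
  return 1 + 2 * static_cast<std::int64_t>(VF);
}

std::int64_t toyCallCost(const ToyCall &C, unsigned VF) {
  assert(VF > 0 && "vectorization factor must be positive");
  if (C.IsFMulAdd) {
    if (auto RedCost = toyReductionPatternCost(C, VF))
      return *RedCost;
  }
  // Regular path: the cheaper of a (made-up) scalarized call cost and a
  // (made-up) vector intrinsic cost.
  std::int64_t CallCost = 10 * static_cast<std::int64_t>(VF);
  std::int64_t IntrinsicCost = 2;
  return std::min(CallCost, IntrinsicCost);
}
```

In this toy model, `toyCallCost(ToyCall{true, true}, 4)` takes the reduction path, while an fmuladd outside an in-loop reduction falls through to the regular intrinsic costing.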

sdesmalen accepted this revision. Oct 29 2021, 7:34 AM

This revision is now accepted and ready to land. Oct 29 2021, 7:34 AM

paulwalker-arm accepted this revision. Nov 1 2021, 6:05 AM

This revision was landed with ongoing or failed builds. Nov 24 2021, 1:00 AM

Closed by commit rGdf32a39dd0f6: [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic (authored by RosieSumpter).

This revision was automatically updated to reflect the committed changes.

RosieSumpter added a commit: rGdf32a39dd0f6: [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic.

Diff 380003

llvm/include/llvm/Analysis/TargetTransformInfo.h

@@ ... @@
   InstructionCost getArithmeticReductionCost(
       unsigned Opcode, VectorType *Ty, Optional<FastMathFlags> FMF,
       TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

   InstructionCost getMinMaxReductionCost(
       VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
       TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

+  /// Calculate the cost of a call to the llvm.fmuladd intrinsic. This is
+  /// modeled as the cost of a normal fmul instruction plus the cost of an fadd
+  /// reduction.
+  InstructionCost getFMulAddReductionCost(
+      VectorType *Ty, Optional<FastMathFlags> FMF,
+      TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;
+
   /// Calculate the cost of an extended reduction pattern, similar to
   /// getArithmeticReductionCost of an Add reduction with an extension and
   /// optional multiply. This is the cost of as:
   /// ResTy vecreduce.add(ext(Ty A)), or if IsMLA flag is set then:
   /// ResTy vecreduce.add(mul(ext(Ty A), ext(Ty B)). The reduction happens
   /// on a VectorType with ResTy elements and Ty lanes.
   InstructionCost getExtendedAddReductionCost(
       bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
@@ ... @@ virtual InstructionCost getInterleavedMemoryOpCost(
       bool UseMaskForCond = false, bool UseMaskForGaps = false) = 0;
   virtual InstructionCost
   getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
                              Optional<FastMathFlags> FMF,
                              TTI::TargetCostKind CostKind) = 0;
   virtual InstructionCost
   getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
                          TTI::TargetCostKind CostKind) = 0;
+  virtual InstructionCost
+  getFMulAddReductionCost(VectorType *Ty, Optional<FastMathFlags> FMF,
+                          TTI::TargetCostKind CostKind) = 0;
   virtual InstructionCost getExtendedAddReductionCost(
       bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
       TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) = 0;
   virtual InstructionCost
   getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                         TTI::TargetCostKind CostKind) = 0;
   virtual InstructionCost getCallInstrCost(Function *F, Type *RetTy,
                                            ArrayRef<Type *> Tys,
@@ ... @@ getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
                              TTI::TargetCostKind CostKind) override {
     return Impl.getArithmeticReductionCost(Opcode, Ty, FMF, CostKind);
   }
   InstructionCost
   getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
                          TTI::TargetCostKind CostKind) override {
     return Impl.getMinMaxReductionCost(Ty, CondTy, IsUnsigned, CostKind);
   }
+  InstructionCost
+  getFMulAddReductionCost(VectorType *Ty, Optional<FastMathFlags> FMF,
+                          TTI::TargetCostKind CostKind) override {
+    return Impl.getFMulAddReductionCost(Ty, FMF, CostKind);
+  }
   InstructionCost getExtendedAddReductionCost(
       bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
       TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) override {
     return Impl.getExtendedAddReductionCost(IsMLA, IsUnsigned, ResTy, Ty,
                                             CostKind);
   }
   InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                         TTI::TargetCostKind CostKind) override {

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

@@ ... @@ InstructionCost getArithmeticReductionCost(unsigned, VectorType *,
     return 1;
   }

   InstructionCost getMinMaxReductionCost(VectorType *, VectorType *, bool,
                                          TTI::TargetCostKind) const {
     return 1;
   }

+  InstructionCost getFMulAddReductionCost(VectorType *, Optional<FastMathFlags>,
+                                          TTI::TargetCostKind) const {
+    return 1;
+  }
+
   InstructionCost
   getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned, Type *ResTy,
                               VectorType *Ty,
                               TTI::TargetCostKind CostKind) const {
     return 1;
   }

   InstructionCost getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const {

llvm/include/llvm/CodeGen/BasicTTIImpl.h

@@ ... @@ MinMaxCost +=
           thisT()->getCmpSelInstrCost(Instruction::Select, Ty, CondTy,
                                       CmpInst::BAD_ICMP_PREDICATE, CostKind));
     // The last min/max should be in vector registers and we counted it above.
     // So just need a single extractelement.
     return ShuffleCost + MinMaxCost +
            thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0);
   }

+  InstructionCost getFMulAddReductionCost(VectorType *Ty,
+                                          Optional<FastMathFlags> FMF,
+                                          TTI::TargetCostKind CostKind) {
+    InstructionCost FAddReductionCost = thisT()->getArithmeticReductionCost(
+        Instruction::FAdd, Ty, FMF, CostKind);
+    InstructionCost FMulCost =
+        thisT()->getArithmeticInstrCost(Instruction::FMul, Ty, CostKind);
+    return FMulCost + FAddReductionCost;
+  }
+
   InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
                                               Type *ResTy, VectorType *Ty,
                                               TTI::TargetCostKind CostKind) {
     // Without any native support, this is equivalent to the cost of
     // vecreduce.add(ext) or if IsMLA vecreduce.add(mul(ext, ext))
     VectorType *ExtTy = VectorType::get(ResTy, Ty);
     InstructionCost RedCost = thisT()->getArithmeticReductionCost(
         Instruction::Add, ExtTy, None, CostKind);

llvm/lib/Analysis/TargetTransformInfo.cpp

@@ ... @@ InstructionCost TargetTransformInfo::getMinMaxReductionCost(
     VectorType *Ty, VectorType *CondTy, bool IsUnsigned,
     TTI::TargetCostKind CostKind) const {
   InstructionCost Cost =
       TTIImpl->getMinMaxReductionCost(Ty, CondTy, IsUnsigned, CostKind);
   assert(Cost >= 0 && "TTI should not produce negative costs!");
   return Cost;
 }

+InstructionCost TargetTransformInfo::getFMulAddReductionCost(
+    VectorType *Ty, Optional<FastMathFlags> FMF,
+    TTI::TargetCostKind CostKind) const {
+  InstructionCost Cost = TTIImpl->getFMulAddReductionCost(Ty, FMF, CostKind);
+  assert(Cost >= 0 && "TTI should not produce negative costs!");
+  return Cost;
+}
+
 InstructionCost TargetTransformInfo::getExtendedAddReductionCost(
     bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
     TTI::TargetCostKind CostKind) const {
   return TTIImpl->getExtendedAddReductionCost(IsMLA, IsUnsigned, ResTy, Ty,
                                               CostKind);
 }

 InstructionCost

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ ... @@ Optional<InstructionCost> LoopVectorizationCostModel::getReductionPatternCost(
   Instruction *LastChain = InLoopReductionImmediateChains[RetI];
   Instruction *ReductionPhi = LastChain;
   while (!isa<PHINode>(ReductionPhi))
     ReductionPhi = InLoopReductionImmediateChains[ReductionPhi];

   const RecurrenceDescriptor &RdxDesc =
       Legal->getReductionVars()[cast<PHINode>(ReductionPhi)];

-  InstructionCost BaseCost = TTI.getArithmeticReductionCost(
-      RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+  InstructionCost BaseCost;
+  if (RdxDesc.getRecurrenceKind() == RecurKind::FMulAdd)
+    // Recognize a call to the llvm.fmuladd intrinsic.
+    BaseCost = TTI.getFMulAddReductionCost(VectorTy, RdxDesc.getFastMathFlags(),
+                                           CostKind);
+  else
+    BaseCost = TTI.getArithmeticReductionCost(
+        RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);

   // If we're using ordered reductions then we can just return the base cost
   // here, since getArithmeticReductionCost calculates the full ordered
   // reduction cost when FP reassociation is not allowed.
   if (useOrderedReductions(RdxDesc))
     return BaseCost;

   // Get the operand that was not the reduction chain and match it to one of the
   // patterns, returning the better cost if it is found.
   Instruction *RedOp = RetI->getOperand(1) == LastChain
                            ? dyn_cast<Instruction>(RetI->getOperand(0))
                            : dyn_cast<Instruction>(RetI->getOperand(1));

   VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);

   Instruction *Op0, *Op1;
   if (RedOp &&
       match(RedOp,
             m_ZExtOrSExt(m_Mul(m_Instruction(Op0), m_Instruction(Op1)))) &&
@@ ... @@ if (canTruncateToMinimalBitwidth(I, VF)) {
         VectorTy =
             smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy);
       }
     }

     return TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I);
   }
   case Instruction::Call: {
+    // Recognize a call to the llvm.fmuladd intrinsic.
+    if (RecurrenceDescriptor::isFMulAddIntrinsic(I)) {
+      // Detect reduction patterns.
+      if (auto RedCost = getReductionPatternCost(I, VF, VectorTy, CostKind))
+        return *RedCost;
+    }
     bool NeedToScalarize;
     CallInst *CI = cast<CallInst>(I);
     InstructionCost CallCost = getVectorCallCost(CI, VF, NeedToScalarize);
     if (getVectorIntrinsicIDForCall(CI, TLI)) {
       InstructionCost IntrinsicCost = getVectorIntrinsicCost(CI, VF);
       return std::min(CallCost, IntrinsicCost);
     }
     return CallCost;

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll

@@ ... @@ for.body:
   %add = fadd double %0, %sum.07
   %iv.next = add nuw nsw i64 %iv, 1
   %exitcond.not = icmp eq i64 %iv.next, %n
   br i1 %exitcond.not, label %for.end, label %for.body

 for.end:
   ret double %add
 }

+; CHECK-VF4: Found an estimated cost of 23 for VF 4 For instruction: %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+; CHECK-VF8: Found an estimated cost of 46 for VF 8 For instruction: %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+
+define float @fmuladd_strict32(float* %a, float* %b, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+  %sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
+  %arrayidx = getelementptr inbounds float, float* %a, i64 %iv
+  %0 = load float, float* %arrayidx, align 4
+  %arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
+  %1 = load float, float* %arrayidx2, align 4
+  %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %for.end, label %for.body
+
+for.end:
+  ret float %muladd
+}
+
+declare float @llvm.fmuladd.f32(float, float, float)
+
+; CHECK-VF4: Found an estimated cost of 22 for VF 4 For instruction: %muladd = tail call double @llvm.fmuladd.f64(double %0, double %1, double %sum.07)
+; CHECK-VF8: Found an estimated cost of 44 for VF 8 For instruction: %muladd = tail call double @llvm.fmuladd.f64(double %0, double %1, double %sum.07)
+
+define double @fmuladd_strict64(double* %a, double* %b, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+  %sum.07 = phi double [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
+  %arrayidx = getelementptr inbounds double, double* %a, i64 %iv
+  %0 = load double, double* %arrayidx, align 4
+  %arrayidx2 = getelementptr inbounds double, double* %b, i64 %iv
+  %1 = load double, double* %arrayidx2, align 4
+  %muladd = tail call double @llvm.fmuladd.f64(double %0, double %1, double %sum.07)
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %for.end, label %for.body
+
+for.end:
+  ret double %muladd
+}
+
+declare double @llvm.fmuladd.f64(double, double, double)
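
For orientation, the `fmuladd_strict32` test in this file corresponds roughly to a source-level loop like the one sketched below. This is an assumption about the shape of the originating code, not something stated in the review: the function name `fmuladdStrict32` and the example arrays are hypothetical, and whether a compiler actually emits `llvm.fmuladd` for the multiply-add depends on its floating-point contraction settings. The IR loop also runs its body before testing the exit condition, so the match only holds for `n >= 1`.

```cpp
#include <cassert>
#include <cstdint>

// Strict (in-order) multiply-accumulate over two arrays. The accumulation
// order is fixed, which is what makes this an ordered reduction.
float fmuladdStrict32(const float *a, const float *b, std::int64_t n) {
  assert(n == 0 || (a && b));
  float sum = 0.0f;
  for (std::int64_t i = 0; i < n; ++i)
    sum = a[i] * b[i] + sum;  // candidate for llvm.fmuladd
  return sum;
}

// Example inputs chosen so that every product and partial sum is exactly
// representable, whether or not the multiply-add is fused.
const float ExampleA[] = {1.0f, 2.0f, 3.0f};
const float ExampleB[] = {4.0f, 5.0f, 6.0f};
```

Calling `fmuladdStrict32(ExampleA, ExampleB, 3)` accumulates 4, then 10, then 18, in that order.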

This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize][CostModel] Update cost model for fmuladd intrinsic
ClosedPublic

