
Details

Reviewers

david-arm
kmclaughlin
sdesmalen
dmgreen
spatel
Ayal
fhahn
CarolineConcatto
paulwalker-arm

Commits

rGdf32a39dd0f6: [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic

Summary

This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.
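As a rough illustration of the summary above, the costing rule can be sketched in standalone C++. This is not the LoopVectorize code: `ToyCosts` and `fmulAddOrderedReductionCost` are hypothetical names, and the numeric costs are placeholders for whatever the target's TTI would return.

```cpp
#include <cassert>

// Hypothetical per-operation costs at one vectorization factor. In LLVM these
// would come from TTI queries; here they are plain placeholder integers.
struct ToyCosts {
  int VectorFMul;           // cost of one widened fmul
  int OrderedFAddReduction; // cost of one in-order (strict) fadd reduction
};

// Models an llvm.fmuladd call that feeds an ordered reduction as a normal
// vector fmul plus an ordered fadd reduction, as the summary describes.
inline int fmulAddOrderedReductionCost(const ToyCosts &C) {
  assert(C.VectorFMul >= 0 && C.OrderedFAddReduction >= 0);
  return C.OrderedFAddReduction + C.VectorFMul;
}
```

For example, with placeholder costs of 2 for the fmul and 20 for the reduction, `fmulAddOrderedReductionCost(ToyCosts{2, 20})` yields 22.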

Depends on D111555

Diff Detail

Event Timeline

RosieSumpter created this revision. Oct 12 2021, 4:25 AM

Herald added a subscriber: hiraditya. Oct 12 2021, 4:25 AM

RosieSumpter requested review of this revision. Oct 12 2021, 4:25 AM

Herald added a project: Restricted Project. Oct 12 2021, 4:25 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B128324: Diff 378958. Oct 12 2021, 4:26 AM

fhahn added a subscriber: fhahn. Oct 12 2021, 4:27 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7265	Could this be handled in `TTI.getArithmeticReductionCost`? What about other potential users of fmuladd reductions, like the SLP vectorizer?

RosieSumpter added inline comments. Oct 12 2021, 6:01 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7265	Hi Florian, I did discuss this option with @david-arm, but it would mean changing the interface of `getArithmeticReductionCost` (e.g. by adding an optional `Instruction*` argument) to be able to determine if it's a call to the fmuladd intrinsic. David also made the point that the fmul isn't actually part of the reduction cost, so perhaps it doesn't make sense to ask for an fmuladd reduction cost? If you would prefer it to be there though I'm happy to make the change. For the SLP vectorizer, it doesn't handle the fmuladd at the moment (I've added an assert for this in D111555 for safety)

Created getFMulAddReductionCost. This means the code which calculates the total cost of the fmuladd is moved out of the vectorizer, but avoids changing the interface for getArithmeticReductionCost.

Harbormaster completed remote builds in B128883: Diff 379740. Oct 14 2021, 9:12 AM

RosieSumpter added a reviewer: fhahn. Oct 14 2021, 9:13 AM

RosieSumpter added a reviewer: CarolineConcatto. Oct 15 2021, 2:10 AM

david-arm added inline comments. Oct 15 2021, 4:13 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
398 ↗	(On Diff #379740)	Hi @RosieSumpter, do you know why these CHECK lines have changed? It doesn't seem like your patch should affect these tests because these loops are forced to use a certain VF anyway.

RosieSumpter added inline comments. Oct 15 2021, 5:47 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
398 ↗	(On Diff #379740)	Hi @david-arm, good point. The reason the test has changed is because of adding `FMulAdd` as an allowed recurrence kind to `AArch64TTIImpl::isLegalToVectorizeReduction`. I now see that it probably makes more sense for this particular change to be in D111555, so I'll do that now.

Moved the addition of FMulAdd as a recurrence kind in AArch64TTIImpl::isLegalToVectorizeReduction to D111555
There are no longer changes to the scalable-strict-fadd.ll test

Harbormaster completed remote builds in B129066: Diff 380003. Oct 15 2021, 7:51 AM

paulwalker-arm added a subscriber: paulwalker-arm. Oct 15 2021, 8:30 AM

paulwalker-arm added inline comments.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2177–2185 ↗	(On Diff #380003)	Do we need a new TTI interface for this? To my mind the costing side of TTI exists to cost real entities, and in this instance the IR has no discrete concept for an FMA reduction. Instead, what we have is LoopVectorize pretending such a concept exists that it knows ahead of time it will simulate with separate fmul and ordered_fadd_reduce operations. For this reason I think it would be better to explicitly cost that exact idiom within the same code that is using it. So I guess I'm saying that if you move these three lines into LoopVectorize.cpp, the patch can be much smaller and you're not creating a new interface for something that doesn't really exist.

RosieSumpter added inline comments. Oct 15 2021, 9:04 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2177–2185 ↗	(On Diff #380003)	Hi Paul, thanks for having a look at this. Initially I did put the cost calculation into the vectorizer, but there was some discussion (also a comment from @fhahn) about whether it would make more sense in `TTI.getArithmeticReductionCost` and, to avoid changing the interface for `getArithmeticReductionCost`, @david-arm and @sdesmalen suggested adding the new `getFMulAddReductionCost`. I am happy to change it back to LoopVectorize.cpp if that seems like the better option.

Thanks for the extra info @RosieSumpter. As I see it, @fhahn was just asking a question, one that I personally would have answered no to, with the same rationale as in my previous comment. I guess we'll have to wait for others to respond.

paulwalker-arm mentioned this in D111555: [LoopVectorize] Add vector reduction support for fmuladd intrinsic. Oct 15 2021, 10:19 AM

david-arm added inline comments. Oct 18 2021, 1:17 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
2177–2185 ↗	(On Diff #380003)	For what it's worth, I personally think it makes a bit more sense to calculate the costs separately in the vectoriser because, in my mind at least, the fmul isn't part of the reduction, since we are always going to widen it into a normal vector operation. I suggested adding a new interface as a compromise, so we can avoid over-complicating the existing getArithmeticReductionCost method (and avoid making this patch significantly larger). I wouldn't block the patch over this though!
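The point that the fmul is widened into a normal vector operation and only the fadd is part of the ordered reduction can be sketched in standalone C++. This is an illustration only: `dotScalar` and `dotWidenedOrdered` are hypothetical names, the "vector" is emulated with a small buffer of `VF` lanes, and the multiply and add are written as separate operations (whereas llvm.fmuladd permits, but does not require, fusing them).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference loop: one multiply and one add per element, with the
// additions performed strictly in index order.
inline float dotScalar(const std::vector<float> &A,
                       const std::vector<float> &B) {
  assert(A.size() == B.size());
  float Sum = 0.0f;
  for (std::size_t I = 0; I < A.size(); ++I)
    Sum = A[I] * B[I] + Sum;
  return Sum;
}

// Shape of the vectorized loop: each chunk of up to VF elements is multiplied
// lane-wise (the "normal vector fmul"), then the products are added to the
// running sum one lane at a time, in order (the "ordered fadd reduction").
inline float dotWidenedOrdered(const std::vector<float> &A,
                               const std::vector<float> &B, std::size_t VF) {
  assert(A.size() == B.size() && VF > 0);
  float Sum = 0.0f;
  std::vector<float> Prod(VF);
  for (std::size_t Base = 0; Base < A.size(); Base += VF) {
    const std::size_t Lanes = std::min(VF, A.size() - Base);
    for (std::size_t L = 0; L < Lanes; ++L) // widened fmul
      Prod[L] = A[Base + L] * B[Base + L];
    for (std::size_t L = 0; L < Lanes; ++L) // in-order fadd reduction
      Sum = Sum + Prod[L];
  }
  return Sum;
}
```

Because the additions happen in the same order in both functions, only the multiply changes shape, which is why the two are costed as separate pieces.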

Removed the new getFMulAddReductionCost TTI interface
Moved the cost calculation back to the vectorizer

Thanks for the comments @paulwalker-arm and @david-arm. I've moved the fmul cost calculation back to the vectorizer since this seems like the more favourable option.

Harbormaster completed remote builds in B129349: Diff 380404. Oct 18 2021, 8:23 AM

paulwalker-arm added inline comments. Oct 18 2021, 9:23 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

7256–7261

Perhaps I've misunderstood something, because this code looks more complicated (or perhaps just more verbose) than the previous patch. I guess I'm struggling to see why different versions of getArithmeticInstrCost are being called between the two. I just assumed you'd add something like:

   InstructionCost BaseCost = TTI.getArithmeticReductionCost(
       RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);

+  // For llvm.fmuladd based reductions we must include the cost of the normal
+  // vector fmul that will occur prior to the fadd reduction.
+  if (RdxDesc.getRecurrenceKind() == RecurKind::FMulAdd)
+    BaseCost += TTI.getArithmeticInstrCost(Instruction::FMul, VectorTy, CostKind);

   // If we're using ordered reductions then we can just return the base cost
   // here, since getArithmeticReductionCost calculates the full ordered
   // reduction cost when FP reassociation is not allowed.
   if (useOrderedReductions(RdxDesc))
     return BaseCost;

Make use of default parameters when calling getArithmeticInstrCost
Move addition of FMul cost

Harbormaster completed remote builds in B130725: Diff 382346. Oct 26 2021, 9:11 AM

RosieSumpter edited the summary of this revision. Oct 28 2021, 1:35 AM

RosieSumpter added a parent revision: D111555: [LoopVectorize] Add vector reduction support for fmuladd intrinsic.

This cost calculation seems correct to me. For in-loop reductions it calculates the cost as a single fmul + fadd *reduction*. If this is not an in-loop reduction, or if fmuladd is not used in a reduction, it will follow the regular code-path to get the cost of this intrinsic (either as an FMA or separate fmul+fadd).

So, LGTM! (Please address the minor nits before you commit)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7927	nit: redundant comment (i.e. the code says as much)
7928	nit: unnecessary curly braces.
7929	nit: redundant comment.
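The code paths described in the review comment above can be sketched as a small standalone C++ dispatch. This is a toy model, not the LoopVectorize implementation: `ToyCallCosts` and `toyFMulAddCallCost` are hypothetical names, and the optional fields stand in for "a reduction pattern cost was found" and "a vector intrinsic form exists".

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical inputs for costing one llvm.fmuladd call.
struct ToyCallCosts {
  // Set only when the call is part of an in-loop reduction; holds the cost of
  // the fmul plus the fadd reduction.
  std::optional<int> ReductionPatternCost;
  // Cost of treating the call as an ordinary vector call.
  int CallCost;
  // Set only when a vector intrinsic form of the call is available.
  std::optional<int> IntrinsicCost;
};

// Prefer the reduction pattern cost when present; otherwise fall back to the
// regular path, which takes the cheaper of the call and intrinsic costs.
inline int toyFMulAddCallCost(const ToyCallCosts &C) {
  assert(C.CallCost >= 0);
  if (C.ReductionPatternCost)
    return *C.ReductionPatternCost;
  if (C.IntrinsicCost)
    return std::min(C.CallCost, *C.IntrinsicCost);
  return C.CallCost;
}
```

With placeholder values, `toyFMulAddCallCost(ToyCallCosts{22, 100, 4})` takes the reduction path, while `toyFMulAddCallCost(ToyCallCosts{std::nullopt, 100, 4})` falls through to the regular call-versus-intrinsic comparison.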

sdesmalen accepted this revision. Oct 29 2021, 7:34 AM

This revision is now accepted and ready to land. Oct 29 2021, 7:34 AM

paulwalker-arm accepted this revision. Nov 1 2021, 6:05 AM

This revision was landed with ongoing or failed builds. Nov 24 2021, 1:00 AM

Closed by commit rGdf32a39dd0f6: [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic (authored by RosieSumpter).

This revision was automatically updated to reflect the committed changes.

RosieSumpter added a commit: rGdf32a39dd0f6: [LoopVectorize][CostModel] Update cost model for fmuladd intrinsic.

Diff 382346

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

(7,241 lines not shown; enclosing context: while (!isa<PHINode>(ReductionPhi)))

     ReductionPhi = InLoopReductionImmediateChains[ReductionPhi];

   const RecurrenceDescriptor &RdxDesc =
       Legal->getReductionVars()[cast<PHINode>(ReductionPhi)];

   InstructionCost BaseCost = TTI.getArithmeticReductionCost(
       RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);

+  // For a call to the llvm.fmuladd intrinsic we need to add the cost of a
+  // normal fmul instruction to the cost of the fadd reduction.
+  if (RdxDesc.getRecurrenceKind() == RecurKind::FMulAdd)
+    BaseCost +=
+        TTI.getArithmeticInstrCost(Instruction::FMul, VectorTy, CostKind);

   // If we're using ordered reductions then we can just return the base cost
   // here, since getArithmeticReductionCost calculates the full ordered
   // reduction cost when FP reassociation is not allowed.
   if (useOrderedReductions(RdxDesc))
     return BaseCost;

   // Get the operand that was not the reduction chain and match it to one of the
   // patterns, returning the better cost if it is found.
   Instruction *RedOp = RetI->getOperand(1) == LastChain
                            ? dyn_cast<Instruction>(RetI->getOperand(0))
                            : dyn_cast<Instruction>(RetI->getOperand(1));

   VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);

   Instruction *Op0, *Op1;
   if (RedOp &&
       match(RedOp,
             m_ZExtOrSExt(m_Mul(m_Instruction(Op0), m_Instruction(Op1)))) &&

(645 lines not shown; enclosing context: if (canTruncateToMinimalBitwidth(I, VF)) {)

         VectorTy =
             smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy);
       }
     }

     return TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I);
   }
   case Instruction::Call: {
+    // Recognize a call to the llvm.fmuladd intrinsic.
+    if (RecurrenceDescriptor::isFMulAddIntrinsic(I)) {
+      // Detect reduction patterns.
+      if (auto RedCost = getReductionPatternCost(I, VF, VectorTy, CostKind))
+        return *RedCost;
+    }
     bool NeedToScalarize;
     CallInst *CI = cast<CallInst>(I);
     InstructionCost CallCost = getVectorCallCost(CI, VF, NeedToScalarize);
     if (getVectorIntrinsicIDForCall(CI, TLI)) {
       InstructionCost IntrinsicCost = getVectorIntrinsicCost(CI, VF);
       return std::min(CallCost, IntrinsicCost);
     }
     return CallCost;

(2,715 lines not shown)

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll

(42 lines not shown; enclosing context: for.body:)

   %add = fadd double %0, %sum.07
   %iv.next = add nuw nsw i64 %iv, 1
   %exitcond.not = icmp eq i64 %iv.next, %n
   br i1 %exitcond.not, label %for.end, label %for.body

 for.end:
   ret double %add
 }

+; CHECK-VF4: Found an estimated cost of 23 for VF 4 For instruction: %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+; CHECK-VF8: Found an estimated cost of 46 for VF 8 For instruction: %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+
+define float @fmuladd_strict32(float* %a, float* %b, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+  %sum.07 = phi float [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
+  %arrayidx = getelementptr inbounds float, float* %a, i64 %iv
+  %0 = load float, float* %arrayidx, align 4
+  %arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
+  %1 = load float, float* %arrayidx2, align 4
+  %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %for.end, label %for.body
+
+for.end:
+  ret float %muladd
+}
+
+declare float @llvm.fmuladd.f32(float, float, float)
+
+; CHECK-VF4: Found an estimated cost of 22 for VF 4 For instruction: %muladd = tail call double @llvm.fmuladd.f64(double %0, double %1, double %sum.07)
+; CHECK-VF8: Found an estimated cost of 44 for VF 8 For instruction: %muladd = tail call double @llvm.fmuladd.f64(double %0, double %1, double %sum.07)
+
+define double @fmuladd_strict64(double* %a, double* %b, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+  %sum.07 = phi double [ 0.000000e+00, %entry ], [ %muladd, %for.body ]
+  %arrayidx = getelementptr inbounds double, double* %a, i64 %iv
+  %0 = load double, double* %arrayidx, align 4
+  %arrayidx2 = getelementptr inbounds double, double* %b, i64 %iv
+  %1 = load double, double* %arrayidx2, align 4
+  %muladd = tail call double @llvm.fmuladd.f64(double %0, double %1, double %sum.07)
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %for.end, label %for.body
+
+for.end:
+  ret double %muladd
+}
+
+declare double @llvm.fmuladd.f64(double, double, double)

This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize][CostModel] Update cost model for fmuladd intrinsic
ClosedPublic
