This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Account for cost of removing FMA opportunities
Needs Review · Public

Authored by wjschmidt on May 19 2022, 9:32 AM.

Details

Reviewers

vdmitrie
ABataev
fhahn
RKSimon
vporpo
spatel

Summary

When we generate a horizontal reduction of floating adds fed by a vectorized tree rooted at floating multiplies, or when we vectorize a list of floating multiplies that each feeds a floating add/subtract, account for the cost of no longer being able to generate scalar FMAs.
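As a source-level illustration of the two shapes named in the summary, here is a minimal sketch. The function names `dot4` and `mulThenAddSub` are hypothetical and are not taken from the patch or its test case. Whether SLP actually treats such code as a candidate depends on the target and on the surrounding code.

```cpp
#include <cassert>

// Hypothetical case 1: a horizontal reduction of adds fed by multiplies.
// Evaluated as scalars, products can fuse with the adds that consume them
// when contraction is allowed.
float dot4(const float *A, const float *B) {
  return A[0] * B[0] + A[1] * B[1] + A[2] * B[2] + A[3] * B[3];
}

// Hypothetical case 2: a list of multiplies, each feeding exactly one add or
// subtract. If only the multiplies are vectorized, the scalar add/subtract
// can no longer fuse with its multiply.
void mulThenAddSub(const float *A, const float *B, float X, float Y,
                   float &R0, float &R1) {
  R0 = A[0] * B[0] + X;
  R1 = A[1] * B[1] - Y;
}
```

In both shapes the scalar code offers multiply/add pairs that a target with FMA support could fuse, which is the opportunity the summary says the cost model should account for.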

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

wjschmidt created this revision. May 19 2022, 9:32 AM

Herald added a project: Restricted Project. May 19 2022, 9:32 AM

Herald added a subscriber: hiraditya.

wjschmidt requested review of this revision. May 19 2022, 9:32 AM

Herald added a project: Restricted Project. May 19 2022, 9:32 AM

Herald added a subscriber: llvm-commits.

I think this loss is not only relevant for horizontal reductions, but can pessimize things in general, right?

ABataev added inline comments. May 19 2022, 9:37 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9675	I think it must be a part of TTI::getArithmeticReductionCost

In D125987#3525414, @fhahn wrote:

I think this loss is not only relevant for horizontal reductions, but can pessimize things in general, right?

Yes. I have a second patch that covers that case. I split the patches so we could see that we still have the pessimization in the second test case after the reduction is avoided, which the second patch will fix.

On what target are you seeing this problem?

llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
61	why is this a constant?

wjschmidt added inline comments. May 19 2022, 9:54 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9675	Because this will also be called in a case where it's not part of a reduction, I'd like to keep it separate. (See my comment in the summary.) Perhaps I shouldn't have split the patches so this would be more evident.

In D125987#3525451, @RKSimon wrote:

On what target are you seeing this problem?

-mcpu=core-avx2 -mtriple=x86_64-unknown-linux-gnu

wjschmidt added inline comments. May 19 2022, 9:56 AM

llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
61	This was reduced from a big messy function. Bugpoint ended up with an undef here and elsewhere, and previous reviewers asked me to remove the undefs.

ABataev added inline comments. May 19 2022, 10:00 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9675	Anyway, maybe it would be better to make it a part of TTI rather than put it in SLPVectorizer?

wjschmidt added inline comments. May 19 2022, 10:06 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9675	That's certainly possible, and probably a good idea. We'd need to pull out the final multiplication by Width and the SLP-specific debug stuff, but with that done it would fit nicely in TTI, and there might be potential for reuse.

RKSimon added inline comments. May 19 2022, 10:12 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9675	Ideally the getArithmeticReductionCost costs wouldn't be affected but getArithmeticInstrCost could be adjusted to account for potential FMA patterns in the scalar/vector FADD costs

ABataev added inline comments. May 19 2022, 10:20 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9675	Yep, good idea!

Harbormaster completed remote builds in B165352: Diff 430714. May 19 2022, 11:00 AM

wjschmidt added inline comments. May 19 2022, 11:03 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9675	I'll look into possibilities. That might be a bit awkward for the other use case. It's clear that I should have combined the two patches to make this easier to judge, so I'll do that in the next revision. Thanks for the ideas!

This revision now includes fixes for both cases: a horizontal reduction of fadds fed by a vectorized tree rooted at fmuls, and a vectorized tree of fmuls that feeds arbitrary fadds and/or fsubs. It was misguided to break this up into two patches without showing you both...

The reviewers of the previous revision requested that the FMA cost calculation be moved into TTI. I've done that here, but I kept it as a separate calculation rather than incorporating it somehow into getArithmeticInstrCost, for a couple of reasons. If I understood the proposal correctly, it was to have getArithmeticInstrCost when called on an FAdd (or FSub) produce a lesser cost when fed by an FMul as one of the operands, reporting that the FAdd/FSub is essentially free in those circumstances. (If I've misunderstood, please correct me.)

First, I didn't find this particularly helpful for the use cases I'm addressing here. For example, for the tryToVectorizeList() case, we have a collection of FMul instructions. We can follow the defs to get the FAdd/FSubs to ask about their cost, but that will just get us a raw number. It doesn't tell us how much we ought to add back to the cost for breaking the FMA. (I don't think that just having the altered cost for FAdd/FSub will avoid us considering the FMul vectorization in the first place.)

Second, I worry about unintended effects throughout the middle end by altering the cost model this way. I think it's better to provide the additional interface so that the lost FMA cost can be taken into account only where it's known to matter.

Thanks for the reviews!

Harbormaster completed remote builds in B165826: Diff 431351. May 23 2022, 7:35 AM

ABataev added inline comments. May 25 2022, 7:55 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	Can this be hidden in TTI completely somehow, to avoid the extra calls to adjustForFMAs in SLP altogether?

RKSimon added inline comments. May 25 2022, 9:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	+1 - the only problem is that this will probably end up having to be added on a target-by-target basis in getArithmeticInstrCost - hopefully we can provide a helper wrapper so it's only a one-liner in each target.

wjschmidt added inline comments. May 25 2022, 9:56 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	Hi -- sorry, I'll be out of town for the next week, so can't look at this right away. In the meantime, can you please give me a little bit more idea exactly what you're looking for? I've expressed my concerns about making direct changes to getArithmeticInstrCost for this in the comments for this revision -- can you please address those? It's not at all clear to me what you're looking for other than "hide it from SLP somehow."

wjschmidt added inline comments. Jun 3 2022, 7:00 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
921	Pasto here, this should be Instruction::FAdd.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	I took a run at prototyping this yesterday for X86 TTI. I put a shim in X86TTIImpl::getArithmeticInstrCost() to look for cases of FAdd and FSub that we expect to be folded into FMAs, and reduce their cost appropriately. This was sufficient to prevent the horizontal reduction from happening, but as I expected, this doesn't help us with the tryToVectorizeList() case, where the cost modeling is only looking at the multiplies and not the add/sub fed by them. (I.e., the second variant in the test case isn't fixed.) I don't see any way to avoid explicit FMA checking in the SLP vectorizer to solve this, as explained in my summary for this revision. Given this result, I'd like to proceed with the patch provided here. It would be pleasant if we could hide the issue entirely from SLP, but it doesn't seem practical. Thanks!

Rebased; fixed pasto in getFMACostSavings(); ensured that getFMACostSavings() returns nonnegative values; and made a slight simplification in adjustForFMAs(). Thanks!
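The savings formula from the diff comment, Cost(FMul) + Cost(FAdd) - Cost(FMA), combined with the nonnegative clamp and the per-lane scaling mentioned in this thread, can be sketched as follows. This is a toy model with plain integers rather than LLVM's InstructionCost; the struct `ToyCosts`, both function names, and the unit costs are illustrative assumptions, and the real patch may clamp or scale differently.

```cpp
#include <cassert>

// Toy per-operation costs. The numbers passed in are assumptions for
// illustration, not values from any real target's cost model.
struct ToyCosts {
  int FMul;
  int FAdd;
  int FMA;
};

// Savings of one fused multiply-add over a separate multiply and add,
// clamped at zero so an expensive FMA never yields a negative saving.
int fmaSavings(ToyCosts C) {
  int Savings = C.FMul + C.FAdd - C.FMA;
  return Savings > 0 ? Savings : 0;
}

// Penalty charged to a vectorization candidate that gives up Width scalar
// FMA opportunities (one per lane).
int lostFMAPenalty(ToyCosts C, int Width) { return fmaSavings(C) * Width; }
```

With unit costs for all three operations, a four-lane candidate would be charged a penalty of 4 under this model.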

Harbormaster completed remote builds in B169845: Diff 436946. Jun 14 2022, 4:24 PM

ABataev added inline comments. Jun 15 2022, 6:10 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	Could you provide more details why this does not help with tryToVectorizeList?

wjschmidt marked an inline comment as not done. Jun 15 2022, 6:14 AM

wjschmidt added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	Because tryToVectorizeList doesn't see the adds at all. The root of the tree is the multiplies. Each of those multiplies happens to feed an add that's outside the vectorizable tree. There's never any call to getArithmeticInstrCost for the adds, so modifying the cost there is of no help.

ABataev added inline comments. Jun 15 2022, 6:15 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	Can we look through the users of these multiplies?

wjschmidt added inline comments. Jun 15 2022, 6:19 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	That's exactly what this patch is doing.

wjschmidt added inline comments. Jun 15 2022, 6:20 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	To be clear, the current patch does fix the tryToVectorizeList case. It just doesn't hide that aspect of things in TTI.
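The "look through the users of these multiplies" check discussed here can be sketched on a toy IR. This is not LLVM code: `Op`, `Node`, and `losesScalarFMAs` are hypothetical names, and requiring the contract flag on both the multiply and its user is an assumption of this sketch. It mirrors the condition stated in the patch comment: every scalar in the list is a floating multiply feeding exactly one floating add/subtract, with contraction enabled.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for IR opcodes and instructions (hypothetical).
enum class Op { FMul, FAdd, FSub, Other };

struct Node {
  Op Opcode;
  bool AllowContract;              // stand-in for the 'contract' flag
  std::vector<const Node *> Users; // instructions that consume this value
};

// Returns true only if every value in VL is a contractable multiply whose
// single user is a contractable add or subtract, i.e. vectorizing VL would
// give up one scalar FMA opportunity per element.
bool losesScalarFMAs(const std::vector<const Node *> &VL) {
  if (VL.empty())
    return false;
  for (const Node *N : VL) {
    if (N->Opcode != Op::FMul || !N->AllowContract)
      return false;
    if (N->Users.size() != 1)
      return false;
    const Node *User = N->Users.front();
    bool IsAddSub = User->Opcode == Op::FAdd || User->Opcode == Op::FSub;
    if (!IsAddSub || !User->AllowContract)
      return false;
  }
  return true;
}
```

The adds themselves never appear in the list being costed, which is why a discount applied only when the cost of an add is queried would not be seen in this case.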

ABataev added inline comments. Jun 15 2022, 6:21 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	I mean, can we do it in TTI?

wjschmidt added inline comments. Jun 15 2022, 6:25 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	The problem is that you have to pick one way or the other to look at things if you do it in TTI. You either have to say, the cost of an add is free if it's fed by a multiply, or the cost of a multiply is free if it feeds into an add. If you do both of these things, you end up undercounting the cost. I don't see a practical way to do this in TTI that solves both problems in SLP. One is looking at things from a multiply point of view and the other from the add point of view.
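The undercounting concern can be made concrete with a toy calculation. It assumes unit costs for FMul, FAdd, and the fused FMA, so a fusable multiply/add pair truly costs 1; `Discounts` and `modeledPairCost` are hypothetical names used only for this sketch.

```cpp
#include <cassert>

// Which per-instruction discounts a hypothetical cost model applies.
struct Discounts {
  bool FreeAddFedByMul;   // an add is free when fed by a multiply
  bool FreeMulFeedingAdd; // a multiply is free when it feeds an add
};

// Modeled cost of one multiply/add pair under the chosen discounts,
// assuming unit costs for FMul and FAdd (and a true fused cost of 1).
int modeledPairCost(Discounts D) {
  const int MulCost = 1, AddCost = 1;
  int Mul = D.FreeMulFeedingAdd ? 0 : MulCost;
  int Add = D.FreeAddFedByMul ? 0 : AddCost;
  return Mul + Add;
}
```

Either discount alone yields the fused cost of 1, while applying both yields 0, which is lower than the cost of the FMA itself.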

ABataev added inline comments. Jun 15 2022, 6:29 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	Yes, agree. Can we try to have free muls if they are used by adds? Do you foresee any problems with this approach?

wjschmidt added inline comments. Jun 15 2022, 6:37 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	I implemented a prototype that gave us free adds if they use multiplies. I didn't see any regressions from this. I haven't tried it the other way around (free multiplies if used only by a single add each). If we did that, my guess is we probably still wouldn't see regressions. However, we also wouldn't see much benefit, because now the other case (the horizontal reduction case) will no longer be fixed by the TTI solution. Either way we end up having to have code in SLP that special-cases FMAs somewhere. My feeling is that it's sensible for TTI to provide the interface that calculates the savings of an FMA, but the burden needs to be on the users of TTI (SLP for now, but perhaps others) to determine when to use it. I really do understand why it would be preferable to hide everything in the TTI layer, so nobody ever has to worry about it again, but because users can come at the problem from two directions, it's hard for a simple interface like getArithmeticInstrCost to hide the complexity.

ABataev added inline comments. Jun 15 2022, 6:42 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	> However, we also wouldn't see much benefit, because now the other case (the horizontal reduction case) will no longer be fixed by the TTI solution.
	What's the problem with the reduction here? I understand, but let's try to evaluate possible solutions before making a final decision.

wjschmidt added inline comments. Jun 15 2022, 6:53 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction. It may be that looking at it from the multiply side will solve the problem for the horizontal reduction also, because the vectorizable tree cost will be increased. But let's be careful. Suppose some other optimization pass is looking at code where FMul feeds FAdd, and it is considering an optimization that removes the FAdd in place of some other code sequence. This pass will just do its cost calculation based on getArithmeticInstrCost for FAdd, since FMul is not part of the modification it's making. If we make the FMA cost savings calculation apply only for getArithmeticInstrCost for FMul, then this hypothetical optimization will not account for the cost of the broken FMA opportunity. I.e., we might be able to hide everything in TTI specifically for today's needs for SLP, but it doesn't appear to me to be a satisfactory general solution that hides everything in TTI for all possible uses. That's what makes me reluctant to do this.
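The hypothetical pass described in this comment can be put into numbers. The sketch assumes unit costs for FMul, FAdd, and FMA, and a cost model that discounts only the multiply side of a fusable pair; `modeledBenefit` and `actualBenefit` are illustrative names, not LLVM interfaces.

```cpp
#include <cassert>

// Toy unit costs (assumptions): FMul = 1, FAdd = 1, fused FMA = 1.
const int MulCost = 1, AddCost = 1, FMACost = 1;

// What a pass that queries only the FAdd believes it gains by replacing the
// add with a sequence costing ReplacementCost. The add is not discounted in
// this model, so the pass sees its full cost as removable.
int modeledBenefit(int ReplacementCost) { return AddCost - ReplacementCost; }

// What actually changes: before, the pair is a single FMA; afterwards the
// multiply stands alone next to the replacement sequence.
int actualBenefit(int ReplacementCost) {
  return FMACost - (MulCost + ReplacementCost);
}
```

Under these assumptions the modeled benefit exceeds the actual benefit by MulCost + AddCost - FMACost, which is exactly the FMA saving that the pass never asked about.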

ABataev added inline comments. Jun 15 2022, 6:57 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	> We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction.
	We can add a check for this in TTI too.
	> It may be that looking at it from the multiply side will solve the problem for the horizontal reduction also, because the vectorizable tree cost will be increased. But let's be careful. Suppose some other optimization pass is looking at code where FMul feeds FAdd, and it is considering an optimization that removes the FAdd in place of some other code sequence. This pass will just do its cost calculation based on getArithmeticInstrCost for FAdd, since FMul is not part of the modification it's making. If we make the FMA cost savings calculation apply only for getArithmeticInstrCost for FMul, then this hypothetical optimization will not account for the cost of the broken FMA opportunity.
	Same reason applies to SLP. If we miss something, we need to fix it for all passes. We'd better have one interface for all transformation passes, if possible.

wjschmidt added inline comments. Jun 15 2022, 7:00 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	> > We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction.
	> We can add a check for this in TTI too.
	But you can't, because then you've said both the multiply and the add are free, as I said before. This will break some cost calculations.

ABataev added inline comments. Jun 15 2022, 7:02 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	But for this we'll need to modify a different function (I thought about getArithmeticReductionCost()), though I'm not sure that we even need to do that. Everything should still be handled by the getArithmeticInstrCost function.

wjschmidt added inline comments. Jun 15 2022, 7:11 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9668	I remain very uncomfortable with trying to hide this in getArithmeticInstrCost for the reasons stated already; it's just too myopic an interface to be able to "look both ways" without making a costing error somewhere. But I may just be unable to see an obvious solution. Perhaps someone with more expertise in TTI would like to take a run at this.

dmgreen mentioned this in D131028: [AArch64] Fix cost model for FADD vector reduction. Aug 4 2022, 12:56 AM

fhahn mentioned this in D132872: [SLP] Account for loss due to FMAs when estimating fmul costs. Aug 29 2022, 11:01 AM

In D125987#3525414, @fhahn wrote:

I think this loss is not only relevant for horizontal reductions, but can pessimize things in general, right?

I stumbled upon an issue on AArch64 where ignoring scalar fusion opportunities notably regresses performance of some SLP sensitive applications. I put up D132872 to sketch a potential solution for the non-horizontal reduction case.

Herald added a subscriber: pcwang-thead. Aug 29 2022, 11:02 AM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	4 lines
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h	4 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	24 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	49 lines
llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll	34 lines

Diff 431351

llvm/include/llvm/Analysis/TargetTransformInfo.h

(1,249 lines not shown, context: InstructionCost getExtendedAddReductionCost()
        TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

    /// \returns The cost of Intrinsic instructions. Analyses the real arguments.
    /// Three cases are handled: 1. scalar instruction 2. vector instruction
    /// 3. scalar instruction which is to be vectorized.
    InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                          TTI::TargetCostKind CostKind) const;

+   /// \returns The cost savings from using a floating-point contract
+   /// instead of separate multiply and add/subtract.
+   InstructionCost getFMACostSavings(Type *Ty, FastMathFlags FMF) const;
+
    /// \returns The cost of Call instructions.
    InstructionCost getCallInstrCost(
        Function *F, Type *RetTy, ArrayRef<Type *> Tys,
        TTI::TargetCostKind CostKind = TTI::TCK_SizeAndLatency) const;

    /// \returns The number of pieces into which the provided type must be
    /// split during legalization. Zero is returned when the answer is unknown.
    unsigned getNumberOfParts(Type *Tp) const;
(1,277 lines not shown)

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

(30 lines not shown)
  class CmpInst;
  class DemandedBits;
  class DominatorTree;
  class Function;
  class GetElementPtrInst;
  class InsertElementInst;
  class InsertValueInst;
  class Instruction;
+ class InstructionCost;
  class LoopInfo;
  class OptimizationRemarkEmitter;
  class PHINode;
  class ScalarEvolution;
  class StoreInst;
  class TargetLibraryInfo;
  class TargetTransformInfo;
  class Value;
(36 lines not shown, context: private:)
    /// according to the underlying object of their pointer operands. We sort the
    /// instructions by their underlying objects to reduce the cost of
    /// consecutive access queries.
    ///
    /// TODO: We can further reduce this cost if we flush the chain creation
    ///       every time we run into a memory barrier.
    void collectSeedInstructions(BasicBlock *BB);

+   /// Adjust cost if vectorizing a tree will lose FMA opportunities.
+   void adjustForFMAs(InstructionCost &Cost, ArrayRef<Value *> &VL);
+
    /// Try to vectorize a chain that starts at two arithmetic instrs.
    bool tryToVectorizePair(Value *A, Value *B, slpvectorizer::BoUpSLP &R);

    /// Try to vectorize a list of operands.
    /// \param LimitForRegisterSize Vectorize only using maximal allowed register
    /// size.
    /// \returns true if a value was vectorized.
    bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R,
(51 lines not shown)

llvm/lib/Analysis/TargetTransformInfo.cpp

(902 lines not shown)
  TargetTransformInfo::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                             TTI::TargetCostKind CostKind) const {
    InstructionCost Cost = TTIImpl->getIntrinsicInstrCost(ICA, CostKind);
    assert(Cost >= 0 && "TTI should not produce negative costs!");
    return Cost;
  }

  InstructionCost
+ TargetTransformInfo::getFMACostSavings(Type *Ty, FastMathFlags FMF) const {
+
+   if (!FMF.allowContract())
+     return 0;
+
+   // Cost = Cost(FMul) + Cost(FAdd) - Cost(FMA).
+   TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+   InstructionCost MulCost =
+       getArithmeticInstrCost(Instruction::FMul, Ty, CostKind);
+   InstructionCost AddCost =
+       getArithmeticInstrCost(Instruction::FMul, Ty, CostKind);
    [Inline comment, wjschmidt (author): Pasto here, this should be Instruction::FAdd.]
+
+   Intrinsic::ID ID = Intrinsic::fmuladd;
+   SmallVector<Type *, 2> Tys;
+   Tys.push_back(Ty);
+   Tys.push_back(Ty);
+   ArrayRef<Type *> ArgTys(Tys);
+   IntrinsicCostAttributes CostAttrs(ID, Ty, ArgTys);
+   InstructionCost FMACost = getIntrinsicInstrCost(CostAttrs, CostKind);
+
+   return MulCost + AddCost - FMACost;
+ }
+
+ InstructionCost
  TargetTransformInfo::getCallInstrCost(Function *F, Type *RetTy,
                                        ArrayRef<Type *> Tys,
                                        TTI::TargetCostKind CostKind) const {
    InstructionCost Cost = TTIImpl->getCallInstrCost(F, RetTy, Tys, CostKind);
    assert(Cost >= 0 && "TTI should not produce negative costs!");
    return Cost;
  }

(298 lines not shown)

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Show First 20 Lines • Show All 9,659 Lines • ▼ Show 20 Lines	bool SLPVectorizerPass::tryToVectorizePair(Value A, Value B, BoUpSLP &R) {
if (!A \|\| !B)		if (!A \|\| !B)
return false;		return false;
if (isa<InsertElementInst>(A) \|\| isa<InsertElementInst>(B))		if (isa<InsertElementInst>(A) \|\| isa<InsertElementInst>(B))
return false;		return false;
Value *VL[] = {A, B};		Value *VL[] = {A, B};
return tryToVectorizeList(VL, R);		return tryToVectorizeList(VL, R);
}		}

		void SLPVectorizerPass::adjustForFMAs(InstructionCost &Cost,
		ABataevUnsubmitted Not Done Reply Inline Actions Can this be hidden in the TTI completely anyhow? To avoid the extra calls of the adjustForFMAs completely in SLP. ABataev: Can this be hidden in the TTI completely anyhow? To avoid the extra calls of the adjustForFMAs…
		RKSimonUnsubmitted Not Done Reply Inline Actions +1 - the only problem if that this will probably end up having to be added on a target-by-target basis in getArithmeticInstrCost - hopefully we can provide a helper wrapper so its only a one-liner in each target. RKSimon: +1 - the only problem if that this will probably end up having to be added on a target-by…
		wjschmidtAuthorUnsubmitted Done Reply Inline Actions Hi -- sorry, I'll be out of town for the next week, so can't look at this right away. In the meantime, can you please give me a little bit more idea exactly what you're looking for? I've expressed my concerns about making direct changes to getArithmeticInstrCost for this in the comments for this revision -- can you please address those? It's not at all clear to me what you're looking for other than "hide it from SLP somehow." wjschmidt: Hi -- sorry, I'll be out of town for the next week, so can't look at this right away. In the…
		wjschmidtAuthorUnsubmitted Done Reply Inline Actions I took a run at prototyping this yesterday for X86 TTI. I put a shim in X86TTIImpl::getArithmeticInstrCost() to look for cases of FAdd and FSub that we expect to be folded into FMAs, and reduce their cost appropriately. This was sufficient to prevent the horizontal reduction from happening, but as I expected, this doesn't help us with the tryToVectorizeList() case, where the cost modeling is only looking at the multiplies and not the add/sub fed by them. (I.e., the second variant in the test case isn't fixed.) I don't see any way to avoid explicit FMA checking in the SLP vectorizer to solve this, as explained in my summary for this revision. Given this result, I'd like to proceed with the patch provided here. It would be pleasant if we could hide the issue entirely from SLP, but it doesn't seem practical. Thanks! wjschmidt: I took a run at prototyping this yesterday for X86 TTI. I put a shim in X86TTIImpl…
		ABataevUnsubmitted Not Done Reply Inline Actions Could you provide more details why this does not help with tryToVectorizeList? ABataev: Could you provide more details why this does not help with tryToVectorizeList?
		wjschmidtAuthorUnsubmitted Done Reply Inline Actions Because tryToVectorizeList doesn't see the adds at all. The root of the tree is the multiplies. Each of those multiplies happens to feed an add that's outside the vectorizable tree. There's never any call to getArithmeticInstrCost for the adds, so modifying the cost there is of no help. wjschmidt: Because tryToVectorizeList doesn't see the adds at all. The root of the tree is the multiplies.
		ABataevUnsubmitted Not Done Reply Inline Actions Can we look through the users of these multiplies? ABataev: Can we look through the users of these multiplies?
		wjschmidtAuthorUnsubmitted Done Reply Inline Actions That's exactly what this patch is doing. wjschmidt: That's exactly what this patch is doing.
		wjschmidtAuthorUnsubmitted Done Reply Inline Actions To be clear, the current patch does fix the tryToVectorizeList case. It just doesn't hide that aspect of things in TTI. wjschmidt: To be clear, the current patch does fix the tryToVectorizeList case. It just doesn't hide that…
		ABataevUnsubmitted Not Done Reply Inline Actions I mean, can we do it in TTI? ABataev: I mean, can we do it in TTI?
		wjschmidtAuthorUnsubmitted Done Reply Inline Actions The problem is that you have to pick one way or the other to look at things if you do it in TTI. You either have to say, the cost of an add is free if it's fed by a multiply, or the cost of a multiply is free if it feeds into an add. If you do both of these things, you end up undercounting the cost. I don't see a practical way to do this in TTI that solves both problems in SLP. One is looking at things from a multiply point of view and the other from the add point of view. wjschmidt: The problem is that you have to pick one way or the other to look at things if you do it in TTI.
		ABataevUnsubmitted Not Done Reply Inline Actions Yes, agree. Can we try to have free muls if they are used by adds?Do you foresee any problems with this approach? ABataev: Yes, agree. Can we try to have free muls if they are used by adds?Do you foresee any problems…
		wjschmidtAuthorUnsubmitted Done Reply Inline Actions I implemented a prototype that gave us free adds if they use multiplies. I didn't see any regressions from this. I haven't tried it the other way around (free multiplies if used only by a single add each). If we did that, my guess is we probably still wouldn't see regressions. However, we also wouldn't see much benefit, because now the other case (the horizontal reduction case) will no longer be fixed by the TTI solution. Either way we end up having to have code in SLP that special-cases FMAs somewhere. My feeling is that it's sensible for TTI to provide the interface that calculates the savings of an FMA, but the burden needs to be on the users of TTI (SLP for now, but perhaps others) to determine when to use it. I really do understand why it would be preferable to hide everything in the TTI layer, so nobody ever has to worry about it again, but because users can come at the problem from two directions, it's hard for a simple interface like getArithmeticInstrCost to hide the complexity. wjschmidt: I implemented a prototype that gave us free adds if they use multiplies. I didn't see any…
		ABataevUnsubmitted Not Done Reply Inline Actions However, we also wouldn't see much benefit, because now the other case (the horizontal reduction case) will no longer be fixed by the TTI solution. What's the problem with the reduction here? I understand, but let's try to evaluate possible solutions before having final decision. ABataev: > However, we also wouldn't see much benefit, because now the other case (the horizontal…
		wjschmidt (Author): We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction. It may be that looking at it from the multiply side will solve the problem for the horizontal reduction also, because the vectorizable tree cost will be increased. But let's be careful. Suppose some other optimization pass is looking at code where FMul feeds FAdd, and it is considering an optimization that removes the FAdd in place of some other code sequence. This pass will just do its cost calculation based on getArithmeticInstrCost for FAdd, since FMul is not part of the modification it's making. If we make the FMA cost savings calculation apply only for getArithmeticInstrCost for FMul, then this hypothetical optimization will not account for the cost of the broken FMA opportunity. I.e., we might be able to hide everything in TTI specifically for today's needs for SLP, but it doesn't appear to me to be a satisfactory general solution that hides everything in TTI for all possible uses. That's what makes me reluctant to do this.
		ABataev:
		> We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction.
		We can add a check for this in TTI too.
		> It may be that looking at it from the multiply side will solve the problem for the horizontal reduction also, because the vectorizable tree cost will be increased. But let's be careful. Suppose some other optimization pass is looking at code where FMul feeds FAdd, and it is considering an optimization that removes the FAdd in place of some other code sequence. This pass will just do its cost calculation based on getArithmeticInstrCost for FAdd, since FMul is not part of the modification it's making. If we make the FMA cost savings calculation apply only for getArithmeticInstrCost for FMul, then this hypothetical optimization will not account for the cost of the broken FMA opportunity.
		Same reason applies to SLP. If we miss something, we need to fix it for all passes. We'd better have one interface for all transformation passes, if possible.
		wjschmidt (Author):
		> > We are doing a horizontal reduction of FAdds. At the time that we perform the reduction, we need to check whether this removes an FMA opportunity, as otherwise we are going to overstate the cost savings provided by the horizontal reduction.
		> We can add a check for this in TTI too.
		But you can't, because then you've said both the multiply and the add are free, as I said before. This will break some cost calculations.
		ABataev: But for this we'll need to modify a different function (I thought about getArithmeticReductionCost()), though I'm not sure that we even need to do that. Everything still should be handled by the getArithmeticCost function.
		wjschmidt (Author): I remain very uncomfortable with trying to hide this in getArithmeticInstrCost for the reasons stated already; it's just too myopic an interface to be able to "look both ways" without making a costing error somewhere. But I may just be unable to see an obvious solution. Perhaps someone with more expertise in TTI would like to take a run at this.
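The disagreement in this thread can be made concrete with a toy cost model. The sketch below is not LLVM code: `Policy`, `queriedCost`, and the unit costs are all hypothetical, chosen only to illustrate one reading of the argument that a per-instruction cost query has trouble "looking both ways" at a fusable FMul/FAdd pair.

```cpp
#include <cassert>

// Hypothetical toy model, not LLVM's TTI. All costs are made-up unit costs.
enum class Op { FMul, FAdd };
enum class Policy { NoDiscount, DiscountMulOnly, DiscountBoth };

const int BaseCost = 1;  // assumed cost of a standalone fmul or fadd
const int FusedCost = 1; // assumed cost of one fused multiply-add

// Cost reported for one instruction that is half of a fusable pair,
// depending on where the FMA discount is applied.
int queriedCost(Op O, Policy P) {
  switch (P) {
  case Policy::NoDiscount:
    return BaseCost;
  case Policy::DiscountMulOnly:
    return O == Op::FMul ? 0 : BaseCost;
  case Policy::DiscountBoth:
    return 0;
  }
  return BaseCost;
}

// What a pass believes the pair costs when it sums per-instruction queries.
int believedPairCost(Policy P) {
  return queriedCost(Op::FMul, P) + queriedCost(Op::FAdd, P);
}

// What a pass believes it saves by deleting only the FAdd of the pair.
int believedSavingFromRemovingAdd(Policy P) {
  return queriedCost(Op::FAdd, P);
}

// What is really saved: the fused op disappears, a standalone fmul remains.
int actualSavingFromRemovingAdd() { return FusedCost - BaseCost; }
```

Under these assumed costs, discounting only the multiply prices the pair correctly (1) but lets a pass that touches only the FAdd believe it saves 1 when it really saves 0, while discounting both prices the pair at 0. Neither outcome says anything about what real targets report; it only shows the shape of the problem being debated.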
		ArrayRef<Value *> &VL) {
		FastMathFlags FMF;
		FMF.set();

		// Only increase cost if every scalar in VL is a floating multiply feeding
		// exactly one floating add/subtract, and contracts are enabled.
		for (Value *U : VL) {
		ABataev: I think it must be a part of TTI::getArithmeticReductionCost
		wjschmidt (Author): Because this will also be called in a case where it's not part of a reduction, I'd like to keep it separate. (See my comment in the summary.) Perhaps I shouldn't have split the patches so this would be more evident.
		ABataev: Anyway, maybe it's better to make it a part of TTI rather than put it in SLPVectorizer?
		wjschmidt (Author): That's certainly possible, and probably a good idea. We'd need to pull out the final multiplication by Width and the SLP-specific debug stuff, but with that done it would fit nicely in TTI, and there might be potential for reuse.
		RKSimon: Ideally the getArithmeticReductionCost costs wouldn't be affected, but getArithmeticInstrCost could be adjusted to account for potential FMA patterns in the scalar/vector FADD costs.
		ABataev: Yep, good idea!
		wjschmidt (Author): I'll look into possibilities. That might be a bit awkward for the other use case. It's clear that I should have combined the two patches to make this easier to judge, so I'll do that in the next revision. Thanks for the ideas!
		auto *FPMO = dyn_cast<FPMathOperator>(U);
		if (!(FPMO &&
		FPMO->getOpcode() == Instruction::FMul &&
		FPMO->hasAllowContract() &&
		FPMO->hasOneUse()))
		return;
		auto *I = dyn_cast<Instruction>(FPMO->user_back());
		if (!(I &&
		(I->getOpcode() == Instruction::FAdd ||
		I->getOpcode() == Instruction::FSub) &&
		I->hasAllowContract()))
		return;
		FMF &= FPMO->getFastMathFlags();
		}

		auto *I = dyn_cast<Instruction>(VL[0]);
		InstructionCost FMASavings = TTI->getFMACostSavings(I->getType(), FMF);

		if (FMASavings > 0) {
		LLVM_DEBUG(dbgs() << "SLP: Adding cost " << VL.size() * FMASavings
		<< " for vectorization that breaks " << VL.size() << " FMAs\n");
		Cost += FMASavings * VL.size();
		}
		}

bool SLPVectorizerPass::tryToVectorizeList(ArrayRef<Value *> VL, BoUpSLP &R,		bool SLPVectorizerPass::tryToVectorizeList(ArrayRef<Value *> VL, BoUpSLP &R,
bool LimitForRegisterSize) {		bool LimitForRegisterSize) {
if (VL.size() < 2)		if (VL.size() < 2)
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize a list of length = "		LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize a list of length = "
<< VL.size() << ".\n");		<< VL.size() << ".\n");

[81 lines not shown]	for (unsigned I = NextInst; I < MaxInst; ++I) {
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;
R.reorderTopToBottom();		R.reorderTopToBottom();
R.reorderBottomToTop(!isa<InsertElementInst>(Ops.front()));		R.reorderBottomToTop(!isa<InsertElementInst>(Ops.front()));
R.buildExternalUses();		R.buildExternalUses();

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost();
		adjustForFMAs(Cost, Ops);
CandidateFound = true;		CandidateFound = true;
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);

if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",		R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
cast<Instruction>(Ops[0]))		cast<Instruction>(Ops[0]))
<< "SLP vectorized with cost " << ore::NV("Cost", Cost)		<< "SLP vectorized with cost " << ore::NV("Cost", Cost)
[1,017 lines not shown]	InstructionCost getReductionCost(TargetTransformInfo *TTI,
case RecurKind::Xor:		case RecurKind::Xor:
case RecurKind::FAdd:		case RecurKind::FAdd:
case RecurKind::FMul: {		case RecurKind::FMul: {
unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);		unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
if (!AllConsts)		if (!AllConsts)
VectorCost =		VectorCost =
TTI->getArithmeticReductionCost(RdxOpcode, VectorTy, FMF, CostKind);		TTI->getArithmeticReductionCost(RdxOpcode, VectorTy, FMF, CostKind);
ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy, CostKind);		ScalarCost = TTI->getArithmeticInstrCost(RdxOpcode, ScalarTy, CostKind);

		if (RdxKind == RecurKind::FAdd) {
		auto I = dyn_cast<Instruction>(FirstReducedVal);
		if (I &&
		I->getOpcode() == Instruction::FMul &&
		I->hasAllowContract()) {
		InstructionCost FMASavings = TTI->getFMACostSavings(ScalarTy, FMF);
		if (FMASavings > 0) {
		LLVM_DEBUG(dbgs() << "SLP: Adding cost " << ReduxWidth * FMASavings
		<< " for horizontal reduction that breaks "
		<< ReduxWidth << " FMAs\n");
		VectorCost += FMASavings * ReduxWidth;
		}
		}
		}
break;		break;
}		}
case RecurKind::FMax:		case RecurKind::FMax:
case RecurKind::FMin: {		case RecurKind::FMin: {
auto *SclCondTy = CmpInst::makeCmpResultType(ScalarTy);		auto *SclCondTy = CmpInst::makeCmpResultType(ScalarTy);
if (!AllConsts) {		if (!AllConsts) {
auto *VecCondTy =		auto *VecCondTy =
cast<VectorType>(CmpInst::makeCmpResultType(VectorTy));		cast<VectorType>(CmpInst::makeCmpResultType(VectorTy));
[1,035 lines not shown]
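The cost adjustment in this file can be sketched without any LLVM dependencies. The following is a minimal stand-alone model, assuming a hypothetical `ToyInst` type in place of LLVM IR values and a plain integer `SavingsPerFMA` parameter in place of whatever the patch's `getFMACostSavings` hook returns; it mirrors the structure of `adjustForFMAs` and the `Cost < -SLPCostThreshold` check, not their actual implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an IR instruction; not an LLVM type.
enum class Opcode { FMul, FAdd, FSub, Other };

struct ToyInst {
  Opcode Op;
  bool AllowContract;                 // models the 'contract' fast-math flag
  std::vector<const ToyInst *> Users; // instructions that consume this value
};

// True only if every candidate is a contractable multiply whose single user
// is a contractable add or subtract, i.e. each one could become a scalar FMA.
bool allFeedFusableAddSub(const std::vector<const ToyInst *> &VL) {
  if (VL.empty())
    return false;
  for (const ToyInst *I : VL) {
    if (!I || I->Op != Opcode::FMul || !I->AllowContract ||
        I->Users.size() != 1)
      return false;
    const ToyInst *U = I->Users.front();
    if (!U || (U->Op != Opcode::FAdd && U->Op != Opcode::FSub) ||
        !U->AllowContract)
      return false;
  }
  return true;
}

// Charges one lost FMA per vectorized multiply. SavingsPerFMA is an assumed
// per-FMA benefit supplied by the caller.
int adjustedCost(int TreeCost, const std::vector<const ToyInst *> &VL,
                 int SavingsPerFMA) {
  if (SavingsPerFMA > 0 && allFeedFusableAddSub(VL))
    return TreeCost + SavingsPerFMA * static_cast<int>(VL.size());
  return TreeCost;
}

// Vectorization is considered profitable when the cost is below -Threshold.
bool isProfitable(int Cost, int Threshold) { return Cost < -Threshold; }
```

With four fusable multiplies, an assumed tree cost of -3, and an assumed saving of 1 per FMA, the adjusted cost becomes 1 and the candidate is no longer profitable at threshold 0. If any multiply lacks the contract flag, the cost is left unchanged, matching the all-or-nothing check in the patch.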

llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -slp-vectorizer -S -mcpu=core-avx2 -mtriple=x86_64-unknown-linux-gnu -slp-threshold=-2 < %s | FileCheck %s		; RUN: opt -slp-vectorizer -S -mcpu=core-avx2 -mtriple=x86_64-unknown-linux-gnu -slp-threshold=-2 < %s | FileCheck %s

; This test checks for a case when a horizontal reduction of floating-point		; This test checks for a case when a horizontal reduction of floating-point
; adds may look profitable, but is not because it eliminates generation of		; adds may look profitable, but is not because it eliminates generation of
; floating-point FMAs that would be more profitable.		; floating-point FMAs that would be more profitable.

; FIXME: We generate a horizontal reduction today.

define void @hr() {		define void @hr() {
; CHECK-LABEL: @hr(		; CHECK-LABEL: @hr(
; CHECK-NEXT: br label [[LOOP:%.*]]		; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:		; CHECK: loop:
; CHECK-NEXT: [[PHI0:%.*]] = phi double [ 0.000000e+00, [[TMP0:%.*]] ], [ [[OP_RDX:%.*]], [[LOOP]] ]		; CHECK-NEXT: [[PHI0:%.*]] = phi double [ 0.000000e+00, [[TMP0:%.*]] ], [ [[ADD3:%.*]], [[LOOP]] ]
; CHECK-NEXT: [[CVT0:%.*]] = uitofp i16 0 to double		; CHECK-NEXT: [[CVT0:%.*]] = uitofp i16 0 to double
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> <double poison, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00>, double [[CVT0]], i32 0		; CHECK-NEXT: [[MUL0:%.*]] = fmul fast double 0.000000e+00, [[CVT0]]
; CHECK-NEXT: [[TMP2:%.*]] = fmul fast <4 x double> zeroinitializer, [[TMP1]]		; CHECK-NEXT: [[ADD0:%.*]] = fadd fast double [[MUL0]], [[PHI0]]
; CHECK-NEXT: [[TMP3:%.*]] = call fast double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> [[TMP2]])		; CHECK-NEXT: [[MUL1:%.*]] = fmul fast double 0.000000e+00, 0.000000e+00
; CHECK-NEXT: [[OP_RDX]] = fadd fast double [[TMP3]], [[PHI0]]		; CHECK-NEXT: [[ADD1:%.*]] = fadd fast double [[MUL1]], [[ADD0]]
		; CHECK-NEXT: [[MUL2:%.*]] = fmul fast double 0.000000e+00, 0.000000e+00
		; CHECK-NEXT: [[ADD2:%.*]] = fadd fast double [[MUL2]], [[ADD1]]
		; CHECK-NEXT: [[MUL3:%.*]] = fmul fast double 0.000000e+00, 0.000000e+00
		; CHECK-NEXT: [[ADD3]] = fadd fast double [[MUL3]], [[ADD2]]
; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[LOOP]]		; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[LOOP]]
; CHECK: exit:		; CHECK: exit:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
br label %loop		br label %loop

loop:		loop:
%phi0 = phi double [ 0.000000e+00, %0 ], [ %add3, %loop ]		%phi0 = phi double [ 0.000000e+00, %0 ], [ %add3, %loop ]
[12 lines not shown]	exit:
ret void		ret void
}		}

; This test checks for a case when either a horizontal reduction of		; This test checks for a case when either a horizontal reduction of
; floating-point adds, or vectorizing a tree of floating-point multiplies,		; floating-point adds, or vectorizing a tree of floating-point multiplies,
; may look profitable; but both are not because this eliminates generation		; may look profitable; but both are not because this eliminates generation
; of floating-point FMAs that would be more profitable.		; of floating-point FMAs that would be more profitable.

; FIXME: We generate a horizontal reduction today, and if that's disabled, we
; still vectorize some of the multiplies.

define double @hr_or_mul() {		define double @hr_or_mul() {
; CHECK-LABEL: @hr_or_mul(		; CHECK-LABEL: @hr_or_mul(
; CHECK-NEXT: [[CVT0:%.*]] = uitofp i16 3 to double		; CHECK-NEXT: [[CVT0:%.*]] = uitofp i16 3 to double
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> poison, double [[CVT0]], i32 0		; CHECK-NEXT: [[MUL0:%.*]] = fmul fast double 7.000000e+00, [[CVT0]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x double> [[TMP1]], <4 x double> poison, <4 x i32> zeroinitializer		; CHECK-NEXT: [[ADD0:%.*]] = fadd fast double [[MUL0]], [[CVT0]]
; CHECK-NEXT: [[TMP2:%.*]] = fmul fast <4 x double> <double 7.000000e+00, double -4.300000e+01, double 2.200000e-02, double 9.500000e+00>, [[SHUFFLE]]		; CHECK-NEXT: [[MUL1:%.*]] = fmul fast double -4.300000e+01, [[CVT0]]
; CHECK-NEXT: [[TMP3:%.*]] = call fast double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> [[TMP2]])		; CHECK-NEXT: [[ADD1:%.*]] = fadd fast double [[MUL1]], [[ADD0]]
; CHECK-NEXT: [[OP_RDX:%.*]] = fadd fast double [[TMP3]], [[CVT0]]		; CHECK-NEXT: [[MUL2:%.*]] = fmul fast double 2.200000e-02, [[CVT0]]
; CHECK-NEXT: ret double [[OP_RDX]]		; CHECK-NEXT: [[ADD2:%.*]] = fadd fast double [[MUL2]], [[ADD1]]
		; CHECK-NEXT: [[MUL3:%.*]] = fmul fast double 9.500000e+00, [[CVT0]]
		; CHECK-NEXT: [[ADD3:%.*]] = fadd fast double [[MUL3]], [[ADD2]]
		; CHECK-NEXT: ret double [[ADD3]]
;		;
%cvt0 = uitofp i16 3 to double		%cvt0 = uitofp i16 3 to double
RKSimon: why is this a constant?
wjschmidt (Author): This was reduced from a big messy function. Bugpoint ended up with an undef here and elsewhere, and previous reviewers asked me to remove the undefs.
%mul0 = fmul fast double 7.000000e+00, %cvt0		%mul0 = fmul fast double 7.000000e+00, %cvt0
%add0 = fadd fast double %mul0, %cvt0		%add0 = fadd fast double %mul0, %cvt0
%mul1 = fmul fast double -4.300000e+01, %cvt0		%mul1 = fmul fast double -4.300000e+01, %cvt0
%add1 = fadd fast double %mul1, %add0		%add1 = fadd fast double %mul1, %add0
%mul2 = fmul fast double 2.200000e-02, %cvt0		%mul2 = fmul fast double 2.200000e-02, %cvt0
%add2 = fadd fast double %mul2, %add1		%add2 = fadd fast double %mul2, %add1
%mul3 = fmul fast double 9.500000e+00, %cvt0		%mul3 = fmul fast double 9.500000e+00, %cvt0
%add3 = fadd fast double %mul3, %add2		%add3 = fadd fast double %mul3, %add2
ret double %add3		ret double %add3
}		}
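For readers who prefer source-level code to IR, the `@hr_or_mul` test body corresponds roughly to the C++ below. This is an illustrative sketch only: the function names are hypothetical, `std::fma` stands in for the fused instructions a backend could select for the scalar chain, and the pairwise summation order in the reduced form is an assumption (the `fast` flags permit reassociation, so the real order is up to the target).

```cpp
#include <cassert>
#include <cmath>

// Scalar form of @hr_or_mul: every multiply feeds exactly one add, so a
// target with FMA support can emit four fused multiply-adds.
double chainFused(double X) {
  double Acc = std::fma(7.0, X, X); // %mul0, %add0
  Acc = std::fma(-43.0, X, Acc);    // %mul1, %add1
  Acc = std::fma(0.022, X, Acc);    // %mul2, %add2
  Acc = std::fma(9.5, X, Acc);      // %mul3, %add3
  return Acc;
}

// Shape of the vectorized form: one 4-wide multiply, then a horizontal add
// of the products, then the final add of X. The multiplies no longer feed
// adds one-to-one, so the scalar FMA opportunities are gone.
double chainReduced(double X) {
  const double C[4] = {7.0, -43.0, 0.022, 9.5};
  double Prod[4];
  for (int I = 0; I < 4; ++I)
    Prod[I] = C[I] * X;
  double Sum = (Prod[0] + Prod[1]) + (Prod[2] + Prod[3]);
  return Sum + X;
}
```

Both forms compute the same value up to rounding; the test is about which form is cheaper, not about the result. In the IR, `%cvt0` is the constant 3 converted to double, so `X` would be `3.0` here.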

Revision Contents

Diff 431351

llvm/include/llvm/Analysis/TargetTransformInfo.h

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

llvm/lib/Analysis/TargetTransformInfo.cpp

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
