This is an archive of the discontinued LLVM Phabricator instance.

LoopVectorizer: let target limit memory intensive loops
Needs Review · Public

Authored by jonpa on Mar 8 2017, 3:55 AM.

Details

Summary

On SystemZ it is imperative during loop unrolling that the number of stores in the resulting loop does not grow past the point where the processor can no longer handle them all and severely slows down as a result. This happens when store tags run out. To avoid it during loop unrolling, the SystemZ backend counts the number of stores and, based on that sum, computes a limit on the number of iterations to produce.
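The store-counting scheme described above can be sketched as follows. The function name and tag-limit parameter are illustrative assumptions, not the actual SystemZ backend code:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical sketch: given the number of stores in the loop body and the
// number of store tags the processor can track, compute the largest unroll
// factor whose unrolled body still fits within the tag budget.
unsigned maxUnrollForStoreTags(unsigned NumStores, unsigned StoreTagLimit) {
  if (NumStores == 0)
    return UINT_MAX; // no stores: this heuristic imposes no limit
  // Unrolling by U multiplies the store count by U, so keep
  // U * NumStores <= StoreTagLimit.
  return std::max(1u, StoreTagLimit / NumStores);
}
```

The division rounds down, so the returned factor is the largest one whose unrolled body stays within the budget; the `std::max` keeps the original (non-unrolled) loop legal even when it already exceeds the budget.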

This problem should be handled during loop vectorization as well. The loop vectorizer may decide to vectorize a loop while scalarizing a particular (store) instruction, which increases the number of stores. It can also perform unrolling ("interleaving"), which also increases the number of stores.

In order to handle the case of scalarization, the widening decision must be available via a call to getWideningDecision(). Therefore this check must either be implemented in LoopVectorize.cpp, or the LoopVectorizationCostModel class must somehow be factored out of the file so that the target can query the InstWidening result for each store. I have started with the simpler option of implementing it directly in the LoopVectorizer, in the hope that this does not prove too crude to accept.

  • checkVectorizationFactorForMem() must be called after expectedCost(), so that the widening decisions for each VF are available.
  • Since getWideningDecision() is parameterized with VF, checkVectorizationFactorForMem() is called with each VF considered.
  • limitUnrollForMem() computes the max unroll factor in a similar fashion by counting stores. I felt I had to avoid the name limitInterleaveFactorForMem, because 'interleaving' is already used (in my opinion in a confusing way) for both memory-interleaving and unrolling.
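As a rough sketch of how scalarization inflates the store count in the check above (the names and the WideningKind enum are hypothetical stand-ins for the cost model's InstWidening):

```cpp
#include <cassert>
#include <vector>

// A scalarized store is emitted once per vector lane, so it contributes VF
// stores to the vector loop body; a widened store contributes just one.
enum class WideningKind { Widen, Scalarize };

unsigned countStoresForVF(const std::vector<WideningKind> &StoreDecisions,
                          unsigned VF) {
  unsigned NumStores = 0;
  for (WideningKind K : StoreDecisions)
    NumStores += (K == WideningKind::Scalarize) ? VF : 1;
  return NumStores;
}
```

This is why the check has to run per VF: the same widening decisions yield a different store count for each candidate vectorization factor.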

Diff Detail

Event Timeline

jonpa created this revision. Mar 8 2017, 3:55 AM
mssimpso edited edge metadata. Mar 8 2017, 12:22 PM

I think it might be better if the store-counting part of this was located in the SystemZ TTI implementation. So the hook would look something like TTI.hookName(WideningDecisions, VF). What do you think? Letting the targets see all the memory access decisions would probably be more generally useful. And I believe the only thing that would have to be exposed is the InstWidening enum.

Let me make sure that I understand: Are you trying to limit the number of store instructions/loop or are you trying to limit the number of stores/cycle?

jonpa updated this revision to Diff 91154. Mar 9 2017, 3:23 AM

I think it might be better if the store-counting part of this was located in the SystemZ TTI implementation. So the hook would look something like TTI.hookName(WideningDecisions, VF). What do you think? Letting the targets see all the memory access decisions would probably be more generally useful. And I believe the only thing that would have to be exposed is the InstWidening enum.

I was suspecting this might be generally preferred... :-)

  • Made the new hook instead return a pair of values {Max, Count}, with the Loop, WideningDecisions, and VF as arguments. (I put the new hook next to getOperandsScalarizationOverhead() and enableAggressiveInterleaving(), even though this seems to be in the 'Scalar' section of the header file, which makes me think it is actually not the right place?)
  • Factored out InstWidening and made a new class DecisionList with a member getWideningDecision() in order to avoid having to duplicate that code. Put it in TargetInstrInfo.h. Is it OK to declare these in the open, or should they perhaps be wrapped in a namespace or something?
  • Moved the actual counting of stores to the SystemZ implementation.
  • Added a short debug output in case this triggers.

Let me make sure that I understand: Are you trying to limit the number of store instructions/loop or are you trying to limit the number of stores/cycle?

It's about limiting the general input of stores into the processor.

...

Let me make sure that I understand: Are you trying to limit the number of store instructions/loop or are you trying to limit the number of stores/cycle?

It's about limiting the general input of stores into the processor.

Please be very clear about which problem you are solving. Let me explain:

  1. Imagine a processor that has a special cache for loop instructions, and this loop cache can only hold N stores. In this case, it is really important that the loop have no more than N store instructions.
  2. Imagine a processor that cannot sustain one store/cycle because of limited bandwidth or other resources. Over N cycles, only M (< N) stores can be processed. This might or might not be related to the size of the store.

I think you're in situation (2) where the size of the store does not matter (since you described some kind of tag resource). In that case, please don't count stores in the loop. You'll want to get the anticipated loop cost (which is the closest thing we have to a cycle count) for each VF and go from there. You might also limit the VF based on the number of stores that would appear in a row, because that will always exceed your limit and stall something.

If you're in (1), then your limit probably makes sense only for loops that will fit in the relevant cache.

In short, vectorized loops can be quite large and, whatever you're doing, just counting stores sounds like an insufficiently-precise model.
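A minimal sketch of the stores-per-cycle style check suggested here, using the cost model's loop cost as a stand-in for a cycle count (all names and the bandwidth parameters are illustrative assumptions):

```cpp
#include <cassert>

// Reject a vectorization factor whose store density would exceed what the
// processor can sustain: at most MaxStores stores per N "cycles", where the
// anticipated loop cost approximates the cycle count of one iteration.
bool storesFitBandwidth(unsigned NumStores, unsigned LoopCost,
                        unsigned MaxStores, unsigned N) {
  // NumStores / LoopCost <= MaxStores / N, rearranged to avoid division.
  return NumStores * N <= MaxStores * LoopCost;
}
```

Unlike a raw store count, this ratio stays meaningful as the vectorized loop body grows, which is the point being made above.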

uweigand edited edge metadata. Mar 9 2017, 6:56 AM

Please be very clear about which problem you are solving. Let me explain:

  1. Imagine a processor that has a special cache for loop instructions, and this loop cache can only hold N stores. In this case, it is really important that the loop have no more than N store instructions.
  2. Imagine a processor that cannot sustain one store/cycle because of limited bandwidth or other resources. Over N cycles, only M (< N) stores can be processed. This might or might not be related to the size of the store.

I think you're in situation (2) where the size of the store does not matter (since you described some kind of tag resource). In that case, please don't count stores in the loop. You'll want to get the anticipated loop cost (which is the closest thing we have to a cycle count) for each VF and go from there. You might also limit the VF based on the number of stores that would appear in a row, because that will always exceed your limit and stall something.

If you're in (1), then your limit probably makes sense only for loops that will fit in the relevant cache.

In short, vectorized loops can be quite large and, whatever you're doing, just counting stores sounds like an insufficiently-precise model.

The specific issue is neither quite (1) nor (2) as described above, but rather the following: while stores are "in flight" during the out-of-order execution, each store needs to be tagged with a store tag, which is used to make sure the out-of-order execution doesn't violate ordering constraints implied by the original instruction order. There is only a limited number of those store tags available, and if this is exceeded, the processor may have to flush the pipeline.

To avoid this, we want to ensure that at any point, the number of stores "in flight" simultaneously does not exceed this limit. The problem is that it is hard to predict *precisely* when each store is actually in flight; this depends on decisions of the out-of-order engine (e.g. when each instruction is issued) that are in general impossible to predict in the compiler. So we have to use some heuristic that achieves the effect we want in most cases.

When we originally looked into this issue, this was in the context of loop unrolling. We wanted to unroll loops, in particular small loops, in order to increase the overall execution rate, e.g. by reducing the total number of branches (and thereby pressure on the branch prediction subsystem). However, in some cases we noted that the unrolled loop performed worse than the original one. Investigation showed that due to the overall higher execution rate, the number of stores simultaneously in flight also increased, and now exceeded the limit of store tags, causing pipeline flushes.

To fix this, we wanted to stop loop unrolling before that happened, and tried to design a heuristic to detect that case. Using a heuristic modeled after your situation (2) didn't work well, since the ratio of stores to total instructions is mostly unaffected by unrolling; this metric doesn't track the secondary effects that were causing the difference we're seeing (e.g. prediction rate of the loop backwards branch). So we ended up with a simple heuristic modeled after your situation (1), i.e. simply don't unroll loops (any further) if the total number of stores in the loop exceeds a threshold.

Now of course in a very large loop, the presence of more stores than this threshold would be unlikely to cause any problems, and in that sense the heuristic is very conservative. However, for the purpose of loop unrolling this was acceptable, since for large loops it generally doesn't matter anyway if they're unrolled or not -- only for small loops is it really critical.

However, I can see your point that for a *vectorization* decision, this last statement is no longer true; we certainly do want to vectorize large loops! Jonas, I think this means we may indeed have to do something better than just counting the total number of stores in the loop ... (As always with these heuristics, this probably means tuning the parameters via actual measurements.)
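The threshold heuristic described in this reply can be sketched as follows; the function name and threshold value are illustrative, not the actual implementation:

```cpp
#include <cassert>

// Shrink a requested unroll factor until the unrolled body no longer exceeds
// a fixed store-count threshold; never go below 1 (the original loop).
unsigned limitUnrollByStoreThreshold(unsigned NumStoresInBody,
                                     unsigned RequestedUF,
                                     unsigned StoreThreshold) {
  unsigned UF = RequestedUF;
  while (UF > 1 && UF * NumStoresInBody > StoreThreshold)
    --UF;
  return UF;
}
```

Note how the clamp-to-1 floor encodes the conservatism discussed above: a large loop already over the threshold is simply left un-unrolled, which is acceptable for unrolling but, as pointed out, not for vectorization.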

Please be very clear about which problem you are solving. Let me explain:

  1. Imagine a processor that has a special cache for loop instructions, and this loop cache can only hold N stores. In this case, it is really important that the loop have no more than N store instructions.
  2. Imagine a processor that cannot sustain one store/cycle because of limited bandwidth or other resources. Over N cycles, only M (< N) stores can be processed. This might or might not be related to the size of the store.

I think you're in situation (2) where the size of the store does not matter (since you described some kind of tag resource). In that case, please don't count stores in the loop. You'll want to get the anticipated loop cost (which is the closest thing we have to a cycle count) for each VF and go from there. You might also limit the VF based on the number of stores that would appear in a row, because that will always exceed your limit and stall something.

If you're in (1), then your limit probably makes sense only for loops that will fit in the relevant cache.

In short, vectorized loops can be quite large and, whatever you're doing, just counting stores sounds like an insufficiently-precise model.

The specific issue is neither quite (1) nor (2) as described above, but rather the following: while stores are "in flight" during the out-of-order execution, each store needs to be tagged with a store tag, which is used to make sure the out-of-order execution doesn't violate ordering constraints implied by the original instruction order. There is only a limited number of those store tags available, and if this is exceeded, the processor may have to flush the pipeline.

To avoid this, we want to ensure that at any point, the number of stores "in flight" simultaneously does not exceed this limit. The problem is that it is hard to predict *precisely* when each store is actually in flight; this depends on decisions of the out-of-order engine (e.g. when each instruction is issued) that are in general impossible to predict in the compiler. So we have to use some heuristic that achieves the effect we want in most cases.

When we originally looked into this issue, this was in the context of loop unrolling. We wanted to unroll loops, in particular small loops, in order to increase the overall execution rate, e.g. by reducing the total number of branches (and thereby pressure on the branch prediction subsystem). However, in some cases we noted that the unrolled loop performed worse than the original one. Investigation showed that due to the overall higher execution rate, the number of stores simultaneously in flight also increased, and now exceeded the limit of store tags, causing pipeline flushes.

To fix this, we wanted to stop loop unrolling before that happened, and tried to design a heuristic to detect that case. Using a heuristic modeled after your situation (2) didn't work well, since the ratio of stores to total instructions is mostly unaffected by unrolling; this metric doesn't track the secondary effects that were causing the difference we're seeing (e.g. prediction rate of the loop backwards branch). So we ended up with a simple heuristic modeled after your situation (1), i.e. simply don't unroll loops (any further) if the total number of stores in the loop exceeds a threshold.

Now of course in a very large loop, the presence of more stores than this threshold would be unlikely to cause any problems, and in that sense the heuristic is very conservative. However, for the purpose of loop unrolling this was acceptable, since for large loops it generally doesn't matter anyway if they're unrolled or not -- only for small loops is it really critical.

However, I can see your point that for a *vectorization* decision, this last statement is no longer true; we certainly do want to vectorize large loops! Jonas, I think this means we may indeed have to do something better than just counting the total number of stores in the loop ... (As always with these heuristics, this probably means tuning the parameters via actual measurements.)

This makes a lot of sense to me, thanks for explaining!

mkuper added a subscriber: mkuper. Mar 9 2017, 10:27 PM
jonpa updated this revision to Diff 92616. Mar 22 2017, 3:44 AM

I reworked this so that the decisions for VF and IC are handled separately:

New method checkVectorizationFactorForMem()
New argument: unsigned getMaxInterleaveFactor(unsigned VF, DecisionList *WideningDecisions = nullptr) const;
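A sketch of how a target might use the new optional argument. DecisionListStub and all constants here are illustrative stand-ins for the patch's DecisionList and for target tuning values:

```cpp
#include <algorithm>
#include <cassert>

struct DecisionListStub {   // stand-in for the patch's DecisionList
  unsigned NumStores = 0;   // store count after widening/scalarization for VF
};

// The optional widening-decision list lets the target clamp its default
// interleave count based on how many stores the vector body will contain.
unsigned getMaxInterleaveFactor(unsigned VF,
                                const DecisionListStub *Decisions = nullptr) {
  const unsigned DefaultIC = 2;     // illustrative target default
  const unsigned StoreTagLimit = 8; // illustrative tag budget
  if (!Decisions || Decisions->NumStores == 0)
    return DefaultIC;
  unsigned Limit = std::max(1u, StoreTagLimit / Decisions->NumStores);
  return std::min(DefaultIC, Limit);
}
```

With a null list the hook behaves exactly as before, so existing callers that cannot supply widening decisions are unaffected.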

> Jonas, I think this means we may indeed have to do something better than just counting the total number of stores in the loop ...
See SystemZTTIImpl::checkVectorizationFactorForMem()