This is an archive of the discontinued LLVM Phabricator instance.

[LoopDataPrefetch, SystemZ] Increase the amount of prefetching.
Needs Review
Public

Authored by jonpa on Sep 19 2019, 3:40 AM.

Details

Reviewers

uweigand
anemet
jfb

Summary

I found that adding prefetch instructions greatly improved the performance of LBM. Currently clang emits no PFDs in the hot loop, whereas gcc does.

I experimented with some parameters to the LoopDataPrefetch pass:

-min-prefetch-stride: "Min stride to add prefetches", default is 2048 for SystemZ.
-loop-prefetch-writes: enables prefetching for stores in addition to loads (default off).
-prefetch-distance: "Number of instructions to prefetch ahead", default is 2000 for SystemZ.

The stride of the accesses in the LBM loop is 160, and the most important prefetches were those for the stores. To improve LBM, I therefore had to pass -min-prefetch-stride=160 and -loop-prefetch-writes. For this particular benchmark, also passing -prefetch-distance=4000 seemed to be slightly better still.
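
To make the effect of -min-prefetch-stride concrete, here is a minimal standalone sketch of a stride gate. It is a simplified model written for illustration, not the pass's actual code: the AccessStride struct and the strideLargeEnough name are hypothetical, and the real pass works on SCEV expressions rather than plain integers.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for the stride of one memory access in a loop.
// HasConstStride is false when the stride is not a compile-time constant.
struct AccessStride {
  bool HasConstStride;
  int64_t Value; // signed, since an access may walk backwards through memory
};

// Simplified model of the gate: emit a prefetch only when the absolute
// stride is known and is at least MinStride.
bool strideLargeEnough(AccessStride S, unsigned MinStride) {
  if (MinStride <= 1)
    return true; // any stride qualifies
  if (!S.HasConstStride)
    return false; // cannot prove that the stride is large enough
  uint64_t AbsStride = static_cast<uint64_t>(std::llabs(S.Value));
  return MinStride <= AbsStride;
}
```

Under this model a stride of 160 is rejected with the old SystemZ minimum of 2048 and accepted with a minimum of 160, 128 or 80, which matches why LBM needed a lower threshold.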

On the whole of SPEC, this is the number of pfd instructions emitted:

SPEC 2006 (z14):
prefetches          Loads          Stores
gcc z14              1330            1841
current clang         394              18  (mvc loop expansions)
(B)                  1650             632
(C)                   583             261
(D)                  3343            1060

(B) -min-prefetch-stride=128 -loop-prefetch-writes -prefetch-distance=4000
(C) -min-prefetch-stride=128 -loop-prefetch-writes -prefetch-distance=4000 -max-prefetch-iters-ahead=75
(D) -min-prefetch-stride=80  -loop-prefetch-writes -prefetch-distance=4000

In these initial experiments, B, C and D all improve LBM by ~15% while not affecting the other benchmarks much. C has an extra limit (-max-prefetch-iters-ahead=75), which makes for a smaller overall change compared to trunk.

This patch now reflects these initial experiments, although the chosen values are not necessarily optimal yet.
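
The patch moves the experimental values into SystemZ target defaults (distance 4000, minimum stride 80, writes enabled), while a flag that is passed explicitly still takes precedence. A minimal model of that precedence rule might look like the following; FlagModel and resolveSetting are hypothetical names used only for this sketch, standing in for a command-line option that tracks its number of occurrences.

```cpp
// Hypothetical stand-in for a command-line option that records how many
// times it appeared on the command line.
template <typename T> struct FlagModel {
  unsigned NumOccurrences; // 0 when the flag was not passed
  T Value;                 // value given on the command line, if any
};

// An explicitly passed flag wins; otherwise fall back to the target default.
template <typename T>
T resolveSetting(const FlagModel<T> &Flag, T TargetDefault) {
  return Flag.NumOccurrences > 0 ? Flag.Value : TargetDefault;
}
```

With this rule, experiments such as (B) and (C) can still be reproduced by passing the flags explicitly, even though the target now reports different defaults.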

One additional small improvement to the LoopDataPrefetch pass that I could see is to not emit prefetches in loops where the known constant trip count is smaller than the "iterations ahead" of the prefetch. This actually removed a few pfd instructions from the output: (D) becomes 3178 / 969. I guess this could be committed separately. It would consist of the three lines with 'LoopConstantTripCount' and the new test.
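
The trip count check can be sketched as a standalone function. This is a simplified model under stated assumptions: shouldPrefetchLoop is a hypothetical name, a TripCount of 0 stands for "not known at compile time", and the loop size of 40 used in the note below is made up for illustration.

```cpp
#include <cassert>

// Simplified model of the decisions made before any prefetch is emitted.
// TripCount == 0 means the trip count is not a known constant.
bool shouldPrefetchLoop(unsigned PrefetchDistance, unsigned LoopSize,
                        unsigned MaxItersAhead, unsigned TripCount) {
  assert(LoopSize > 0 && "loop must contain at least one instruction");

  // How many iterations ahead the prefetch has to target.
  unsigned ItersAhead = PrefetchDistance / LoopSize;
  if (ItersAhead == 0)
    ItersAhead = 1;

  // Existing limit: give up if we would have to look too far ahead.
  if (ItersAhead > MaxItersAhead)
    return false;

  // New check: the loop exits before the prefetched data would be used.
  if (TripCount != 0 && TripCount < ItersAhead)
    return false;

  return true;
}
```

For a loop with 7 iterations and an assumed size of 40, a distance of 2000 gives 50 iterations ahead and no prefetch, while a distance of 200 gives 5 iterations ahead and the prefetch is kept. That mirrors the FAR-PREFETCH and NEAR-PREFETCH cases in the new test.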

Diff Detail

Event Timeline

jonpa created this revision. Sep 19 2019, 3:40 AM

Herald added a subscriber: dexonsmith. Sep 19 2019, 3:40 AM

Revision Contents

Path                                               Size
include/llvm/Analysis/TargetTransformInfo.h        7 lines
include/llvm/Analysis/TargetTransformInfoImpl.h    2 lines
lib/Analysis/TargetTransformInfo.cpp               4 lines
lib/Target/SystemZ/SystemZTargetTransformInfo.h    5 lines
lib/Transforms/Scalar/LoopDataPrefetch.cpp         12 lines
test/CodeGen/SystemZ/prefetch-02.ll                78 lines

Diff 220831

include/llvm/Analysis/TargetTransformInfo.h

[...] public:
   /// adding SW prefetches. The default is 1, i.e. prefetch with any stride.
   unsigned getMinPrefetchStride() const;

   /// \return The maximum number of iterations to prefetch ahead. If the
   /// required number of iterations is more than this number, no prefetching is
   /// performed.
   unsigned getMaxPrefetchIterationsAhead() const;

+  /// \return True if prefetching should also be done for stores.
+  bool doPrefetchWrites() const;
+
   /// \return The maximum interleave factor that any transform should try to
   /// perform for this target. This number depends on the level of parallelism
   /// and the number of execution units in the CPU.
   unsigned getMaxInterleaveFactor(unsigned VF) const;

   /// Collect properties of V used in cost analysis, e.g. OP_PowerOf2.
   static OperandValueKind getOperandInfo(Value *V,
                                          OperandValueProperties &OpProps);
[...] public:
   virtual bool shouldConsiderAddressTypePromotion(
       const Instruction &I, bool &AllowPromotionWithoutCommonHeader) = 0;
   virtual unsigned getCacheLineSize() = 0;
   virtual llvm::Optional<unsigned> getCacheSize(CacheLevel Level) = 0;
   virtual llvm::Optional<unsigned> getCacheAssociativity(CacheLevel Level) = 0;
   virtual unsigned getPrefetchDistance() = 0;
   virtual unsigned getMinPrefetchStride() = 0;
   virtual unsigned getMaxPrefetchIterationsAhead() = 0;
+  virtual bool doPrefetchWrites() = 0;
   virtual unsigned getMaxInterleaveFactor(unsigned VF) = 0;
   virtual unsigned
   getArithmeticInstrCost(unsigned Opcode, Type *Ty, OperandValueKind Opd1Info,
                          OperandValueKind Opd2Info,
                          OperandValueProperties Opd1PropInfo,
                          OperandValueProperties Opd2PropInfo,
                          ArrayRef<const Value *> Args) = 0;
   virtual int getShuffleCost(ShuffleKind Kind, Type *Tp, int Index,
[...] public:
   }
   unsigned getPrefetchDistance() override { return Impl.getPrefetchDistance(); }
   unsigned getMinPrefetchStride() override {
     return Impl.getMinPrefetchStride();
   }
   unsigned getMaxPrefetchIterationsAhead() override {
     return Impl.getMaxPrefetchIterationsAhead();
   }
+  bool doPrefetchWrites() override {
+    return Impl.doPrefetchWrites();
+  }
   unsigned getMaxInterleaveFactor(unsigned VF) override {
     return Impl.getMaxInterleaveFactor(VF);
   }
   unsigned getEstimatedNumberOfCaseClusters(const SwitchInst &SI,
                                             unsigned &JTSize) override {
     return Impl.getEstimatedNumberOfCaseClusters(SI, JTSize);
   }
   unsigned
[...]

include/llvm/Analysis/TargetTransformInfoImpl.h

[...] public:
   }

   unsigned getPrefetchDistance() { return 0; }

   unsigned getMinPrefetchStride() { return 1; }

   unsigned getMaxPrefetchIterationsAhead() { return UINT_MAX; }

+  bool doPrefetchWrites() { return false; }
+
   unsigned getMaxInterleaveFactor(unsigned VF) { return 1; }

   unsigned getArithmeticInstrCost(unsigned Opcode, Type *Ty,
                                   TTI::OperandValueKind Opd1Info,
                                   TTI::OperandValueKind Opd2Info,
                                   TTI::OperandValueProperties Opd1PropInfo,
                                   TTI::OperandValueProperties Opd2PropInfo,
                                   ArrayRef<const Value *> Args) {
[...]

lib/Analysis/TargetTransformInfo.cpp

[...]
 unsigned TargetTransformInfo::getMinPrefetchStride() const {
   return TTIImpl->getMinPrefetchStride();
 }

 unsigned TargetTransformInfo::getMaxPrefetchIterationsAhead() const {
   return TTIImpl->getMaxPrefetchIterationsAhead();
 }

+bool TargetTransformInfo::doPrefetchWrites() const {
+  return TTIImpl->doPrefetchWrites();
+}
+
 unsigned TargetTransformInfo::getMaxInterleaveFactor(unsigned VF) const {
   return TTIImpl->getMaxInterleaveFactor(VF);
 }

 TargetTransformInfo::OperandValueKind
 TargetTransformInfo::getOperandInfo(Value *V, OperandValueProperties &OpProps) {
   OperandValueKind OpInfo = OK_AnyValue;
   OpProps = OP_None;
[...]

lib/Target/SystemZ/SystemZTargetTransformInfo.h

[...] public:

   /// \name Vector TTI Implementations
   /// @{

   unsigned getNumberOfRegisters(bool Vector);
   unsigned getRegisterBitWidth(bool Vector) const;

   unsigned getCacheLineSize() { return 256; }
-  unsigned getPrefetchDistance() { return 2000; }
-  unsigned getMinPrefetchStride() { return 2048; }
+  unsigned getPrefetchDistance() { return 4000; }
+  unsigned getMinPrefetchStride() { return 80; }
+  bool doPrefetchWrites() { return true; }

   bool hasDivRemOp(Type *DataType, bool IsSigned);
   bool prefersVectorizedAddressing() { return false; }
   bool LSRWithInstrQueries() { return true; }
   bool supportsEfficientVectorElementLoadStore() { return true; }
   bool enableInterleavedAccessVectorization() { return true; }

   int getArithmeticInstrCost(
[...]

lib/Transforms/Scalar/LoopDataPrefetch.cpp

[...]

 private:
   bool runOnLoop(Loop *L);

   /// Check if the stride of the accesses is large enough to
   /// warrant a prefetch.
   bool isStrideLargeEnough(const SCEVAddRecExpr *AR);

+  bool doPrefetchWrites() {
+    if (PrefetchWrites.getNumOccurrences() > 0)
+      return PrefetchWrites;
+    return TTI->doPrefetchWrites();
+  }
+
   unsigned getMinPrefetchStride() {
     if (MinPrefetchStride.getNumOccurrences() > 0)
       return MinPrefetchStride;
     return TTI->getMinPrefetchStride();
   }

   unsigned getPrefetchDistance() {
     if (PrefetchDistance.getNumOccurrences() > 0)
[...] bool LoopDataPrefetch::runOnLoop(Loop *L) {

   unsigned ItersAhead = getPrefetchDistance() / LoopSize;
   if (!ItersAhead)
     ItersAhead = 1;

   if (ItersAhead > getMaxPrefetchIterationsAhead())
     return MadeChange;

+  unsigned LoopConstantTripCount = SE->getSmallConstantTripCount(L);
+  if (LoopConstantTripCount && LoopConstantTripCount < ItersAhead)
+    return MadeChange;
+
   LLVM_DEBUG(dbgs() << "Prefetching " << ItersAhead
                     << " iterations ahead (loop size: " << LoopSize << ") in "
                     << L->getHeader()->getParent()->getName() << ": " << *L);

   SmallVector<std::pair<Instruction *, const SCEVAddRecExpr *>, 16> PrefLoads;
   for (const auto BB : L->blocks()) {
     for (auto &I : *BB) {
       Value *PtrValue;
       Instruction *MemI;

       if (LoadInst *LMemI = dyn_cast<LoadInst>(&I)) {
         MemI = LMemI;
         PtrValue = LMemI->getPointerOperand();
       } else if (StoreInst *SMemI = dyn_cast<StoreInst>(&I)) {
-        if (!PrefetchWrites) continue;
+        if (!doPrefetchWrites()) continue;
         MemI = SMemI;
         PtrValue = SMemI->getPointerOperand();
       } else continue;

       unsigned PtrAddrSpace = PtrValue->getType()->getPointerAddressSpace();
       if (PtrAddrSpace)
         continue;

[...]

test/CodeGen/SystemZ/prefetch-02.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z14 -prefetch-distance=2000 \
				; RUN: -stop-after=loop-data-prefetch | FileCheck %s -check-prefix=FAR-PREFETCH
				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z14 -prefetch-distance=200 \
				; RUN: -stop-after=loop-data-prefetch | FileCheck %s -check-prefix=NEAR-PREFETCH
				;
				; Check that prefetches are not emitted if the known constant trip count of
				; the loop is smaller than the estimated "iterations ahead" of the prefetch.
				;
				; FAR-PREFETCH-LABEL: fun
				; FAR-PREFETCH-NOT: call void @llvm.prefetch

				; NEAR-PREFETCH-LABEL: fun
				; NEAR-PREFETCH: call void @llvm.prefetch


				define void @fun(i32* nocapture %Src, i32* nocapture readonly %Dst) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next.9, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv
				store i32 %0, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 160
				%arrayidx.1 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next
				%1 = load i32, i32* %arrayidx.1, align 4
				%arrayidx2.1 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next
				store i32 %1, i32* %arrayidx2.1, align 4
				%indvars.iv.next.1 = add nuw nsw i64 %indvars.iv, 320
				%arrayidx.2 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.1
				%2 = load i32, i32* %arrayidx.2, align 4
				%arrayidx2.2 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.1
				store i32 %2, i32* %arrayidx2.2, align 4
				%indvars.iv.next.2 = add nuw nsw i64 %indvars.iv, 480
				%arrayidx.3 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.2
				%3 = load i32, i32* %arrayidx.3, align 4
				%arrayidx2.3 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.2
				store i32 %3, i32* %arrayidx2.3, align 4
				%indvars.iv.next.3 = add nuw nsw i64 %indvars.iv, 640
				%arrayidx.4 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.3
				%4 = load i32, i32* %arrayidx.4, align 4
				%arrayidx2.4 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.3
				store i32 %4, i32* %arrayidx2.4, align 4
				%indvars.iv.next.4 = add nuw nsw i64 %indvars.iv, 800
				%arrayidx.5 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.4
				%5 = load i32, i32* %arrayidx.5, align 4
				%arrayidx2.5 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.4
				store i32 %5, i32* %arrayidx2.5, align 4
				%indvars.iv.next.5 = add nuw nsw i64 %indvars.iv, 960
				%arrayidx.6 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.5
				%6 = load i32, i32* %arrayidx.6, align 4
				%arrayidx2.6 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.5
				store i32 %6, i32* %arrayidx2.6, align 4
				%indvars.iv.next.6 = add nuw nsw i64 %indvars.iv, 1120
				%arrayidx.7 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.6
				%7 = load i32, i32* %arrayidx.7, align 4
				%arrayidx2.7 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.6
				store i32 %7, i32* %arrayidx2.7, align 4
				%indvars.iv.next.7 = add nuw nsw i64 %indvars.iv, 1280
				%arrayidx.8 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.7
				%8 = load i32, i32* %arrayidx.8, align 4
				%arrayidx2.8 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.7
				store i32 %8, i32* %arrayidx2.8, align 4
				%indvars.iv.next.8 = add nuw nsw i64 %indvars.iv, 1440
				%arrayidx.9 = getelementptr inbounds i32, i32* %Dst, i64 %indvars.iv.next.8
				%9 = load i32, i32* %arrayidx.9, align 4
				%arrayidx2.9 = getelementptr inbounds i32, i32* %Src, i64 %indvars.iv.next.8
				store i32 %9, i32* %arrayidx2.9, align 4
				%indvars.iv.next.9 = add nuw nsw i64 %indvars.iv, 1600
				%cmp.9 = icmp ult i64 %indvars.iv.next.9, 11200
				br i1 %cmp.9, label %for.body, label %for.cond.cleanup
				}