This is an archive of the discontinued LLVM Phabricator instance.

[LoopDataPrefetch/AArch64] Allow selective prefetching of irregular symbolic strided accesses
AbandonedPublic

Authored by bmakam on Sep 22 2016, 9:50 AM.

Details

Reviewers

anemet
mcrosier

Summary

Irregular streams typically consist of symbolic strided accesses.

For example:

struct MyStruct {
  int field;
  char kk[60];
} *my_struct;

int f(struct MyStruct *p, struct MyStruct *q, int N) {
  int total = 0;
  struct MyStruct *r = p;
  for (int i = 0; i < N/300; ++i)
    for (r = p + i; r < q; r += N)
      total += r->field;
  return total;
}

This software prefetching scheme looks for such irregular symbolic strides and prefetches
'PrefetchDegree' cache lines ahead of the address being loaded. The patch adds a TTI interface,
'getPrefetchDegree', to query the target parameter.

This is currently enabled for Kryo only. A target has to set 'PrefetchDegree' to opt in to
prefetching that many cache lines ahead.
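
As a rough source-level picture of what the scheme aims for, the sketch below adds a prefetch hint to the example loop by hand. The constants kCacheLineSize and kPrefetchDegree and the function name f_with_prefetch are illustrative assumptions (the values mirror the Kryo settings in this diff). The pass itself works on LLVM IR, so this is not its actual output.

```cpp
#include <cstddef>
#include <cstdint>

// Record layout from the summary: 4-byte field plus 60 bytes of payload.
struct MyStruct {
  int field;
  char kk[60];
};

// Assumed for illustration; the diff sets CacheLineSize = 128 and
// PrefetchDegree = 1 for Kryo.
constexpr std::uintptr_t kCacheLineSize = 128;
constexpr std::uintptr_t kPrefetchDegree = 1;

// Same traversal as f() in the summary, with a manual next-line prefetch.
int f_with_prefetch(MyStruct *p, MyStruct *q, int N) {
  int total = 0;
  for (int i = 0; i < N / 300; ++i)
    for (MyStruct *r = p + i; r < q; r += N) {
      // The stride (N elements) is unknown at compile time, so instead of
      // prefetching a future iteration we hint the line(s) right after the
      // current address. The address is computed with integer arithmetic so
      // no out-of-bounds pointer is formed.
      std::uintptr_t Ahead =
          reinterpret_cast<std::uintptr_t>(r) + kPrefetchDegree * kCacheLineSize;
      __builtin_prefetch(reinterpret_cast<const void *>(Ahead));
      total += r->field;
    }
  return total;
}
```

__builtin_prefetch is the GCC/Clang builtin; it is only a hint and does not change the value computed by the loop.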

This improves performance of these Spec200x benchmarks on Kryo:

Benchmark	Diff (%)
spec2006/povray:ref	+1.738%
spec2006/gcc:ref	+1.749%
spec2006/mcf:ref	+7.936%
spec2000/gap:ref	+16.51%

Diff Detail

Event Timeline

bmakam updated this revision to Diff 72187.Sep 22 2016, 9:50 AM

bmakam retitled this revision from to [LoopDataPrefetch/AArch64] Allow selective prefetching of symbolic strided accesses.

bmakam updated this object.

bmakam added reviewers: anemet, mcrosier.

bmakam added a subscriber: llvm-commits.

Herald added subscribers: mzolotukhin, mcrosier, rengolin, aemerson.Sep 22 2016, 9:50 AM

junbuml added a subscriber: junbuml.Sep 22 2016, 9:53 AM

haicheng added a subscriber: haicheng.Sep 22 2016, 9:56 AM

Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:
for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];

There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.

In D24833#549827, @anemet wrote:
Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:
for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];
There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.

Sorry for the confusion. What I meant was address of B[i] can be represented as i*sizeof(B), so sizeof(B) is the loop-invariant stride here.

bmakam updated this object.Sep 22 2016, 11:20 AM

bmakam edited edge metadata.

mssimpso added a subscriber: mssimpso.Sep 23 2016, 10:44 AM

mssimpso added inline comments.Sep 23 2016, 11:06 AM

lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201	Would it be possible to reuse llvm::getStrideFromPointer instead of re-implementing some of the logic here?

bmakam added inline comments.Sep 23 2016, 11:18 AM

lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201	It seems that llvm::getStrideFromPointer is not suitable for reuse here. In particular, it assumes that the PtrAccessSize is always 1 and returns null when PtrAccessSize != StepVal. For all the interesting cases this scheme targets, the StepVal is a non-unit constant.

In D24833#549828, @bmakam wrote:
In D24833#549827, @anemet wrote:
Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:
for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];
There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.
Sorry for the confusion. What I meant was address of B[i] can be represented as i*sizeof(B), so sizeof(B) is the loop-invariant stride here.

Sorry but I still don't understand this. Can you please elaborate on where the A[B[i]] style access is in this loop?

In D24833#551000, @anemet wrote:
In D24833#549828, @bmakam wrote:
In D24833#549827, @anemet wrote:
Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:
for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];
There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.
Sorry for the confusion. What I meant was address of B[i] can be represented as i*sizeof(B), so sizeof(B) is the loop-invariant stride here.
Sorry but I still don't understand this. Can you please elaborate on where the A[B[i]] style access is in this loop?

Sorry, this is not for a A[B[i]] style access. This is currently only restricted to a symbolic stride access. I plan to expand this to other irregular patterns such as A[B[i]] style access, hash maps like A[f(i)]->B style access and recursive data structures like A->next style accesses, but it will require some more testing.

mssimpso added inline comments.Sep 23 2016, 11:57 AM

lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201	That's just for the non-gep case, though, right? For example, if the pointer operand of a load/store is a pointer induction variable instead of a gep. In that case it checks that the pointer operand is an addrec like Ptr + V, where V is non-constant. Shouldn't that work?

bmakam updated this object.Sep 26 2016, 3:12 AM

Added more test cases.

bmakam added inline comments.Sep 26 2016, 3:20 AM

lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201	Sometimes the GEP is a pointer but not an index. For example, in mcf's case the access pointer is:
%128 = getelementptr inbounds %struct.arc, %struct.arc* %111, i64 0, i32 0
and the SCEV is:
{((64 * %105)<nsw> + %1)<nsw>,+,(64 * %100)<nsw>}<nw><%108>
Here 64 is the size of %struct.arc and is a non-unit constant. I have added a test case to show where this won't work with llvm::getStrideFromPointer.
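
To make the distinction in this comment concrete, here is a minimal sketch that models the step of an add-recurrence {Start,+,Step} as a constant scale times an optional loop-invariant symbol. The Step struct and both helper functions are hypothetical stand-ins for illustration. They are not LLVM's SCEV classes and not the logic in this patch.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical model of an add-recurrence step: Step = Scale * Symbol.
// For the mcf access above, Scale is 64 (the size of %struct.arc) and
// Symbol is the loop-invariant value %100.
struct Step {
  std::int64_t Scale;                // constant factor in bytes
  std::optional<std::string> Symbol; // empty when the step is a plain constant
};

enum class StrideKind { Constant, Symbolic };

// A step with a symbolic factor has no compile-time byte stride.
StrideKind classify(const Step &S) {
  return S.Symbol ? StrideKind::Symbolic : StrideKind::Constant;
}

// Byte stride when it is known at compile time, otherwise nothing.
std::optional<std::int64_t> constantStride(const Step &S) {
  if (S.Symbol)
    return std::nullopt;
  return S.Scale;
}
```

With this model, Step{64, "%100"} is symbolic even though its scale is a known non-unit constant, which is the case the comment says the patch needs to handle.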

bmakam updated this revision to Diff 72453.Sep 26 2016, 3:23 AM

Hi Balaram,

This seems like a well made patch. Correctly enabling the feature, using the pre-fetch when it's profitable and with good tests.

I'll leave the remaining of the reviews and approval to Adam et al, but from my side, the change looks good.

cheers,
--renato

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll
7	Don't force the CPU here, we have the -prefetch-degree flag for that. Once we have a CPU for which prefetches aren't profitable, we can use Kryo vs. that one as an example, in addition to the flag-based ones.

Oh, I forgot, you state all the improvements on SPEC, did you have any regressions? What's the overall geomean?

I was expecting some regressions when your prefetch guess was wrong (there are always those cases), so this may indicate a slightly better heuristics when enabling this for other cores / workloads.

cheers,
--renato

In D24833#552533, @rengolin wrote:

Hi Balaram,

This seems like a well made patch. Correctly enabling the feature, using the pre-fetch when it's profitable and with good tests.

I'll leave the remaining of the reviews and approval to Adam et al, but from my side, the change looks good.

Thanks for the review, Renato.

Oh, I forgot, you state all the improvements on SPEC, did you have any regressions? What's the overall geomean?

I was expecting some regressions when your prefetch guess was wrong (there are always those cases), so this may indicate a slightly better heuristics when enabling this for other cores / workloads.

I have included all the gains and regressions in the summary. The only regression I saw on Kryo was -1.28% in spec2006/h264ref; the rest were all solid gains.

Thanks,
Balaram

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll
7	The LoopDataPrefetch pass only runs for targets that have PrefetchDistance set, so I am forcing the CPU to enable the software prefetcher. The -prefetch-degree flag enables the next-line prefetching heuristic for targets that set PrefetchDegree, which is currently only Kryo.
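
The gating described here can be sketched as follows, assuming the new flag follows the usual pattern in which a command-line option overrides the target value only when it is given. TargetParams, Flags and the three functions are hypothetical names for illustration. They are not the pass's real interface.

```cpp
#include <optional>

// Hypothetical per-target parameters; 0 means "not set by the target".
struct TargetParams {
  unsigned PrefetchDistance = 0;
  unsigned PrefetchDegree = 0;
};

// Hypothetical command-line overrides; empty means "flag not given".
struct Flags {
  std::optional<unsigned> PrefetchDistance;
  std::optional<unsigned> PrefetchDegree;
};

// A flag, when present, wins over the target value.
unsigned effective(const std::optional<unsigned> &Flag, unsigned TargetValue) {
  return Flag.value_or(TargetValue);
}

// The pass as a whole is gated on a non-zero prefetch distance.
bool passRuns(const Flags &F, const TargetParams &T) {
  return effective(F.PrefetchDistance, T.PrefetchDistance) != 0;
}

// The next-line heuristic additionally needs a non-zero prefetch degree.
bool nextLineHeuristicRuns(const Flags &F, const TargetParams &T) {
  return passRuns(F, T) && effective(F.PrefetchDegree, T.PrefetchDegree) != 0;
}
```

Under these assumptions, passing only -prefetch-degree on a target without PrefetchDistance leaves the pass disabled, which is why the test forces the Kryo CPU.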

In D24833#552654, @bmakam wrote:

I have included all the gains and regressions in the summary. The only regression I saw on Kryo was -1.28% in spec2006/h264ref; the rest were all solid gains.

Oh, I missed the negative result. :)

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll
7	Right, makes sense. Thanks!

@anemet/@mssimpso,

Do you have any additional comments on this change?

In D24833#554089, @bmakam wrote:

@anemet/@mssimpso,

Do you have any additional comments on this change?

Not yet, will look at it later today hopefully.

flyingforyou added a subscriber: flyingforyou.Sep 27 2016, 3:58 PM

anemet requested changes to this revision.Sep 28 2016, 8:33 PM

anemet edited edge metadata.

anemet added inline comments.

lib/Target/AArch64/AArch64Subtarget.cpp
76–77	So you are saying on one hand (MinPrefetchStride = 1024) that we shouldn't bother prefetching unless the stride is at least 1K but then you say (PrefetchDegree = 1) that you want to prefetch the very next cache line anytime the stride is not known at compile time. I feel that there is a contradiction here. The former suggests that you have a pretty powerful HW prefetcher, the latter that you don't and are willing to speculate aggressively to compensate for it. It seems that something is wrong with the model here.

This revision now requires changes to proceed.Sep 28 2016, 8:33 PM

bmakam added inline comments.Sep 28 2016, 10:21 PM

lib/Target/AArch64/AArch64Subtarget.cpp
76–77	Thanks for the review, Adam. You are right: if the stride is unknown at compile time we speculate and prefetch the next cache line, and if the stride is known we do not need to speculate, so we conservatively prefetch only for strides > 1K. Sorry, I do not see a contradiction here. Are you suggesting inserting runtime checks to determine that the unknown stride is > 1K and only then prefetching? I do not see a justification for the additional complexity, and the runtime checks might well hurt performance.

anemet added inline comments.Sep 28 2016, 10:39 PM

lib/Target/AArch64/AArch64Subtarget.cpp
76–77	No, I am not suggesting run-time checks. Accesses with unknown strides are only omitted if MinPrefetchStride is set (since we can't compare them at compile time). It seems to me that you don't want to set MinPrefetchStride for your target. Have you experimented with that? To explain the contradiction a bit more, your model says that you have a HW prefetcher that is able to track regular strides < 1024. Your patch contradicts this by saying that even small regular strides are worth prefetching -- the next cache line corresponds to a stride that is way less than 1024.
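
The policy being debated can be summarized in a small decision function. This is a sketch of the behavior as described in this thread. The names Action and decide are hypothetical, and the treatment of an unknown stride when MinPrefetchStride is unset is an assumption.

```cpp
#include <cstdint>
#include <optional>

enum class Action {
  None,        // emit no software prefetch
  StrideAhead, // existing behavior: prefetch a future iteration's address
  NextLines    // proposed: prefetch PrefetchDegree cache lines ahead
};

// StrideBytes is empty when the stride is not known at compile time.
Action decide(std::optional<std::uint64_t> StrideBytes,
              unsigned MinPrefetchStride, unsigned PrefetchDegree) {
  if (StrideBytes) {
    // Known stride: only worth prefetching once it reaches the minimum the
    // hardware prefetcher is assumed not to track.
    if (MinPrefetchStride <= 1 || *StrideBytes >= MinPrefetchStride)
      return Action::StrideAhead;
    return Action::None;
  }
  // Unknown stride with no minimum set: nothing to compare against, so the
  // existing stride-ahead prefetch applies (assumed unchanged by the patch).
  if (MinPrefetchStride <= 1)
    return Action::StrideAhead;
  // Unknown stride with a minimum set: the existing pass skips the access;
  // the proposal speculates on the next cache line(s) instead.
  return PrefetchDegree > 0 ? Action::NextLines : Action::None;
}
```

With the Kryo values from the diff (MinPrefetchStride = 1024, PrefetchDegree = 1), a known 64-byte stride gets no prefetch while an unknown stride gets a next-line prefetch, which is the asymmetry this comment objects to.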

bmakam added inline comments.Sep 28 2016, 11:06 PM

lib/Target/AArch64/AArch64Subtarget.cpp
76–77	> It seems to me that you don't want to set MinPrefetchStride for your target. Have you experimented with that?

We do need MinPrefetchStride for Kryo as it seems to benefit spec2006/milc. Spec2006/milc is interesting because it exhibits short stream behavior. These streams are too short (3x3 matrices using 16 bytes per element, so the total size of each matrix is 144 bytes [1]) to train the hardware prefetcher. However, the current model catches these because the stride is > 1K.

> Your patch contradicts this by saying that even small regular strides are worth prefetching -- the next cache line corresponds to a stride that is way less than 1024

The next cache line prefetching need not necessarily correspond to a stride < 1K. Consider for example an interleaved stream a, b, a+k, b+k, a+2k, b+2k, and so on. The inner loop stride is k but a and b are a cache line apart. In the first iteration of the outer loop we access a, a+k, a+2k and so on, and in the second iteration of the outer loop we access b, b+k, b+2k and so on. With the next line prefetching we prefetch b, b+k, b+2k, ... whenever we access a, a+k, a+2k, ... respectively. This is the reason spec2006/mcf improves.

[1] When Prefetching Works, When It Doesn't, and Why. Jaekyu Lee et al.
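
The interleaved-stream argument can be checked with a toy model. The sketch below records which cache lines a next-line prefetch would request during the first pass (a, a+k, a+2k, ...) and counts how many second-pass accesses (b, b+k, b+2k, ... with b one cache line after a) fall on those lines. The function name coveredAccesses and the 128-byte line size are assumptions. The model ignores cache capacity and eviction, so it says nothing about whether the lines are still resident when the second pass runs.

```cpp
#include <cstdint>
#include <set>

// Assumed cache line size, matching the Kryo value in this diff.
constexpr std::uint64_t kLine = 128;

// Counts second-pass accesses whose cache line was requested by a prefetch
// of `Degree` lines ahead during the first pass.
unsigned coveredAccesses(std::uint64_t A, std::uint64_t K, unsigned N,
                         unsigned Degree) {
  std::set<std::uint64_t> Prefetched;
  // First outer iteration: access A + i*K, prefetch the line Degree ahead.
  for (unsigned I = 0; I < N; ++I)
    Prefetched.insert((A + I * K) / kLine + Degree);

  // Second outer iteration: access B + i*K, where B is one line after A.
  const std::uint64_t B = A + kLine;
  unsigned Covered = 0;
  for (unsigned I = 0; I < N; ++I)
    Covered += static_cast<unsigned>(Prefetched.count((B + I * K) / kLine));
  return Covered;
}
```

For a line-aligned a and a stride that is a multiple of the line size, a degree of 1 covers every second-pass access in this model, while a degree of 0 covers none.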

anemet added inline comments.Sep 28 2016, 11:23 PM

lib/Target/AArch64/AArch64Subtarget.cpp
76–77	> We do need MinPrefetchStride for Kryo as it seems to benefit spec2006/milc. Spec2006/milc is interesting because it exhibits short stream behavior. These streams are too short (3x3 matrices using 16 bytes per element, so the total size of matrices is 144 bytes [1]) to train the hardware prefetcher. However, the current model catches these because the stride is > 1K.

The stride in milc is 2048 bytes but you don't need to set MinPrefetchStride to get that. MinPrefetchStride is the minimum. Even if you don't set it (i.e. = 0), it will cover the milc case.

> The next cache line prefetching need not necessarily correspond to a stride < 1K. Consider for example an interleaved stream a, b, a+k, b+k, a+2k, b+2k.... The inner loop stride is k but a and b are cache line apart. In the first iteration of the outer loop we access a, a+k,a+2k and so on.. and in second iteration of outer loop we access b, b+k, b+2k and so on... With the next line prefetching we prefetch b, b+k, b+2k... whenever we access a, a+k, a+2k... respectively. This is the reason spec2006/mcf improves.

Possibly, but this is not the situation you're analyzing for. You are really getting this by the chance of how you set up your heuristic.

bmakam added inline comments.Sep 28 2016, 11:39 PM

lib/Target/AArch64/AArch64Subtarget.cpp
76–77	> The stride is milc is 2048 bytes but you don't need to set MinPrefetchStride to get that. MinPrefetchStride is the minimum. Even if you don't set it (i.e. = 0), it will cover the milc case.

Yes, but this heuristic only improves spec2006/milc.

> Possibly, but this is not the situation you're analyzing for. You are really getting this by the chance of how you set up your heuristic.

The chance that the data in the next cache line will soon be accessed is high when the access pattern is irregular (stride unknown at compile time). This is the situation I am analyzing for, and the chance seems pretty high as it improves several benchmarks, not just spec2006/mcf.

Restricted to only irregular symbolic strides such as those found in spec2006/mcf and spec2000/gap. Please take a look.

bmakam retitled this revision from [LoopDataPrefetch/AArch64] Allow selective prefetching of symbolic strided accesses to [LoopDataPrefetch/AArch64] Allow selective prefetching of irregular symbolic strided accesses.Oct 4 2016, 10:38 PM

bmakam edited edge metadata.

Rebase and ping.

bmakam planned changes to this revision.Nov 23 2016, 7:50 AM

bmakam abandoned this revision.Jan 26 2017, 11:04 AM

Revision Contents

Path	Size
include/llvm/Analysis/TargetTransformInfo.h	5 lines
include/llvm/Analysis/TargetTransformInfoImpl.h	2 lines
lib/Analysis/TargetTransformInfo.cpp	4 lines
lib/Target/AArch64/AArch64Subtarget.h	2 lines
lib/Target/AArch64/AArch64Subtarget.cpp	1 line
lib/Target/AArch64/AArch64TargetTransformInfo.h	2 lines
lib/Target/AArch64/AArch64TargetTransformInfo.cpp	4 lines
lib/Transforms/Scalar/LoopDataPrefetch.cpp	112 lines
test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll	110 lines

Diff 74403

include/llvm/Analysis/TargetTransformInfo.h

[471 unchanged lines collapsed; context: public:]

   /// \return The size of a cache line in bytes.
   unsigned getCacheLineSize() const;

   /// \return How much before a load we should place the prefetch instruction.
   /// This is currently measured in number of instructions.
   unsigned getPrefetchDistance() const;

+  /// \return The number of cache lines ahead we should prefetch.
+  unsigned getPrefetchDegree() const;
+
   /// \return Some HW prefetchers can handle accesses up to a certain constant
   /// stride. This is the minimum stride in bytes where it makes sense to start
   /// adding SW prefetches. The default is 1, i.e. prefetch with any stride.
   unsigned getMinPrefetchStride() const;

   /// \return The maximum number of iterations to prefetch ahead. If the
   /// required number of iterations is more than this number, no prefetching is
   /// performed.
[237 unchanged lines collapsed; context: public:]
   virtual int getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual unsigned getNumberOfRegisters(bool Vector) = 0;
   virtual unsigned getRegisterBitWidth(bool Vector) = 0;
   virtual unsigned getCacheLineSize() = 0;
   virtual unsigned getPrefetchDistance() = 0;
+  virtual unsigned getPrefetchDegree() = 0;
   virtual unsigned getMinPrefetchStride() = 0;
   virtual unsigned getMaxPrefetchIterationsAhead() = 0;
   virtual unsigned getMaxInterleaveFactor(unsigned VF) = 0;
   virtual unsigned
   getArithmeticInstrCost(unsigned Opcode, Type *Ty, OperandValueKind Opd1Info,
                          OperandValueKind Opd2Info,
                          OperandValueProperties Opd1PropInfo,
                          OperandValueProperties Opd2PropInfo) = 0;
[198 unchanged lines collapsed; context: public:]
   unsigned getRegisterBitWidth(bool Vector) override {
     return Impl.getRegisterBitWidth(Vector);
   }

   unsigned getCacheLineSize() override {
     return Impl.getCacheLineSize();
   }
   unsigned getPrefetchDistance() override { return Impl.getPrefetchDistance(); }
+  unsigned getPrefetchDegree() override { return Impl.getPrefetchDegree(); }
   unsigned getMinPrefetchStride() override {
     return Impl.getMinPrefetchStride();
   }
   unsigned getMaxPrefetchIterationsAhead() override {
     return Impl.getMaxPrefetchIterationsAhead();
   }
   unsigned getMaxInterleaveFactor(unsigned VF) override {
     return Impl.getMaxInterleaveFactor(VF);
[220 unchanged lines collapsed]

include/llvm/Analysis/TargetTransformInfoImpl.h

[289 unchanged lines collapsed; context: public:]
   unsigned getNumberOfRegisters(bool Vector) { return 8; }

   unsigned getRegisterBitWidth(bool Vector) { return 32; }

   unsigned getCacheLineSize() { return 0; }

   unsigned getPrefetchDistance() { return 0; }

+  unsigned getPrefetchDegree() { return 0; }
+
   unsigned getMinPrefetchStride() { return 1; }

   unsigned getMaxPrefetchIterationsAhead() { return UINT_MAX; }

   unsigned getMaxInterleaveFactor(unsigned VF) { return 1; }

   unsigned getArithmeticInstrCost(unsigned Opcode, Type *Ty,
                                   TTI::OperandValueKind Opd1Info,
[277 unchanged lines collapsed]

lib/Analysis/TargetTransformInfo.cpp

[256 unchanged lines collapsed]
 unsigned TargetTransformInfo::getCacheLineSize() const {
   return TTIImpl->getCacheLineSize();
 }

 unsigned TargetTransformInfo::getPrefetchDistance() const {
   return TTIImpl->getPrefetchDistance();
 }

+unsigned TargetTransformInfo::getPrefetchDegree() const {
+  return TTIImpl->getPrefetchDegree();
+}
+
 unsigned TargetTransformInfo::getMinPrefetchStride() const {
   return TTIImpl->getMinPrefetchStride();
 }

 unsigned TargetTransformInfo::getMaxPrefetchIterationsAhead() const {
   return TTIImpl->getMaxPrefetchIterationsAhead();
 }

[239 unchanged lines collapsed]

lib/Target/AArch64/AArch64Subtarget.h

[81 unchanged lines collapsed; context: protected:]
   bool UseAlternateSExtLoadCVTF32Pattern = false;
   bool HasArithmeticBccFusion = false;
   bool HasArithmeticCbzFusion = false;
   bool DisableLatencySchedHeuristic = false;
   uint8_t MaxInterleaveFactor = 2;
   uint8_t VectorInsertExtractBaseCost = 3;
   uint16_t CacheLineSize = 0;
   uint16_t PrefetchDistance = 0;
+  uint16_t PrefetchDegree = 0;
   uint16_t MinPrefetchStride = 1;
   unsigned MaxPrefetchIterationsAhead = UINT_MAX;
   unsigned PrefFunctionAlignment = 0;
   unsigned PrefLoopAlignment = 0;
   unsigned MaxJumpTableSize = 0;

   // ReserveX18 - X18 is not available as a general purpose register.
   bool ReserveX18;
[94 unchanged lines collapsed; context: public:]
   bool hasArithmeticBccFusion() const { return HasArithmeticBccFusion; }
   bool hasArithmeticCbzFusion() const { return HasArithmeticCbzFusion; }
   unsigned getMaxInterleaveFactor() const { return MaxInterleaveFactor; }
   unsigned getVectorInsertExtractBaseCost() const {
     return VectorInsertExtractBaseCost;
   }
   unsigned getCacheLineSize() const { return CacheLineSize; }
   unsigned getPrefetchDistance() const { return PrefetchDistance; }
+  unsigned getPrefetchDegree() const { return PrefetchDegree; }
   unsigned getMinPrefetchStride() const { return MinPrefetchStride; }
   unsigned getMaxPrefetchIterationsAhead() const {
     return MaxPrefetchIterationsAhead;
   }
   unsigned getPrefFunctionAlignment() const { return PrefFunctionAlignment; }
   unsigned getPrefLoopAlignment() const { return PrefLoopAlignment; }

   unsigned getMaximumJumpTableSize() const { return MaxJumpTableSize; }
[53 unchanged lines collapsed]

lib/Target/AArch64/AArch64Subtarget.cpp

[67 unchanged lines collapsed; context: case ExynosM1:]
     PrefLoopAlignment = 3;
     MaxJumpTableSize = 12;
     break;
   case Kryo:
     MaxInterleaveFactor = 4;
     VectorInsertExtractBaseCost = 2;
     CacheLineSize = 128;
     PrefetchDistance = 740;
+    PrefetchDegree = 1;
     MinPrefetchStride = 1024;
     MaxPrefetchIterationsAhead = 11;
     break;
   case Vulcan:
     MaxInterleaveFactor = 4;
     break;
   case CortexA35: break;
   case CortexA53: break;
   case CortexA72: break;
[100 unchanged lines collapsed]

lib/Target/AArch64/AArch64TargetTransformInfo.h

[129 unchanged lines collapsed; context: public:]
   int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
                                  ArrayRef<unsigned> Indices, unsigned Alignment,
                                  unsigned AddressSpace);

   unsigned getCacheLineSize();

   unsigned getPrefetchDistance();

+  unsigned getPrefetchDegree();
+
   unsigned getMinPrefetchStride();

   unsigned getMaxPrefetchIterationsAhead();
   /// @}
 };

 } // end namespace llvm

 #endif

lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[... 627 lines elided ...]
	unsigned AArch64TTIImpl::getCacheLineSize() {			unsigned AArch64TTIImpl::getCacheLineSize() {
	return ST->getCacheLineSize();			return ST->getCacheLineSize();
	}			}

	unsigned AArch64TTIImpl::getPrefetchDistance() {			unsigned AArch64TTIImpl::getPrefetchDistance() {
	return ST->getPrefetchDistance();			return ST->getPrefetchDistance();
	}			}

				unsigned AArch64TTIImpl::getPrefetchDegree() {
				return ST->getPrefetchDegree();
				}

	unsigned AArch64TTIImpl::getMinPrefetchStride() {			unsigned AArch64TTIImpl::getMinPrefetchStride() {
	return ST->getMinPrefetchStride();			return ST->getMinPrefetchStride();
	}			}

	unsigned AArch64TTIImpl::getMaxPrefetchIterationsAhead() {			unsigned AArch64TTIImpl::getMaxPrefetchIterationsAhead() {
	return ST->getMaxPrefetchIterationsAhead();			return ST->getMaxPrefetchIterationsAhead();
	}			}

lib/Transforms/Scalar/LoopDataPrefetch.cpp

[... 46 lines elided ...]	PrefetchWrites("loop-prefetch-writes", cl::Hidden, cl::init(false),
cl::desc("Prefetch write addresses"));		cl::desc("Prefetch write addresses"));

static cl::opt<unsigned>		static cl::opt<unsigned>
PrefetchDistance("prefetch-distance",		PrefetchDistance("prefetch-distance",
cl::desc("Number of instructions to prefetch ahead"),		cl::desc("Number of instructions to prefetch ahead"),
cl::Hidden);		cl::Hidden);

static cl::opt<unsigned>		static cl::opt<unsigned>
		PrefetchDegree("prefetch-degree",
		cl::desc("Number of cache lines to prefetch ahead"),
		cl::Hidden);

		static cl::opt<unsigned>
MinPrefetchStride("min-prefetch-stride",		MinPrefetchStride("min-prefetch-stride",
cl::desc("Min stride to add prefetches"), cl::Hidden);		cl::desc("Min stride to add prefetches"), cl::Hidden);

static cl::opt<unsigned> MaxPrefetchIterationsAhead(		static cl::opt<unsigned> MaxPrefetchIterationsAhead(
"max-prefetch-iters-ahead",		"max-prefetch-iters-ahead",
cl::desc("Max number of iterations to prefetch ahead"), cl::Hidden);		cl::desc("Max number of iterations to prefetch ahead"), cl::Hidden);

STATISTIC(NumPrefetches, "Number of prefetches inserted");		STATISTIC(NumPrefetches, "Number of prefetches inserted");
[... 12 lines elided ...]

private:		private:
bool runOnLoop(Loop *L);		bool runOnLoop(Loop *L);

/// \brief Check if the the stride of the accesses is large enough to		/// \brief Check if the the stride of the accesses is large enough to
/// warrant a prefetch.		/// warrant a prefetch.
bool isStrideLargeEnough(const SCEVAddRecExpr *AR);		bool isStrideLargeEnough(const SCEVAddRecExpr *AR);

		/// \brief Look for irregular symbolic strides.
		bool isIrregularSymbolicStride(const SCEVAddRecExpr *AR, Loop *L);

unsigned getMinPrefetchStride() {		unsigned getMinPrefetchStride() {
if (MinPrefetchStride.getNumOccurrences() > 0)		if (MinPrefetchStride.getNumOccurrences() > 0)
return MinPrefetchStride;		return MinPrefetchStride;
return TTI->getMinPrefetchStride();		return TTI->getMinPrefetchStride();
}		}

unsigned getPrefetchDistance() {		unsigned getPrefetchDistance() {
if (PrefetchDistance.getNumOccurrences() > 0)		if (PrefetchDistance.getNumOccurrences() > 0)
return PrefetchDistance;		return PrefetchDistance;
return TTI->getPrefetchDistance();		return TTI->getPrefetchDistance();
}		}

		unsigned getPrefetchDegree() {
		if (PrefetchDegree.getNumOccurrences() > 0)
		return PrefetchDegree;
		return TTI->getPrefetchDegree();
		}

unsigned getMaxPrefetchIterationsAhead() {		unsigned getMaxPrefetchIterationsAhead() {
if (MaxPrefetchIterationsAhead.getNumOccurrences() > 0)		if (MaxPrefetchIterationsAhead.getNumOccurrences() > 0)
return MaxPrefetchIterationsAhead;		return MaxPrefetchIterationsAhead;
return TTI->getMaxPrefetchIterationsAhead();		return TTI->getMaxPrefetchIterationsAhead();
}		}

AssumptionCache *AC;		AssumptionCache *AC;
LoopInfo *LI;		LoopInfo *LI;
[... 53 lines elided ...]	bool LoopDataPrefetch::isStrideLargeEnough(const SCEVAddRecExpr *AR) {
// larger.		// larger.
if (!ConstStride)		if (!ConstStride)
return false;		return false;

unsigned AbsStride = std::abs(ConstStride->getAPInt().getSExtValue());		unsigned AbsStride = std::abs(ConstStride->getAPInt().getSExtValue());
return TargetMinStride <= AbsStride;		return TargetMinStride <= AbsStride;
}		}

		bool LoopDataPrefetch::isIrregularSymbolicStride(const SCEVAddRecExpr *AR,
		Loop *L) {
		// If there is no outer loop we cannot ensure the stride is irregular.
		const Loop *OuterLoop = L->getParentLoop();
		if (!OuterLoop)
		return false;

		// If the base address is varying, then it is an irregular access.
		const SCEV *Base = AR->getStart();
		if (SE->isLoopInvariant(Base, OuterLoop))
		return false;

		// Now check if it is a symbolic stride.
		const SCEV *V = AR->getStepRecurrence(*SE);
		if (!V)
		return false;

		const auto *M = dyn_cast<SCEVMulExpr>(V);
		if (!M)
		return false;
		// Check if the step value is a non-unit constant.
		// Be conservative and give up if step value is larger than the cache line.
		V = M->getOperand(0);
		if (V->getSCEVType() != scConstant)
		mssimpso (inline comment): Would it be possible to reuse llvm::getStrideFromPointer instead of re-implementing some of the logic here?
		bmakam (author, inline comment): It seems like llvm::getStrideFromPointer is not suitable for reuse here. In particular, it somehow assumes that the PtrAccessSize is always 1 and returns null when PtrAccessSize != StepVal. For all the interesting cases this scheme targets, the step value is a non-unit constant.
		mssimpso (inline comment): That's just for the non-gep case, though, right? For example, if the pointer operand of a load/store is a pointer induction variable instead of a gep. In that case it checks that the pointer operand is an addrec like Ptr + V, where V is non-constant. Shouldn't that work?
		bmakam (author, inline comment): Sometimes GEP is a pointer but not an index, for example in mcf's case the Access Ptr is: %128 = getelementptr inbounds %struct.arc, %struct.arc* %111, i64 0, i32 0 and the SCEV is: {((64 * %105)<nsw> + %1)<nsw>,+,(64 * %100)<nsw>}<nw><%108> Here 64 is the size of %struct.arc and is a non-unit constant. I have added a test case to show where this won't work with llvm::getStrideFromPointer.
		return false;

		if (V->isOne() || V->isAllOnesValue())
		return false;

		const APInt &APStepVal = cast<SCEVConstant>(V)->getAPInt();
		int64_t StepVal = APStepVal.getSExtValue();
		if (StepVal > TTI->getCacheLineSize())
		return false;

		V = M->getOperand(1);

		// Strip off casts.
		while (const auto *C = dyn_cast<SCEVCastExpr>(V))
		V = C->getOperand();

		const auto *U = dyn_cast<SCEVUnknown>(V);
		if (!U)
		return false;

		return (L->isLoopInvariant(U->getValue()));
		}
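The checks in isIrregularSymbolicStride, and the byte offset used for the next-line prefetch, can be summarised in a toy model that does not depend on LLVM. This is only a sketch of the decision logic as shown in this diff. `ToyAccess`, `isIrregularSymbolicStrideToy`, and `bytesAheadToy` are hypothetical names, and the booleans stand in for the ScalarEvolution and LoopInfo queries the real code performs.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for an access whose step has the shape (ConstFactor * Symbol).
// All fields are illustrative; the patch derives them from SCEV expressions.
struct ToyAccess {
  bool HasOuterLoop;              // the loop has a parent loop
  bool BaseInvariantInOuterLoop;  // start address does not vary in the outer loop
  bool StepHasConstFactor;        // first multiplicand of the step is a constant
  std::int64_t ConstFactor;       // e.g. the struct size, 64 in the mcf example
  bool SymbolInvariantInLoop;     // second multiplicand is invariant in the loop
};

// Mirrors the order of the checks in the diff above.
bool isIrregularSymbolicStrideToy(const ToyAccess &A,
                                  std::int64_t CacheLineSize) {
  assert(CacheLineSize > 0 && "cache line size must be positive");
  if (!A.HasOuterLoop)
    return false;  // cannot argue the stream is irregular
  if (A.BaseInvariantInOuterLoop)
    return false;  // base must vary across outer iterations
  if (!A.StepHasConstFactor)
    return false;  // step must look like Const * Symbol
  if (A.ConstFactor == 1 || A.ConstFactor == -1)
    return false;  // unit steps are rejected
  if (A.ConstFactor > CacheLineSize)
    return false;  // conservative: give up on large constants
  return A.SymbolInvariantInLoop;
}

// Distance in bytes of the next-line prefetch: CacheLineSize * PrefetchDegree.
std::uint64_t bytesAheadToy(unsigned CacheLineSize, unsigned PrefetchDegree) {
  return static_cast<std::uint64_t>(CacheLineSize) * PrefetchDegree;
}
```

With an assumed 64-byte line, an access shaped like the mcf example (constant factor 64, base varying in the outer loop, loop-invariant symbolic factor) passes every check, while a single-level loop is rejected at the first one.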

PreservedAnalyses LoopDataPrefetchPass::run(Function &F,		PreservedAnalyses LoopDataPrefetchPass::run(Function &F,
FunctionAnalysisManager &AM) {		FunctionAnalysisManager &AM) {
LoopInfo *LI = &AM.getResult<LoopAnalysis>(F);		LoopInfo *LI = &AM.getResult<LoopAnalysis>(F);
ScalarEvolution *SE = &AM.getResult<ScalarEvolutionAnalysis>(F);		ScalarEvolution *SE = &AM.getResult<ScalarEvolutionAnalysis>(F);
AssumptionCache *AC = &AM.getResult<AssumptionAnalysis>(F);		AssumptionCache *AC = &AM.getResult<AssumptionAnalysis>(F);
OptimizationRemarkEmitter *ORE =		OptimizationRemarkEmitter *ORE =
&AM.getResult<OptimizationRemarkEmitterAnalysis>(F);		&AM.getResult<OptimizationRemarkEmitterAnalysis>(F);
const TargetTransformInfo *TTI = &AM.getResult<TargetIRAnalysis>(F);		const TargetTransformInfo *TTI = &AM.getResult<TargetIRAnalysis>(F);
[... 68 lines elided ...]	for (const auto BB : L->blocks()) {

Metrics.analyzeBasicBlock(BB, *TTI, EphValues);		Metrics.analyzeBasicBlock(BB, *TTI, EphValues);
}		}
unsigned LoopSize = Metrics.NumInsts;		unsigned LoopSize = Metrics.NumInsts;
if (!LoopSize)		if (!LoopSize)
LoopSize = 1;		LoopSize = 1;

unsigned ItersAhead = getPrefetchDistance() / LoopSize;		unsigned ItersAhead = getPrefetchDistance() / LoopSize;
		unsigned BytesAhead = TTI->getCacheLineSize() * getPrefetchDegree();
if (!ItersAhead)		if (!ItersAhead)
ItersAhead = 1;		ItersAhead = 1;

if (ItersAhead > getMaxPrefetchIterationsAhead())
return MadeChange;

DEBUG(dbgs() << "Prefetching " << ItersAhead
<< " iterations ahead (loop size: " << LoopSize << ") in "
<< L->getHeader()->getParent()->getName() << ": " << *L);

SmallVector<std::pair<Instruction *, const SCEVAddRecExpr *>, 16> PrefLoads;		SmallVector<std::pair<Instruction *, const SCEVAddRecExpr *>, 16> PrefLoads;
for (const auto BB : L->blocks()) {		for (const auto BB : L->blocks()) {
for (auto &I : *BB) {		for (auto &I : *BB) {
Value *PtrValue;		Value *PtrValue;
Instruction *MemI;		Instruction *MemI;
		bool PrefetchNextLine = false;

if (LoadInst *LMemI = dyn_cast<LoadInst>(&I)) {		if (LoadInst *LMemI = dyn_cast<LoadInst>(&I)) {
MemI = LMemI;		MemI = LMemI;
PtrValue = LMemI->getPointerOperand();		PtrValue = LMemI->getPointerOperand();
} else if (StoreInst *SMemI = dyn_cast<StoreInst>(&I)) {		} else if (StoreInst *SMemI = dyn_cast<StoreInst>(&I)) {
if (!PrefetchWrites) continue;		if (!PrefetchWrites) continue;
MemI = SMemI;		MemI = SMemI;
PtrValue = SMemI->getPointerOperand();		PtrValue = SMemI->getPointerOperand();
} else continue;		} else continue;

unsigned PtrAddrSpace = PtrValue->getType()->getPointerAddressSpace();		auto *PtrTy = dyn_cast<PointerType>(PtrValue->getType());
		if (!PtrTy || PtrTy->isAggregateType())
		continue;

		unsigned PtrAddrSpace = PtrTy->getPointerAddressSpace();
if (PtrAddrSpace)		if (PtrAddrSpace)
continue;		continue;

if (L->isLoopInvariant(PtrValue))		if (L->isLoopInvariant(PtrValue))
continue;		continue;

const SCEV *LSCEV = SE->getSCEV(PtrValue);		const SCEV *LSCEV = SE->getSCEV(PtrValue);
const SCEVAddRecExpr *LSCEVAddRec = dyn_cast<SCEVAddRecExpr>(LSCEV);		const auto *LSCEVAddRec = dyn_cast<SCEVAddRecExpr>(LSCEV);
if (!LSCEVAddRec)		if (!LSCEVAddRec)
continue;		continue;

		if (getPrefetchDegree() > 0)
		PrefetchNextLine = isIrregularSymbolicStride(LSCEVAddRec, L);

		if (!PrefetchNextLine &&
		ItersAhead > getMaxPrefetchIterationsAhead())
		continue;

		if (PrefetchNextLine) {
		DEBUG(dbgs() << "Prefetching " << getPrefetchDegree()
		<< " cache line(s) ahead (irregular access: " << *PtrValue
		<< ") in " << L->getHeader()->getParent()->getName()
		<< ": " << *L);
		} else {
		DEBUG(dbgs() << "Prefetching " << ItersAhead
		<< " iterations ahead (loop size: " << LoopSize << ") in "
		<< L->getHeader()->getParent()->getName() << ": " << *L);
		}

// Check if the the stride of the accesses is large enough to warrant a		// Check if the the stride of the accesses is large enough to warrant a
// prefetch.		// prefetch.
if (!isStrideLargeEnough(LSCEVAddRec))		if (!PrefetchNextLine && !isStrideLargeEnough(LSCEVAddRec))
continue;		continue;

// We don't want to double prefetch individual cache lines. If this load		// We don't want to double prefetch individual cache lines. If this load
// is known to be within one cache line of some other load that has		// is known to be within one cache line of some other load that has
// already been prefetched, then don't prefetch this one as well.		// already been prefetched, then don't prefetch this one as well.
bool DupPref = false;		bool DupPref = false;
for (const auto &PrefLoad : PrefLoads) {		for (const auto &PrefLoad : PrefLoads) {
const SCEV *PtrDiff = SE->getMinusSCEV(LSCEVAddRec, PrefLoad.second);		const SCEV *PtrDiff = SE->getMinusSCEV(LSCEVAddRec, PrefLoad.second);
if (const SCEVConstant *ConstPtrDiff =		if (const SCEVConstant *ConstPtrDiff =
dyn_cast<SCEVConstant>(PtrDiff)) {		dyn_cast<SCEVConstant>(PtrDiff)) {
int64_t PD = std::abs(ConstPtrDiff->getValue()->getSExtValue());		int64_t PD = std::abs(ConstPtrDiff->getValue()->getSExtValue());
if (PD < (int64_t) TTI->getCacheLineSize()) {		if (PD < (int64_t) TTI->getCacheLineSize()) {
DupPref = true;		DupPref = true;
break;		break;
}		}
}		}
}		}
if (DupPref)		if (DupPref)
continue;		continue;

const SCEV *NextLSCEV = SE->getAddExpr(LSCEVAddRec, SE->getMulExpr(		// If this is a irregular symbolic stride, prefetch next cache line.
		const SCEV *NextLSCEV =
		PrefetchNextLine
		? SE->getAddExpr(LSCEVAddRec,
		SE->getConstant(LSCEVAddRec->getType(),
		BytesAhead))
		: SE->getAddExpr(
		LSCEVAddRec,
		SE->getMulExpr(
SE->getConstant(LSCEVAddRec->getType(), ItersAhead),		SE->getConstant(LSCEVAddRec->getType(), ItersAhead),
LSCEVAddRec->getStepRecurrence(*SE)));		LSCEVAddRec->getStepRecurrence(*SE)));
if (!isSafeToExpand(NextLSCEV, *SE))		if (!isSafeToExpand(NextLSCEV, *SE))
continue;		continue;

PrefLoads.push_back(std::make_pair(MemI, LSCEVAddRec));		PrefLoads.push_back(std::make_pair(MemI, LSCEVAddRec));

Type *I8Ptr = Type::getInt8PtrTy(BB->getContext(), PtrAddrSpace);		Type *I8Ptr = Type::getInt8PtrTy(BB->getContext(), PtrAddrSpace);
SCEVExpander SCEVE(*SE, I.getModule()->getDataLayout(), "prefaddr");		SCEVExpander SCEVE(*SE, I.getModule()->getDataLayout(), "prefaddr");
Value *PrefPtrValue = SCEVE.expandCodeFor(NextLSCEV, I8Ptr, MemI);		Value *PrefPtrValue = SCEVE.expandCodeFor(NextLSCEV, I8Ptr, MemI);
[... 23 lines elided ...]

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll

; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -loop-data-prefetch -max-prefetch-iters-ahead=1000 -S < %s | FileCheck %s --check-prefix=LARGE_PREFETCH --check-prefix=ALL		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -loop-data-prefetch -max-prefetch-iters-ahead=1000 -S < %s | FileCheck %s --check-prefix=LARGE_PREFETCH --check-prefix=ALL
; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -loop-data-prefetch -S < %s | FileCheck %s --check-prefix=NO_LARGE_PREFETCH --check-prefix=ALL		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -loop-data-prefetch -S < %s | FileCheck %s --check-prefix=NO_LARGE_PREFETCH --check-prefix=ALL
		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -loop-data-prefetch -S < %s | FileCheck %s --check-prefix=SYMBOLIC_PREFETCH --check-prefix=ALL
		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -loop-data-prefetch -prefetch-degree=0 -S < %s | FileCheck %s --check-prefix=NO_SYMBOLIC_PREFETCH --check-prefix=ALL
; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -passes=loop-data-prefetch -max-prefetch-iters-ahead=1000 -S < %s | FileCheck %s --check-prefix=LARGE_PREFETCH --check-prefix=ALL		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -passes=loop-data-prefetch -max-prefetch-iters-ahead=1000 -S < %s | FileCheck %s --check-prefix=LARGE_PREFETCH --check-prefix=ALL
; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -passes=loop-data-prefetch -S < %s | FileCheck %s --check-prefix=NO_LARGE_PREFETCH --check-prefix=ALL		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -passes=loop-data-prefetch -S < %s | FileCheck %s --check-prefix=NO_LARGE_PREFETCH --check-prefix=ALL
		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -passes=loop-data-prefetch -S < %s | FileCheck %s --check-prefix=SYMBOLIC_PREFETCH --check-prefix=ALL
		rengolin (inline comment): Don't force the CPU here, we have the -prefetch-degree for that. Once we have a CPU on which prefetches aren't profitable, we can use Kryo vs. that one as an example, in addition to the flag-based ones.
		bmakam (author, inline comment): The LoopDataPrefetch pass is gated by targets that have PrefetchDistance set, so I am forcing the CPU to enable the software prefetcher. The -prefetch-degree flag is to enable the next-line prefetching heuristic for targets that set the PrefetchDegree, which is only Kryo currently.
		rengolin (inline comment): Right, makes sense. Thanks!
		; RUN: opt -mcpu=kryo -mtriple=aarch64-gnu-linux -passes=loop-data-prefetch -prefetch-degree=0 -S < %s | FileCheck %s --check-prefix=NO_SYMBOLIC_PREFETCH --check-prefix=ALL

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"		target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"

; ALL-LABEL: @small_stride(		; ALL-LABEL: @small_stride(
define void @small_stride(double* nocapture %a, double* nocapture readonly %b) {		define void @small_stride(double* nocapture %a, double* nocapture readonly %b) {
entry:		entry:
br label %for.body		br label %for.body

[... 33 lines elided ...]	; NO_LARGE_PREFETCH-NOT: call void @llvm.prefetch
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 150		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 150
%exitcond = icmp eq i64 %indvars.iv.next, 160000		%exitcond = icmp eq i64 %indvars.iv.next, 160000
br i1 %exitcond, label %for.end, label %for.body		br i1 %exitcond, label %for.end, label %for.body

; ALL: for.end:		; ALL: for.end:
for.end: ; preds = %for.body		for.end: ; preds = %for.body
ret void		ret void
}		}

		; No Prefetching in the presence of regular symbolic strides:
		;
		; for (unsigned i = 0; i < 100; i++)
		; A[i + 1] = A[Stride * i] + B[i];

		; ALL-LABEL: @regular_symbolic_stride(
		define void @regular_symbolic_stride(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i64 %N,
		i64 %stride) {
		entry:
		br label %for.body

		; ALL: for.body:
		for.body: ; preds = %for.body, %entry
		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
		%mul = mul i64 %indvars.iv, %stride
		; SYMBOLIC_PREFETCH-NOT: call void @llvm.prefetch
		; NO_SYMBOLIC_PREFETCH-NOT: call void @llvm.prefetch
		%arrayidx = getelementptr inbounds i32, i32* %A, i64 %mul
		%load = load i32, i32* %arrayidx, align 4
		%arrayidx2 = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
		%load_1 = load i32, i32* %arrayidx2, align 4
		%add = add i32 %load_1, %load
		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
		%arrayidx_next = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
		store i32 %add, i32* %arrayidx_next, align 4
		%exitcond = icmp eq i64 %indvars.iv.next, %N
		br i1 %exitcond, label %for.end, label %for.body

		; ALL: for.end:
		for.end: ; preds = %for.body
		ret void
		}

		; Prefetching in the presence of irregular symbolic strides:
		;
		; struct MyStruct {
		; int field;
		; char kk[60];
		;} *my_struct;
		;
		;int f(struct MyStruct *p, struct MyStruct *q, int N) {
		; int total = 0;
		; struct MyStruct *r = p;
		; for (int i = 0; i < N/300; ++i)
		; for (r = p + i; r < q; r += N)
		; total += r->field;
		; return total;
		;}

		%struct.MyStruct = type { i32, [60 x i8] }

		@my_struct = local_unnamed_addr global %struct.MyStruct* null, align 8

		; ALL-LABEL: @irregular_symbolic_stride(
		define i32 @irregular_symbolic_stride(%struct.MyStruct* readonly %p, %struct.MyStruct* readnone %q, i32 %N) local_unnamed_addr {
		entry:
		%cmp23 = icmp sgt i32 %N, 299
		br i1 %cmp23, label %for.cond1.preheader.lr.ph, label %for.cond.cleanup

		for.cond1.preheader.lr.ph: ; preds = %entry
		%div27 = udiv i32 %N, 300
		%idx.ext4 = sext i32 %N to i64
		%0 = zext i32 %div27 to i64
		br label %for.cond1.preheader

		for.cond1.preheader: ; preds = %for.cond1.preheader.lr.ph, %for.inc6
		%indvars.iv = phi i64 [ 0, %for.cond1.preheader.lr.ph ], [ %indvars.iv.next, %for.inc6 ]
		%total.024 = phi i32 [ 0, %for.cond1.preheader.lr.ph ], [ %total.1.lcssa, %for.inc6 ]
		%add.ptr519 = getelementptr inbounds %struct.MyStruct, %struct.MyStruct* %p, i64 %indvars.iv
		%cmp220 = icmp ult %struct.MyStruct* %add.ptr519, %q
		br i1 %cmp220, label %for.body3.preheader, label %for.inc6

		for.body3.preheader: ; preds = %for.cond1.preheader
		br label %for.body3

		for.cond.cleanup.loopexit: ; preds = %for.inc6
		br label %for.cond.cleanup

		for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
		%total.0.lcssa = phi i32 [ 0, %entry ], [ %total.1.lcssa, %for.cond.cleanup.loopexit ]
		ret i32 %total.0.lcssa

		; ALL: for.body3:
		for.body3: ; preds = %for.body3.preheader, %for.body3
		%add.ptr522 = phi %struct.MyStruct* [ %add.ptr5, %for.body3 ], [ %add.ptr519, %for.body3.preheader ]
		%total.121 = phi i32 [ %add, %for.body3 ], [ %total.024, %for.body3.preheader ]
		%field = getelementptr inbounds %struct.MyStruct, %struct.MyStruct* %add.ptr522, i64 0, i32 0
		; SYMBOLIC_PREFETCH: call void @llvm.prefetch
		; NO_SYMBOLIC_PREFETCH-NOT: call void @llvm.prefetch
		%1 = load i32, i32* %field, align 4
		%add = add nsw i32 %1, %total.121
		%add.ptr5 = getelementptr inbounds %struct.MyStruct, %struct.MyStruct* %add.ptr522, i64 %idx.ext4
		%cmp2 = icmp ult %struct.MyStruct* %add.ptr5, %q
		br i1 %cmp2, label %for.body3, label %for.inc6.loopexit

		; ALL:for.inc6.loopexit
		for.inc6.loopexit: ; preds = %for.body3
		br label %for.inc6

		for.inc6: ; preds = %for.inc6.loopexit, %for.cond1.preheader
		%total.1.lcssa = phi i32 [ %total.024, %for.cond1.preheader ], [ %add, %for.inc6.loopexit ]
		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
		%cmp = icmp slt i64 %indvars.iv.next, %0
		br i1 %cmp, label %for.cond1.preheader, label %for.cond.cleanup.loopexit
		}

Revision Contents

Diff 74403

include/llvm/Analysis/TargetTransformInfo.h

include/llvm/Analysis/TargetTransformInfoImpl.h

lib/Analysis/TargetTransformInfo.cpp

lib/Target/AArch64/AArch64Subtarget.h

lib/Target/AArch64/AArch64Subtarget.cpp

lib/Target/AArch64/AArch64TargetTransformInfo.h

lib/Target/AArch64/AArch64TargetTransformInfo.cpp

lib/Transforms/Scalar/LoopDataPrefetch.cpp

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll