This is an archive of the discontinued LLVM Phabricator instance.

Differential D92985

[SystemZTTIImpl::getMinPrefetchStride] Allow some non-prefetched mem accesses.
ClosedPublic

Authored by jonpa on Dec 9 2020, 5:20 PM.

Download Raw Diff

Details

Reviewers

uweigand

Commits

rG0c2d23933f06: [SystemZTTIImpl] Allow some non-prefetched accesses in getMinPrefetchStride().

Summary

The performance improvement on LBM previously achieved with improved software prefetching (36d4421) seems to have gone lost with https://reviews.llvm.org/D88789. This commit (e00f189) removes a canonicalization that changed 'load double -> store double' to use i64 instead. I notice that there now is one memory access in the loop that LoopDataPrefetch cannot handle, while before there was none. This means that when getMinPrefetchStride() is called on LBM_performStreamCollideTRT NumStridedMemAccesses=92 but NumMemAccesses=93, so the test fails.

That memory instruction gets a SCEV that is not an LSCEVAddRec, but looks like

((8 * r({0,+,20}<nuw><nsw><%for.body> + %.sink571))<nsw> + %dstGrid)

, which is similar to one that is, e.g.

{(152 + %srcGrid)<nuw>,+,160}<nuw><%for.body>

There is still a stride it seems (+20), but it is not as simple of an expression as an LSCEVAddRec which LoopDataPrefetch can handle.

The reasoning in getMinPrefetchStride() tries to establish cases where software prefetching seems likely to be beneficial: too many (more than 16) prefetches may by themselves put an extra load on the LSU which could be bad. If the relatively few prefetches cover many more (more than 32) memory accesses, and there are not any other (non-prefetched) memory accesses that stress the memory systems to make prefetching less likely beneficial, then emit the prefetches. It doesn't seem right to bail just for one non-prefetched access, so it seems reasonable to allow a relatively very small amount of non-prefetched instructions, to make the heuristic more stable. This patch suggests allowing 1 non-prefetched memory access per 32 prefetched ones. This handles LBM, and also gives two prefetches to imagick which however do not seem to play a role.

Alternatively, it might be possible to find a refined heuristic that would differntiate between "stride not found" instructions and "stride found, but not handled" (like the one above). Out of the non-prefetched instructions (NumMemAccesses - NumStridedMemAccesses), there are those that have some kind of known stride, and then also those that do not have any kind of stride at all. A stride would load the h/w prefetcher, which may make s/w prefetching more desirable, but there is also the general load of the memory systems to consider...

Measuring LBM built with e00f189 and the one immediately before (last commit where prefetching happened as intended), it seems that if -min-prefetch-stride=1 is used on both versions, e00f189 may be slightly preferred (34% speedup vs 32%). So e00f189 doesn't seem to be necessarily bad by itself.

Diff Detail

Event Timeline

jonpa created this revision.Dec 9 2020, 5:20 PM

Herald added subscribers: javed.absar, hiraditya. · View Herald TranscriptDec 9 2020, 5:20 PM

jonpa requested review of this revision.Dec 9 2020, 5:20 PM

Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2020, 5:20 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

It doesn't seem right to bail just for one non-prefetched access, so it seems reasonable to allow a relatively very small amount of non-prefetched instructions, to make the heuristic more stable. This patch suggests allowing 1 non-prefetched memory access per 32 prefetched ones. This handles LBM, and also gives two prefetches to imagick which however do not seem to play a role.

This sounds reasonable to me, but this is the kind of question that in the end only measurement can decide ... What is the overall effect of this patch on benchmarks? Anything else beyond lbm and imagick?

In D92985#2445068, @uweigand wrote:

It doesn't seem right to bail just for one non-prefetched access, so it seems reasonable to allow a relatively very small amount of non-prefetched instructions, to make the heuristic more stable. This patch suggests allowing 1 non-prefetched memory access per 32 prefetched ones. This handles LBM, and also gives two prefetches to imagick which however do not seem to play a role.

This sounds reasonable to me, but this is the kind of question that in the end only measurement can decide ... What is the overall effect of this patch on benchmarks? Anything else beyond lbm and imagick?

Only two files changed in total, one each for lbm/imagick.

OK, I think this is fine then.

This revision is now accepted and ready to land.Dec 11 2020, 12:30 AM

Closed by commit rG0c2d23933f06: [SystemZTTIImpl] Allow some non-prefetched accesses in getMinPrefetchStride(). (authored by jonpa). · Explain WhyDec 11 2020, 4:08 PM

This revision was automatically updated to reflect the committed changes.

jonpa added a commit: rG0c2d23933f06: [SystemZTTIImpl] Allow some non-prefetched accesses in getMinPrefetchStride()..

Revision Contents

Path

Size

llvm/

lib/

Target/

SystemZ/

SystemZTargetTransformInfo.cpp

4 lines

Diff 310723

llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp

Show First 20 Lines • Show All 335 Lines • ▼ Show 20 Lines	unsigned SystemZTTIImpl::getMinPrefetchStride(unsigned NumMemAccesses,
unsigned NumPrefetches,		unsigned NumPrefetches,
bool HasCall) const {		bool HasCall) const {
// Don't prefetch a loop with many far apart accesses.		// Don't prefetch a loop with many far apart accesses.
if (NumPrefetches > 16)		if (NumPrefetches > 16)
return UINT_MAX;		return UINT_MAX;

// Emit prefetch instructions for smaller strides in cases where we think		// Emit prefetch instructions for smaller strides in cases where we think
// the hardware prefetcher might not be able to keep up.		// the hardware prefetcher might not be able to keep up.
if (NumStridedMemAccesses > 32 &&		if (NumStridedMemAccesses > 32 && !HasCall &&
NumStridedMemAccesses == NumMemAccesses && !HasCall)		(NumMemAccesses - NumStridedMemAccesses) * 32 <= NumStridedMemAccesses)
return 1;		return 1;

return ST->hasMiscellaneousExtensions3() ? 8192 : 2048;		return ST->hasMiscellaneousExtensions3() ? 8192 : 2048;
}		}

bool SystemZTTIImpl::hasDivRemOp(Type *DataType, bool IsSigned) {		bool SystemZTTIImpl::hasDivRemOp(Type *DataType, bool IsSigned) {
EVT VT = TLI->getValueType(DL, DataType);		EVT VT = TLI->getValueType(DL, DataType);
return (VT.isScalarInteger() && TLI->isTypeLegal(VT));		return (VT.isScalarInteger() && TLI->isTypeLegal(VT));
▲ Show 20 Lines • Show All 829 Lines • Show Last 20 Lines