This is an archive of the discontinued LLVM Phabricator instance.

[SystemZTTIImpl::getMinPrefetchStride] Allow some non-prefetched mem accesses.
ClosedPublic

Authored by jonpa on Dec 9 2020, 5:20 PM.

Details

Summary

The performance improvement on LBM previously achieved with improved software prefetching (36d4421) seems to have gone lost with https://reviews.llvm.org/D88789. This commit (e00f189) removes a canonicalization that changed 'load double -> store double' to use i64 instead. I notice that there now is one memory access in the loop that LoopDataPrefetch cannot handle, while before there was none. This means that when getMinPrefetchStride() is called on LBM_performStreamCollideTRT NumStridedMemAccesses=92 but NumMemAccesses=93, so the test fails.

That memory instruction gets a SCEV that is not an LSCEVAddRec, but looks like

((8 * r({0,+,20}<nuw><nsw><%for.body> + %.sink571))<nsw> + %dstGrid)

, which is similar to one that is, e.g.

{(152 + %srcGrid)<nuw>,+,160}<nuw><%for.body>

There is still a stride it seems (+20), but it is not as simple of an expression as an LSCEVAddRec which LoopDataPrefetch can handle.

The reasoning in getMinPrefetchStride() tries to establish cases where software prefetching seems likely to be beneficial: too many (more than 16) prefetches may by themselves put an extra load on the LSU which could be bad. If the relatively few prefetches cover many more (more than 32) memory accesses, and there are not any other (non-prefetched) memory accesses that stress the memory systems to make prefetching less likely beneficial, then emit the prefetches. It doesn't seem right to bail just for one non-prefetched access, so it seems reasonable to allow a relatively very small amount of non-prefetched instructions, to make the heuristic more stable. This patch suggests allowing 1 non-prefetched memory access per 32 prefetched ones. This handles LBM, and also gives two prefetches to imagick which however do not seem to play a role.

Alternatively, it might be possible to find a refined heuristic that would differntiate between "stride not found" instructions and "stride found, but not handled" (like the one above). Out of the non-prefetched instructions (NumMemAccesses - NumStridedMemAccesses), there are those that have some kind of known stride, and then also those that do not have any kind of stride at all. A stride would load the h/w prefetcher, which may make s/w prefetching more desirable, but there is also the general load of the memory systems to consider...

Measuring LBM built with e00f189 and the one immediately before (last commit where prefetching happened as intended), it seems that if -min-prefetch-stride=1 is used on both versions, e00f189 may be slightly preferred (34% speedup vs 32%). So e00f189 doesn't seem to be necessarily bad by itself.

Diff Detail

Event Timeline

jonpa created this revision.Dec 9 2020, 5:20 PM
jonpa requested review of this revision.Dec 9 2020, 5:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2020, 5:20 PM

It doesn't seem right to bail just for one non-prefetched access, so it seems reasonable to allow a relatively very small amount of non-prefetched instructions, to make the heuristic more stable. This patch suggests allowing 1 non-prefetched memory access per 32 prefetched ones. This handles LBM, and also gives two prefetches to imagick which however do not seem to play a role.

This sounds reasonable to me, but this is the kind of question that in the end only measurement can decide ... What is the overall effect of this patch on benchmarks? Anything else beyond lbm and imagick?

jonpa added a comment.EditedDec 10 2020, 11:30 AM

It doesn't seem right to bail just for one non-prefetched access, so it seems reasonable to allow a relatively very small amount of non-prefetched instructions, to make the heuristic more stable. This patch suggests allowing 1 non-prefetched memory access per 32 prefetched ones. This handles LBM, and also gives two prefetches to imagick which however do not seem to play a role.

This sounds reasonable to me, but this is the kind of question that in the end only measurement can decide ... What is the overall effect of this patch on benchmarks? Anything else beyond lbm and imagick?

Only two files changed in total, one each for lbm/imagick.

uweigand accepted this revision.Dec 11 2020, 12:30 AM

OK, I think this is fine then.

This revision is now accepted and ready to land.Dec 11 2020, 12:30 AM