The performance improvement on LBM previously achieved with improved software prefetching (36d4421) seems to have gone lost with https://reviews.llvm.org/D88789. This commit (e00f189) removes a canonicalization that changed 'load double -> store double' to use i64 instead. I notice that there now is one memory access in the loop that LoopDataPrefetch cannot handle, while before there was none. This means that when getMinPrefetchStride() is called on LBM_performStreamCollideTRT NumStridedMemAccesses=92 but NumMemAccesses=93, so the test fails.
That memory instruction gets a SCEV that is not an LSCEVAddRec, but looks like
((8 * r({0,+,20}<nuw><nsw><%for.body> + %.sink571))<nsw> + %dstGrid)
, which is similar to one that is, e.g.
{(152 + %srcGrid)<nuw>,+,160}<nuw><%for.body>
There is still a stride it seems (+20), but it is not as simple of an expression as an LSCEVAddRec which LoopDataPrefetch can handle.
The reasoning in getMinPrefetchStride() tries to establish cases where software prefetching seems likely to be beneficial: too many (more than 16) prefetches may by themselves put an extra load on the LSU which could be bad. If the relatively few prefetches cover many more (more than 32) memory accesses, and there are not any other (non-prefetched) memory accesses that stress the memory systems to make prefetching less likely beneficial, then emit the prefetches. It doesn't seem right to bail just for one non-prefetched access, so it seems reasonable to allow a relatively very small amount of non-prefetched instructions, to make the heuristic more stable. This patch suggests allowing 1 non-prefetched memory access per 32 prefetched ones. This handles LBM, and also gives two prefetches to imagick which however do not seem to play a role.
Alternatively, it might be possible to find a refined heuristic that would differntiate between "stride not found" instructions and "stride found, but not handled" (like the one above). Out of the non-prefetched instructions (NumMemAccesses - NumStridedMemAccesses), there are those that have some kind of known stride, and then also those that do not have any kind of stride at all. A stride would load the h/w prefetcher, which may make s/w prefetching more desirable, but there is also the general load of the memory systems to consider...
Measuring LBM built with e00f189 and the one immediately before (last commit where prefetching happened as intended), it seems that if -min-prefetch-stride=1 is used on both versions, e00f189 may be slightly preferred (34% speedup vs 32%). So e00f189 doesn't seem to be necessarily bad by itself.