This is an archive of the discontinued LLVM Phabricator instance.

[LoopDataPrefetch, SystemZ] Increase the amount of prefetching.
Needs Review | Public

Authored by jonpa on Sep 19 2019, 3:40 AM.

Details

Summary

I found that adding prefetch instructions greatly improved the performance of LBM. Currently clang emits no PFD instructions in the hot loop, whereas gcc does.

I experimented with some parameters to the LoopDataPrefetch pass:

-min-prefetch-stride: "Min stride to add prefetches", default is 2048 for SystemZ.
-loop-prefetch-writes: enables prefetching for stores in addition to loads (default off).
-prefetch-distance: "Number of instructions to prefetch ahead", default is 2000 for SystemZ.

The stride of the accesses in the loop in LBM is 160, and the most important prefetch was that of the stores. So to improve LBM, I had to pass both -min-prefetch-stride=160 and -loop-prefetch-writes. For this particular benchmark, also passing -prefetch-distance=4000 seemed slightly better.
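To make the access pattern concrete, here is a minimal hand-written sketch of a loop with a 160-byte stride that prefetches both the loads and the stores a few iterations ahead. It is not LBM code and not the output of the pass: the cell layout (20 doubles), the function name scale_cells and the look-ahead kCellsAhead are assumptions chosen for illustration. It uses the GCC/Clang builtin __builtin_prefetch.

```cpp
#include <cstddef>
#include <vector>

// Assumed cell layout: 20 doubles = 160 bytes, matching the stride quoted above.
constexpr std::size_t kEntriesPerCell = 20;

// Hypothetical look-ahead in cells; the pass would derive its own value
// from -prefetch-distance and the size of the loop.
constexpr std::size_t kCellsAhead = 16;

// Writes factor * src into dst, one cell per outer iteration.
void scale_cells(const std::vector<double>& src, std::vector<double>& dst,
                 double factor) {
    dst.resize(src.size());
    const std::size_t cells = src.size() / kEntriesPerCell;
    for (std::size_t c = 0; c < cells; ++c) {
        const std::size_t base = c * kEntriesPerCell;
        // Only prefetch addresses that are still inside the arrays.
        if (c + kCellsAhead < cells) {
            const std::size_t ahead = base + kCellsAhead * kEntriesPerCell;
            __builtin_prefetch(&src[ahead], 0);  // prefetch for a later load
            __builtin_prefetch(&dst[ahead], 1);  // prefetch for a later store
        }
        for (std::size_t e = 0; e < kEntriesPerCell; ++e)
            dst[base + e] = src[base + e] * factor;
    }
}
```

The second prefetch corresponds to what -loop-prefetch-writes enables; without that option only the load side would be covered.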

On the whole of SPEC, this is the number of pfd instructions emitted:

SPEC 2006 (z14):
prefetches          Loads          Stores
gcc z14              1330            1841
current clang         394              18  (mvc loop expansions)
(B)                  1650             632
(C)                   583             261
(D)                  3343            1060

(B) -min-prefetch-stride=128 -loop-prefetch-writes -prefetch-distance=4000
(C) -min-prefetch-stride=128 -loop-prefetch-writes -prefetch-distance=4000 -max-prefetch-iters-ahead=75
(D) -min-prefetch-stride=80  -loop-prefetch-writes -prefetch-distance=4000

In these initial experiments B, C and D all improve LBM by ~15% without affecting other benchmarks much. C has an extra limit, which makes for a smaller overall change compared to trunk.

This patch now reflects these initial experiments, although the settings are not necessarily optimal yet.

One additional small improvement of the LoopDataPrefetch pass I could see is to not emit prefetches in loops where the known constant trip count is smaller than the "iterations ahead" of the prefetch. This actually removed a few pfd instructions in the output: (D) becomes 3178 / 969. I guess this could be committed separately. This would be the three lines with 'LoopConstantTripCount' and the new test.
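The decision logic described in this summary can be sketched as a standalone function. This is a simplified model, not the pass's actual code: the struct and function names are hypothetical, and it assumes that "iterations ahead" is the prefetch distance divided by the loop's instruction count, which follows from the distance being counted in instructions.

```cpp
#include <cstdint>

// Hypothetical summary of one loop with a strided access.
struct LoopSketch {
    unsigned instructionCount;   // size of the loop body in instructions
    unsigned constantTripCount;  // 0 means "not a known constant"
    std::int64_t strideBytes;    // stride of the access, may be negative
};

// Hypothetical mirror of the command-line options discussed above.
struct PrefetchParams {
    unsigned prefetchDistance;  // -prefetch-distance
    unsigned minStride;         // -min-prefetch-stride
    unsigned maxItersAhead;     // -max-prefetch-iters-ahead
};

// Assumption: iterations ahead = distance / loop size, at least 1.
unsigned itersAhead(const LoopSketch& loop, const PrefetchParams& p) {
    unsigned n = p.prefetchDistance / loop.instructionCount;
    return n == 0 ? 1u : n;
}

bool shouldPrefetch(const LoopSketch& loop, const PrefetchParams& p) {
    if (loop.instructionCount == 0)
        return false;
    const std::int64_t absStride =
        loop.strideBytes < 0 ? -loop.strideBytes : loop.strideBytes;
    if (absStride < static_cast<std::int64_t>(p.minStride))
        return false;  // stride too small to be worth a prefetch
    const unsigned ahead = itersAhead(loop, p);
    if (ahead > p.maxItersAhead)
        return false;  // would have to prefetch too many iterations ahead
    // The proposed extra check: the loop ends before the prefetched
    // iteration would ever be reached.
    if (loop.constantTripCount != 0 && loop.constantTripCount < ahead)
        return false;
    return true;
}
```

With a distance of 4000 and a 100-instruction loop, this model gives 40 iterations ahead, so a loop with a known trip count of 10 would get no prefetch while one with an unknown trip count still would.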

Diff Detail

Event Timeline

jonpa created this revision. Sep 19 2019, 3:40 AM