I found that adding prefetch instructions greatly improves the performance of LBM. Currently clang emits no PFDs in the hot loop, whereas gcc does.
I experimented with some parameters to the LoopDataPrefetch pass:
-min-prefetch-stride: "Min stride to add prefetches", default is 2048 for SystemZ.
-loop-prefetch-writes: enables prefetching for stores in addition to loads (default off).
-prefetch-distance: "Number of instructions to prefetch ahead", default is 2000 for SystemZ.
The stride of the accesses in the hot loop of LBM is 160, and the most important prefetch was that of the stores. So to improve LBM, I had to pass -min-prefetch-stride=160 and -loop-prefetch-writes. For this particular benchmark it also seemed slightly better to pass -prefetch-distance=4000.
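For illustration, here is a source-level sketch of what prefetching both the loads and the stores of a 160-byte-stride loop amounts to. It is not the LBM kernel and not what the pass emits: the Cell layout (20 doubles), the function scaleCells and the look-ahead kItersAhead are hypothetical, and __builtin_prefetch is the GCC/Clang builtin.

```cpp
#include <cstddef>

// Hypothetical cell layout: 20 doubles = 160 bytes, chosen only to match
// the 160-byte stride mentioned above.
constexpr std::size_t kCellDoubles = 20;
struct Cell { double f[kCellDoubles]; };
static_assert(sizeof(Cell) == 160, "sketch assumes 8-byte doubles");

// Hypothetical look-ahead, in loop iterations.
constexpr std::size_t kItersAhead = 16;

// Scales every cell of src into dst, prefetching the cells that will be
// read and written kItersAhead iterations later.
void scaleCells(const Cell* src, Cell* dst, std::size_t n, double w) {
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kItersAhead < n) {
            __builtin_prefetch(&src[i + kItersAhead], 0); // prefetch for read
            __builtin_prefetch(&dst[i + kItersAhead], 1); // prefetch for write
        }
        for (std::size_t k = 0; k < kCellDoubles; ++k)
            dst[i].f[k] = w * src[i].f[k];
    }
}
```

The second prefetch corresponds to what -loop-prefetch-writes enables; without that option only the load stream would be prefetched.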
On the whole of SPEC, this is the number of pfd instructions emitted:
SPEC 2006 (z14): prefetches

                 Loads   Stores
gcc z14           1330     1841
current clang      394       18   (mvc loop expansions)
(B)               1650      632
(C)                583      261
(D)               3343     1060

(B) -min-prefetch-stride=128 -loop-prefetch-writes -prefetch-distance=4000
(C) -min-prefetch-stride=128 -loop-prefetch-writes -prefetch-distance=4000 -max-prefetch-iters-ahead=75
(D) -min-prefetch-stride=80 -loop-prefetch-writes -prefetch-distance=4000
In these initial experiments B, C and D all improve LBM by ~15% without affecting the other benchmarks much. C has an extra limit (-max-prefetch-iters-ahead=75), which makes for a smaller overall change compared to trunk.
This patch now reflects these initial experiments, although the settings are not necessarily optimal yet.
One additional small improvement to the LoopDataPrefetch pass that I could see is to not emit prefetches in loops where the known constant trip count is smaller than the "iterations ahead" of the prefetch. This actually removed a few PFDs in the output: (D) becomes 3178 / 969 (loads / stores). I guess this could be committed separately; it would be the three lines with 'LoopConstantTripCount' and the new test.
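The trip-count check described above can be modelled roughly as follows. This is a standalone sketch, not the code in the patch: the function names and the convention that a trip count of 0 means "unknown" are assumptions made for the example, as is deriving "iterations ahead" by dividing the prefetch distance by the loop size in instructions.

```cpp
#include <algorithm>

// Hypothetical model: how many iterations ahead a prefetch lands, given a
// prefetch distance and a loop body size, both counted in instructions.
unsigned itersAhead(unsigned prefetchDistance, unsigned loopSize) {
    if (loopSize == 0)
        return 1; // avoid dividing by zero for an empty loop body
    return std::max(1u, prefetchDistance / loopSize);
}

// Returns false when the loop has a known constant trip count that is
// smaller than the look-ahead, so every prefetched address would lie
// beyond the last iteration. constTripCount == 0 means "not known".
bool worthPrefetching(unsigned prefetchDistance, unsigned loopSize,
                      unsigned constTripCount) {
    unsigned ahead = itersAhead(prefetchDistance, loopSize);
    if (constTripCount != 0 && constTripCount < ahead)
        return false;
    return true;
}
```

With a distance of 4000 and a 100-instruction loop body, the look-ahead is 40 iterations, so a loop known to run only 8 times would be skipped while a loop with an unknown trip count would still get prefetches.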