I found that the LBM benchmark benefits significantly from enabling prefetching of stores in the hot loop with '-mllvm -min-prefetch-stride=128 -mllvm -loop-prefetch-writes' on SystemZ. I then observed that the same memory addresses are accessed on different paths in different iterations. Currently, the LoopDataPrefetch pass emits one prefetch and then skips adding another for the next nearby or identical access, without considering that the two accesses may lie on different paths. I therefore tried making sure that the single prefetch actually dominates both accesses, moving it if necessary. From what I can see so far, on SystemZ at least, this patch gives a further 7-8% improvement on LBM (matching gcc)...
Is this generally a good idea? The prefetch distance heuristic may get slightly thrown off by this in some bigger loops, I guess.
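To illustrate the access pattern described above, here is a minimal source-level sketch in C. It is only an analogue of the situation, not code from the patch: the pass works on LLVM IR, and the function names, the loop body, and the PREFETCH_AHEAD distance are all hypothetical. It assumes a GCC or Clang compiler, since it uses the __builtin_prefetch builtin. The first function mimics a prefetch that covers only one of the two paths storing to the same address; the second places it in the block that dominates both stores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in elements; not taken from the patch. */
enum { PREFETCH_AHEAD = 16 };

/* Prefetch on one path only: iterations that take the else-path store to
   dst[i] without that prefetch having been issued on their path. */
void relax_prefetch_one_path(double *dst, const unsigned char *flag, size_t n) {
  assert(dst != NULL && flag != NULL);
  for (size_t i = 0; i < n; ++i) {
    if (flag[i]) {
      if (i + PREFETCH_AHEAD < n)
        __builtin_prefetch(&dst[i + PREFETCH_AHEAD], 1); /* 1 = prefetch for write */
      dst[i] = dst[i] * 0.5 + 1.0;
    } else {
      dst[i] = dst[i] * 2.0; /* same address, different path */
    }
  }
}

/* Prefetch moved to the start of the loop body, where it dominates
   the stores on both paths. */
void relax_prefetch_dominating(double *dst, const unsigned char *flag, size_t n) {
  assert(dst != NULL && flag != NULL);
  for (size_t i = 0; i < n; ++i) {
    if (i + PREFETCH_AHEAD < n)
      __builtin_prefetch(&dst[i + PREFETCH_AHEAD], 1);
    if (flag[i])
      dst[i] = dst[i] * 0.5 + 1.0;
    else
      dst[i] = dst[i] * 2.0;
  }
}
```

Both functions compute the same result; only the prefetch placement differs. In a larger loop body, moving the prefetch earlier also changes how far ahead of the access it is issued, which is the concern about the distance heuristic raised above.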