This is an archive of the discontinued LLVM Phabricator instance.

[LoopDataPrefetch + SystemZ] Let target decide on prefetching on a per loop basis
ClosedPublic

Authored by jonpa on Nov 14 2019, 3:14 AM.

Details

Summary

This patch adds

  • New arguments to getMinPrefetchStride() to let the target decide on a per-loop basis if software prefetching should be done even with a stride within the limit of the hw prefetcher.
  • New TTI hook enableWritePrefetching() to let a target do write prefetching by default (defaults to false).

LoopDataPrefetch:

  • A search through the whole loop to gather information before emitting any prefetches. This way the target can get information via new arguments to getMinPrefetchStride() and emit prefetches more selectively. Collected information includes: Does the loop have a call, how many memory accesses, how many of them are strided, how many prefetches will cover them. This is NFC to before as long as the target does not change its definition of getMinPrefetchStride().
  • If a previous access to the same exact address was 'read', and the current one is 'write', make it a 'write' prefetch.
  • If two accesses that are covered by the same prefetch do not dominate each other, put the prefetch in a block that dominates both of them.
  • If a ConstantMaxTripCount is less than ItersAhead, then skip the loop.

SystemZ:

  • increase the distance of prefetching (to meet the hot lbm loop with prefetching 9 iterations ahead).
  • enable write prefetching by default.
  • emit sw prefetching for any stride according to new heuristics in getMinPrefetchStride(), which includes lbm.
  • Do we need a test to test getMinPrefetchStride()?

Diff Detail

Event Timeline

jonpa created this revision.Nov 14 2019, 3:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 14 2019, 3:14 AM
jonpa marked 2 inline comments as done.Nov 14 2019, 3:18 AM
jonpa added inline comments.
llvm/lib/Transforms/Scalar/LoopDataPrefetch.cpp
233

The Prefetch struct MemI member is used solely for debug/ORE output. It is currently not guarded by '#ifndef NDEBUG', but maybe it should be, even though it's just one pointer? The debug output could perhaps be improved also, I merely tried to keep what there was while also printing the new values gathered. The MemI member is not needed for anything else, so if we could change the debug output instead perhaps it's not even needed, or?

251

I am trusting DT->findNearestCommonDominator() and DomBB->getTerminator() to do this, but I am not 100% sure that there is always a terminator in each block...?

jonpa updated this revision to Diff 249387.Mar 10 2020, 8:15 AM

Ping! Patch rebased.

We see that lbm gets a significant speedup with this patch on SystemZ.

Ping!

Adding members of the Loop Optimization Working Group, as I do believe this patch deserves some attention.

reames resigned from this revision.Mar 25 2020, 11:10 AM
jonpa added a comment.Mar 26 2020, 9:10 AM

ping!

Did anyone try to enable write prefetching on lbm on another target than SystemZ?

@greened is looking into generating the prefetcher properties using tblgen (D58736). Would this be compatible with that approach?

I like the clean test cases. Unfortunely, I don't have experience with prefetching in the lbm benchmark.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1324–1327

Could you add explanations for the individual parameters to the doxygen comment?

llvm/lib/Transforms/Scalar/LoopDataPrefetch.cpp
233

I'd appreciate if it was guarded by #ifndef NDEBUG. It would make it more explicit and enforcing to be only used in debug builds.

238

[style] Please add empty lines between methods.

Also, add a doxygen comment about what this method and its parameters (and the class itself) are supposed to do.

244–245

[style] please clang-format the patch.

jonpa updated this revision to Diff 253826.Mar 31 2020, 3:46 AM
jonpa marked 4 inline comments as done.

Thanks for review! Patch updated.

@greened is looking into generating the prefetcher properties using tblgen (D58736). Would this be compatible with that approach?

I see that his patch is as well adding the option to specify write-prefetching per default, so that is a common goal. From what I can see that patch doesn't have any other functional changes to LoopDataPrefetch, so it seems complimentary. Personally, I do think that there is a need to do tuning in C++ for a given target, like my patch here for SystemZ - I am not sure if greened is aiming to generalize cache systems and in the future implement some kind of common heuristics for all? I would hope then that it would still be possible for each target to make its own decisions. I have a feeling that these systems are so complex that they are not quite predictable with generalized parameters.

Unfortunely, I don't have experience with prefetching in the lbm benchmark.

It goes over a huge memory array with both loads and stores. Enabling store prefetching on it seems like a good idea on any architecture...

llvm/lib/Transforms/Scalar/LoopDataPrefetch.cpp
233

I don't think that's possible given that it's also used by the ORE, or?

Meinersbur added inline comments.Mar 31 2020, 6:43 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
884

I am confused with the description of NumPrefetches. It looks like its passed the number of different memory accesses in the loop. "Number of prefetches needed for the strided accesses." sounds like hardware requirement, but its passed a property of the code. Can you clarify?

llvm/lib/Transforms/Scalar/LoopDataPrefetch.cpp
233

Well, then it's not only for debug output as the comment says. Remove the part in parens?

I'd appreciate if the other class members had doxygen comments as well.

237
jonpa updated this revision to Diff 254441.Apr 2 2020, 12:41 AM
jonpa marked 4 inline comments as done.

Patch updated per review.

Comments expanded. I hope multiple line doxygen comments are ok for a parameter?

Meinersbur accepted this revision.Apr 2 2020, 3:11 AM

Comments expanded. I hope multiple line doxygen comments are ok for a parameter?

Yes; I usually indent/align them, although not done automatically by clang format. For instance, the doxygen for getMinPrefetchStride could be improved:

/// Some HW prefetchers can handle accesses up to a certain constant stride.
/// Sometimes prefetching is beneficial even below the HW prefetcher limit,
/// and the arguments provided are meant to serve as a basis for deciding this
/// for a particular loop.
///
/// \param NumMemAccesses        Number of memory accesses in the loop.
/// \param NumStridedMemAccesses Number of the memory accesses that
///                              ScalarEvolution could find a known stride
///                              for.
/// \param NumPrefetches         Number of software prefetches that will be
///                              emitted as determined by the addresses
///                              involved and the cache line size.
/// \param HasCall               True if the loop contains a call.
///
/// \return This is the minimum stride in bytes where it makes sense to start
///         adding SW prefetches. The default is 1, i.e. prefetch with any
///         stride.

However, there is a tradeoff involved in number off lines changed in this diff, consistency with the other docygen comments in the same file, number of likes for the entire doxygen comment.

This revision is now accepted and ready to land.Apr 2 2020, 3:11 AM
jonpa added a comment.Apr 2 2020, 5:57 AM

Comments expanded. I hope multiple line doxygen comments are ok for a parameter?

Yes; I usually indent/align them, although not done automatically by clang format. For instance, the doxygen for getMinPrefetchStride could be improved:

/// Some HW prefetchers can handle accesses up to a certain constant stride.
/// Sometimes prefetching is beneficial even below the HW prefetcher limit,
/// and the arguments provided are meant to serve as a basis for deciding this
/// for a particular loop.
///
/// \param NumMemAccesses        Number of memory accesses in the loop.
/// \param NumStridedMemAccesses Number of the memory accesses that
///                              ScalarEvolution could find a known stride
///                              for.
/// \param NumPrefetches         Number of software prefetches that will be
///                              emitted as determined by the addresses
///                              involved and the cache line size.
/// \param HasCall               True if the loop contains a call.
///
/// \return This is the minimum stride in bytes where it makes sense to start
///         adding SW prefetches. The default is 1, i.e. prefetch with any
///         stride.

However, there is a tradeoff involved in number off lines changed in this diff, consistency with the other docygen comments in the same file, number of likes for the entire doxygen comment.

I see :-) Thanks for the review.

This revision was automatically updated to reflect the committed changes.