This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Try to avoid inserting duplicate s_inst_prefetch
ClosedPublic

Authored by critson on Apr 11 2022, 9:27 PM.

Details

Summary

Check for existing s_inst_prefetch instructions when
configuring prefetches during loop alignment.

Diff Detail

Event Timeline

critson created this revision.Apr 11 2022, 9:27 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 9:27 PM
critson requested review of this revision.Apr 11 2022, 9:27 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 9:27 PM
critson updated this revision to Diff 422120.Apr 11 2022, 9:49 PM

Rebase on to pre-committed test.

foad added a comment.Apr 12 2022, 2:40 AM

Thank you! I had noticed this before but never got around to investigating properly.

foad added inline comments.Apr 12 2022, 2:42 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12316

Surely the condition should be PreTerm == Pre->begin() || ...?

12322

Likewise.

critson marked 2 inline comments as done.Apr 12 2022, 3:35 AM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12316

A fair point, I had not consider we still want to insert when this is the beginning of the block.

critson updated this revision to Diff 422162.Apr 12 2022, 3:35 AM
critson marked an inline comment as done.
  • Incorporate reviewer comments
foad accepted this revision.Apr 12 2022, 3:37 AM
This revision is now accepted and ready to land.Apr 12 2022, 3:37 AM
rampitec accepted this revision.Apr 12 2022, 9:31 AM

Thanks! Did we ever run any benchmarking on this? I have written this before actual HW was available.

Thanks! Did we ever run any benchmarking on this? I have written this before actual HW was available.

I was curious so did some quick investigation on GFX10.1 (Navi10).

For graphics at the macro scale, I cannot see any performance impact from entirely disabling generation of s_inst_prefetch instructions on our test suite.

Setting up a micro benchmark, I can see a >20% performance uplift setting an appropriate mode, and >20% performance drop for setting an inappropriate mode via s_inst_prefetch.
So these instructions definitely matter, but its an open question if we are using them effectively -- at least they don't seem to be hurting performance.
Additionally the cost of back-to-back s_inst_prefetch is the same as s_nop, so I would not expect to see change in performance for this patch, just saving a few redundant scalar instructions.

Thanks! Did we ever run any benchmarking on this? I have written this before actual HW was available.

I was curious so did some quick investigation on GFX10.1 (Navi10).

For graphics at the macro scale, I cannot see any performance impact from entirely disabling generation of s_inst_prefetch instructions on our test suite.

Setting up a micro benchmark, I can see a >20% performance uplift setting an appropriate mode, and >20% performance drop for setting an inappropriate mode via s_inst_prefetch.
So these instructions definitely matter, but its an open question if we are using them effectively -- at least they don't seem to be hurting performance.
Additionally the cost of back-to-back s_inst_prefetch is the same as s_nop, so I would not expect to see change in performance for this patch, just saving a few redundant scalar instructions.

Thanks Carl! I would suggest this should really matter if we have a loop in a certain range, not too small so it doesn't fit into I$ entirely, not to large to be evicted anyway. It can be somewhat tricky to measure the impact. Certainly nested loops may expose an impact. Maybe compute shaders have better chances to fall into that range.

This revision was landed with ongoing or failed builds.Apr 14 2022, 12:21 AM
This revision was automatically updated to reflect the committed changes.