This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Set amdgpu-memory-bound if a basic block has dense global memory access
ClosedPublic

Authored by abinavpp on Jul 14 2022, 5:14 AM.

Details

Summary

AMDGPUPerfHintAnalysis doesn't set the memory bound attribute if
FuncInfo::InstCost outweighs MemInstCost even if we have a basic block
with relatively high global memory access. GCNSchedStrategy could revert
optimal scheduling in favour of occupancy which seems to degrade
performance for some kernels. This change introduces the
HasDenseGlobalMemAcc metric in the heuristic that makes the analysis
more conservative in these cases.

This fixes SWDEV-334259/SWDEV-343932

Diff Detail

Event Timeline

abinavpp created this revision.Jul 14 2022, 5:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2022, 5:14 AM
abinavpp requested review of this revision.Jul 14 2022, 5:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2022, 5:14 AM

This looks generally reasonable to me. Although I am thinking about a potential huge kernel with only a single block like this...
I guess it comes down to the question if that block is hot or cold. There is no perfect heuristic out there, but chances are higher a block is hot if it is in a loop.
Is the actual block you are considering in the regressed app inside a loop? Is that a "small single bb loop", i.e. when the bb branches right to itself? This is more likely to be a "dense memory access code" (and you will not need LoopInfo).

I still think this pass should be replaced with a machine IR analysis

llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
201

Maybe this should be inverted to be if the value is live out?

331

Weird use of ternary operator to use true here

I guess it comes down to the question if that block is hot or cold. There is no perfect heuristic out there, but chances are higher a block is hot if it is in a loop.

I agree that we should try to guess how frequent the block with dense global
memory access would be executed.

Is the actual block you are considering in the regressed app inside a loop? Is that a "small single bb loop", i.e. when the bb branches right to itself? This is more likely to be a "dense memory access code" (and you will not need LoopInfo).

Yes, the block is in a loop in the regressed kernel, but unfortunately it
doesn't branch to itself. So I guess we need LoopInfo. Unfortunately, I don't
think there's any way to get the LoopInfo analysis from a CallGraphSCCPass in
the legacy pass manager. I think we could use the
FunctionAnalysisManagerCGSCCProxy in the new pass manager
(https://llvm.org/docs/NewPassManager.html#using-analyses). Do you see any
problem in adding the new pass manager here? Or did I miss any other way to do
this in the legacy pass manager for now?

llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
201

Did you mean to rename this in terms of live out, or return false if there's a
use of I outside its block? I'm not sure if the live out is relevant here.

Is the actual block you are considering in the regressed app inside a loop? Is that a "small single bb loop", i.e. when the bb branches right to itself? This is more likely to be a "dense memory access code" (and you will not need LoopInfo).

Yes, the block is in a loop in the regressed kernel, but unfortunately it
doesn't branch to itself. So I guess we need LoopInfo. Unfortunately, I don't
think there's any way to get the LoopInfo analysis from a CallGraphSCCPass in
the legacy pass manager. I think we could use the
FunctionAnalysisManagerCGSCCProxy in the new pass manager
(https://llvm.org/docs/NewPassManager.html#using-analyses). Do you see any
problem in adding the new pass manager here? Or did I miss any other way to do
this in the legacy pass manager for now?

No, I do not like to bring a new analysis here and a new PM either. Maybe we should try this as is.

rampitec accepted this revision.Jul 15 2022, 10:59 AM

Let's see what will it bring and tune more when/if needed.

This revision is now accepted and ready to land.Jul 15 2022, 10:59 AM
abinavpp updated this revision to Diff 445404.Jul 18 2022, 12:33 AM
  • Rebase
  • Don't compute GlobalMemAccPercentage if HasDenseGlobalMemAcc is already set for the function
  • Replace ternary expression with if
abinavpp marked an inline comment as done.Jul 18 2022, 12:35 AM
abinavpp added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
201

Are we good with this?

arsenm added inline comments.Jul 18 2022, 3:32 PM
llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
201

I mean invert the logic to see if the only users are outside the block. It doesn't really matter

abinavpp added inline comments.Jul 19 2022, 2:45 AM
llvm/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
201

We need to check for the user in the block regardless of the presence of users outside it, so I guess we can keep it this way.

This revision was landed with ongoing or failed builds.Jul 19 2022, 2:47 AM
This revision was automatically updated to reflect the committed changes.