This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Use module level register maximums for unknown callees
ClosedPublic

Authored by arsenm on Jan 14 2022, 3:11 PM.

Details

Reviewers
rampitec
sebastian-ne
Group Reviewers
Restricted Project
Summary

Compute the theoretical register budget based on the IR function
signature/attributes, and use the global maximum register budgets for
unknown callees.

This should fix the kernel reported register usage in the presence of
indirect calls. The previous fix in
2b08f6af62afbf32e89a6a392dbafa92c62f7bdf was incorrect becauset it was
only taking the maximum in the known call graph, and missing something
that was either outside of it or codegened later.

This fixes a second case I discovered where calls to aliases also did
not work as expected. CallGraphAnalysis misses these, so functions
called through aliases were not codegened ahead of callers as
expected. CallGraphAnalysis should probably be fixed to understand
this case, and there's likely a bug with IPRA here. This fixes
numerous failures in the conformance test at -O0.

Diff Detail

Event Timeline

arsenm created this revision.Jan 14 2022, 3:11 PM
arsenm requested review of this revision.Jan 14 2022, 3:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 14 2022, 3:11 PM
Herald added a subscriber: wdng. · View Herald Transcript
rampitec accepted this revision.Jan 14 2022, 3:23 PM
This revision is now accepted and ready to land.Jan 14 2022, 3:23 PM
sebastian-ne requested changes to this revision.Jan 17 2022, 2:31 AM
sebastian-ne added inline comments.
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

This is over-approximating the vgpr_count whenever an indirect call is involved, which is quite a performance hit.

Can we switch AMDGPUResourceUsageAnalysis to a ModulePass and run propagateIndirectCallRegisterUsage at the end, so that all functions with indirect calls will get the maximum VGPR count of all functions in the module?
(As opposed to max VGPR count of the SCC that is used currently, which I did not intend.)

This revision now requires changes to proceed.Jan 17 2022, 2:31 AM
arsenm added inline comments.Jan 17 2022, 8:33 AM
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

I'd like to move switching to a module pass into a follow up patch. I'm a bit afraid of unintended side effects by switching to a module pass. We're already paying a compile time cost by using SCC codegen, and module passes will be worse

sebastian-ne added inline comments.Jan 17 2022, 9:17 AM
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

If the over-approximation goes in, we’ll have to revert it in our graphics branch, otherwise any benchmarks or optimization work that is going on would be meaningless (we had to locally revert D103636 for that reason).
So, I’d prefer it if we could fix this in one push, if you think that’s possible.

arsenm added inline comments.Jan 17 2022, 9:18 AM
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

This is just completely broken as is, any benchmarks are working by accident

sebastian-ne added inline comments.Jan 17 2022, 9:24 AM
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

The current graphics use-case is not affected by this bug. We’re only compiling single functions per LLVM module and finding the maximum register usage and linking is done by the loader (PAL in this case).

arsenm added inline comments.Jan 17 2022, 9:25 AM
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

So then it also wouldn't be impacted by this change?

sebastian-ne added inline comments.Jan 17 2022, 9:38 AM
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

The compiled modules contain only a single function, but the function contains indirect functions calls. So it hits the code path here that uses getMaxNumVGPRs(F) to approximate the register usage.
The VGPR usage for the compiled function should just stay as it is, even if it contains an indirect call, because there is no other function (in the module) that could be called.

I tested this patch on a pipeline and the reported VGPR usage goes from 140 VGPRs to 256 VGPRs, noticeably reducing the occupancy.

As said, I’d prefer if this does not go in without an accompanying patch that makes the VGPR usage accurate again.
If you think that’s unreasonable, I’ll unblock this patch and revert it in our graphics branch until the follow up patch.

arsenm added inline comments.
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

I will push them both at the same time, but I'd like to defensively keep this and D117504 as separate commits in case something goes wrong

sebastian-ne added inline comments.
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
184

Thank you, I’ll have a look tomorrow.

sebastian-ne accepted this revision.Feb 2 2022, 1:12 AM

Forgot to accept as amdgpu last time.

This revision is now accepted and ready to land.Feb 2 2022, 1:12 AM