This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Elide module lds allocation in kernels with no callees
ClosedPublic

Authored by JonChesterfield on Mar 19 2022, 6:13 PM.

Details

Summary

Introduces a string attribute, amdgpu-requires-module-lds, to allow
eliding the module.lds block from kernels. Will allocate the block as before
if the attribute is missing or has its default value of true.

Patch uses the new attribute to detect the simplest possible instance of this,
where a kernel makes no calls and thus cannot call any functions that use LDS.

Tests updated to match, coverage was already good. Interesting cases is in
lower-module-lds-offsets where annotating the kernel allows the backend to pick
a different (in this case better) variable ordering than previously. A later
patch will avoid moving kernel variables into module.lds when the kernel can
have this attribute, allowing optimal ordering and locally unused variable
elimination.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptMar 19 2022, 6:13 PM
JonChesterfield requested review of this revision.Mar 19 2022, 6:13 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 19 2022, 6:13 PM

@ronlieb this is the patch I'd like to run through the internal CI

The patch passed internal CI.

I don't like using an attribute for this (especially one set to a negative value). Ultimately the passes need to understand how much will be allocated.

Plus, it turns out you can't rely so heavily on the call graph analysis for correctness since it doesn't see calls through aliases.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
380

I don't think this actually works for indirect calls (or even calls through aliases)

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
36

You can just do getFnAttribute, you don't need a second query

The other passes do know how much is allocated. I don't want to go digging around the IR looking for the llvm.donothing call, which iirc is removed before instruction selection

I can invert the sense of the attribute if you deem that important. Call it amdgpu-no-module-lds-needed or similar, with the same allocate when missing attribute behaviour.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
380

Indirect calls are represented as an edge from 'outside', I'll check that external edge is accounted under size.

Indirect calls in general need careful handling, but this initial patch is only checking for any calls at all.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
36

That returns 'false' for a missing attribute though, and I want the default state to be 'true'

The other passes do know how much is allocated. I don't want to go digging around the IR looking for the llvm.donothing call, which iirc is removed before instruction selection

I don't understand this patch at all then.

I can invert the sense of the attribute if you deem that important. Call it amdgpu-no-module-lds-needed or similar, with the same allocate when missing attribute behaviour.

It should be inverted (and doesn't need to have a string value), although it still feels like this isn't an attribute

The current behaviour is to allocate an instance of llvm.amdgcn.module.lds.t at address zero in every kernel in the IR module if the global llvm.amdgcn.module.lds is present. That instance only needs to be allocated in kernels which subsequently call a function which uses it.

This patch revises the control flow approximation from 'any kernel can call any function' to 'kernels that make calls can call any function'. That means kernels where everything was inlined don't automatically allocate the module.lds, though presently they'll still allocate it somewhere (not necessarily at zero) if they use a variable that was moved into it. That limitation is solvable by specialising variables with respect to kernels, at which point kernels that make no calls will lower LDS optimally.

Either we allocate this instance in every kernel or for the subset which require it (which this patch approximates better than all, but not as well as it could). If we're going to allocate it for a subset we need to indicate that somehow. I.e. to implement calleeRequiresModuleLDS(). I don't mind if that's an attribute or something else, what would you prefer?

  • invert sense of attribute
arsenm added inline comments.Mar 23 2022, 9:18 AM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
391

You don't need to have a value (no true), just set the attribute. I think all the bool-as-string variable are an antipattern that for some reason spread to a subset of attributes

  • Drop string bool parameter and unreliable test
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
391

Ah, that explains the getValueAsBool behaviour. Thanks

llvm/test/CodeGen/AMDGPU/lower-module-lds-offsets.ll
64

Sadly I think the ordering here is too unstable to hit from a test case, dropping it in the next diff.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
380

Yep, counted as expected. Indirect and normal calls both increment size by one. N calls to the same function increment by N.

If I understand this patch correctly, the main intention of this patch is to check if the entry point function makes any call(s) to device functions, and if so, create and allocate module lds. Otherwise don't.

What happens to the case where the entry point function makes call(s) to device functions, but none of them use lds?

What happens to the case where the entry point function makes call(s) to device functions, but none of them use lds?

As written here, no distinction is made between calls that use LDS and calls that don't. That's the obvious next refinement after this lands. This patch is minimal in the hope of separating the plumbing and naming choices from more complicated call graph walking.

@arsenm there's a bunch of stuff to do that builds on this. What further changes would you like?

arsenm added inline comments.Apr 4 2022, 9:37 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
36

hasFnAttribute without the key value

JonChesterfield added a subscriber: bcahoon.

Ron thought @bcahoon might be interested, adding as reviewer

arsenm added inline comments.Apr 4 2022, 11:29 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
60

Probably not much value in caching this in MFI

60

I guess might as well leave it since we have the others (although I'm not a huge fan of these cached fields we don't serialize)

  • rebase, and getFnAttr -> hasFnAttr

Didn't notice the hasFnAttribute request comment, change made. Anything else?

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
69

Not sure following the existing pattern makes sense here. Passing the function to allocateModuleLDSGlobal then checking the attribute in place is probably better.

  • Module->function, skip caching the attribute

Drop the caching of the value and pass in function instead of module. No functional change to previous diff, simpler code.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
60

It makes some WIP stuff cleaner to remove this, updating

arsenm accepted this revision.May 4 2022, 1:59 PM
This revision is now accepted and ready to land.May 4 2022, 1:59 PM
This revision was landed with ongoing or failed builds.May 4 2022, 2:42 PM
This revision was automatically updated to reflect the committed changes.