This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Implement lds kernel id intrinsic
ClosedPublic

Authored by JonChesterfield on May 5 2022, 4:46 PM.

Details

Summary

Implement an intrinsic for use lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.

There are a number of implicit arguments accessed by intrinsic already
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.

It is necessary in the general case to put variables at different addresses
such that they can be compactly allocated and thus necessary for an
indirect function call to have some means of determining where a
given variable was allocated. Claiming an arbitrary SGPR into which
an integer can be written by the kernel, in this implementation based
on metadata associated with that kernel, which is then passed on to
indirect call sites is sufficient to determine the variable address.

The intent is to emit a __const array of LDS addresses and index into it.

Diff Detail

Event Timeline

JonChesterfield created this revision.May 5 2022, 4:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 4:46 PM
JonChesterfield requested review of this revision.May 5 2022, 4:46 PM
Herald added a reviewer: sstefan1. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
arsenm added a comment.May 6 2022, 7:20 AM

I have to look closer but my main concern is about adding an SGPR argument for this. It doesn’t correspond to a real kernel input, and we didn’t add this to the new ABI register layout proposal. What happened to using some kind of relocation for the kernel ID?

As far as I understand this requires a whole program compilation and will not work with late linking?

  • Rebase on main, append metadata to kernels in lower module lds pass

Revision just uploaded annotates all kernels with an integer. That doesn't need to land with the intrinsic but it makes it much easier to drive runtime tests through this code so it's there for now.

I have to look closer but my main concern is about adding an SGPR argument for this. It doesn’t correspond to a real kernel input, and we didn’t add this to the new ABI register layout proposal. What happened to using some kind of relocation for the kernel ID?

I need to amend the new ABI layout page internally, and possibly pick a different number to burn. Relocation/loader patch doesn't work for this case as a given function may be called from multiple different kernels. In the happy case where a function is only callable from one kernel, the intrinsic should be constant folded (todo self, test/implement that) and still doesn't want a relocation.

As far as I understand this requires a whole program compilation and will not work with late linking?

Depends what you mean by linking really. Because this wires the variable accesses to an internal-linkage table lookup in IR, renaming that table when combining IR later on actually works fine. Each kernel will have an assigned index that makes sense in the context of the only table it can see.

The module.lds at address zero trick will break if it's renamed on linking, so that'll need to become something more sophisticated than a string comparison. Also the kernel lds struct named after the current function will break if the kernel is renamed and the struct not. So there's a few edges around incrementally lowering LDS (e.g. allowing calling the lowering IR pass repeatedly) that should be patched up.

If you mean a kernel compiled to ISA calling a function in some other module compiled to ISA which uses LDS, that doesn't work as written. The lookup and enumeration extends relatively easily - tag the table with appending linkage and provide a linker symbol for the start position for the current elf - but the allocation in the kernel doesn't. I'm not yet sure if there's a link time optimisation available or if we should treat calls between code objects and calls between isa in elf modules equivalently (via load time relocation).

This or equivalent is required to allocate LDS at different addresses from different kernels which is required to minimise LDS requirements. I'm going to assume the design is acceptable as it matches what we're already doing and noone has said otherwise, therefore will write the associated test changes.

  • Update tests, rebase
JonChesterfield retitled this revision from [amdgpu][wip] Implement lds kernel id intrinsic to [amdgpu] Implement lds kernel id intrinsic.Jul 4 2022, 10:53 AM
JonChesterfield edited the summary of this revision. (Show Details)
  • cleanup for review
JonChesterfield edited the summary of this revision. (Show Details)Jul 4 2022, 11:04 AM

Changes to tests were mechanical. Expecting a minor performance hit for code using indirect calls which resist analysis by the attribute propagator because sgpr15 is now reserved.

An alternative is to dedicate some LDS to the same purpose and store to it from the kernel, but accessing that LDS variable from an indirect function in order to find the other LDS is relatively complicated (put it at a fixed address and codegen loads/stores) and will inevitably lead to an application that wants to use all the LDS on the machine without sacrificing those four bytes. We could dedicate a vgpr instead but sgprs are cheaper and it will hold a __const value which is better represented by a scalar load.

Which sgpr is used doesn't matter but changing it will respin the noisy test updates. I'd much prefer we land this using 15, chosen as the next one in sequence, and then change it along with the others if necessary.

arsenm added inline comments.Jul 5 2022, 9:05 AM
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
171

This absolutely should not get a ClangBuiltin. Also probably shouldn't include "lds" in the name llvm.amdgcn.compiler.kernel.id? Probably should add a comment that this is for internal compiler uses and users should not rely on it

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4207

Can build the constant directly into DstReg

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
106

Why '.' and '_'?

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
739

This doesn't apply since it's not a real user SGPR?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1674

No else before return

arsenm added inline comments.Jul 5 2022, 11:26 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
106

What is this metadata for? Where does it come from?

Sorry Matt, missed the review comments a week ago. Will address them shortly.

  • Address review comments
JonChesterfield marked 5 inline comments as done.Jul 12 2022, 8:55 AM

Fixes applied. This whole patch is the machinery to tag kernels with a uint32_t at compile time then access it from a function at runtime, using the existing machinery for implicit SGPR arguments.

What is this metadata for? Where does it come from?

The following patch (which is functional but not review ready) annotates each kernel that uses LDS with an arbitrary unique integer stored in this metadata. As of this patch, that's only in test cases.

Said patch also creates an array of LDS addresses in __constant memory and initializes it to match the unique integers stored in said metadata.

Finally the intrinsic introduced here is used to look up which set of addresses corresponds to the kernel that is running the current function in order to find where a given variable is allocated.

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
171

It should have lds in the name because it's currently only used by lds. Specifically I don't want to enumerate kernels that don't need this indirection as that would increase the size of a lookup table.

If this machinery for identifying a kernel at runtime proves to be more generally useful, we can rename these variables and move the enumeration pass out of lowermodulelds when that use case arises.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
739

It's not a system sgpr either. I'll name it differently but it's still a preloaded sgpr in the sense that it's an sgpr that gets initialised by mostly hidden magic when necessary.

The alternative seems to be duplicating the current machinery for passing magic hidden registers across function calls and I'd really prefer to hook into the existing subsystem.

JonChesterfield marked 2 inline comments as done.
  • S to s in comment
  • Update three more test cases, result of rebasing on main
arsenm added inline comments.Jul 13 2022, 9:29 AM
llvm/lib/Target/AMDGPU/AMDGPUArgumentUsageInfo.h
106

Needs a comment explaining this isn't a real user SGPR defined in the ABI

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
106

probably should have gone the other way, all . instead of _

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
530–536 ↗(On Diff #443999)

This shouldn't get initialized in frame lowering. This should have been emitted as an outgoing call argument only, this isn't a reserved register

arsenm added inline comments.Jul 13 2022, 9:33 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
530–536 ↗(On Diff #443999)

The value chosen is also likely clobbering an incoming user sgpr

  • Replace frame lowering with call lowering
JonChesterfield marked 2 inline comments as done.Jul 15 2022, 11:05 AM
JonChesterfield added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
530–536 ↗(On Diff #443999)

Moving this to call lowering is much cleaner, thanks.

arsenm added inline comments.Jul 16 2022, 7:33 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
106

I don't like using metadata to pass information from a lowering pass to here, but right now I don't have a better solution

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
108

Why not return the int value?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
2827

This should get a test in abi-attribute-hints-undefined-behavior.ll

  • Constant to Optional, add case to undefined-behaviour test
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
108

ConstantInt is the return type of mdconst::extract and this gives a cheap/lazy way to test for the metadata missing (returning nullptr). Replaced with std::optional<uint32_t> for more explicit/verbose uses at the call sites.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
2827

Added the clause, but obv no change to codegen as it lowers to undef on missing

  • implicit return constructor is ok
arsenm accepted this revision.Jul 19 2022, 9:26 AM
This revision is now accepted and ready to land.Jul 19 2022, 9:26 AM
This revision was landed with ongoing or failed builds.Jul 19 2022, 9:46 AM
This revision was automatically updated to reflect the committed changes.