This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Implement lower function LDS pass
ClosedPublic

Authored by JonChesterfield on Jan 13 2021, 7:40 PM.

Details

Summary

[amdgpu] Implement lower function LDS pass

Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.
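
For illustration only (not the patch's exact code; struct, index, and helper names are assumed), the replacement described above amounts to building a constantexpr GEP into the single struct instance and RAUW'ing the old variable with it:

#include "llvm/IR/Constants.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Type.h"

using namespace llvm;

// SGV is the new struct-typed LDS global; Idx is the field assigned to the old
// variable. Every use of the old variable becomes this constant expression.
static Constant *fieldAddress(LLVMContext &Ctx, GlobalVariable *SGV, unsigned Idx) {
  Type *I32 = Type::getInt32Ty(Ctx);
  Constant *GEPIdx[] = {ConstantInt::get(I32, 0), ConstantInt::get(I32, Idx)};
  return ConstantExpr::getGetElementPtr(SGV->getValueType(), SGV, GEPIdx);
}
// ... then: OldGV->replaceAllUsesWith(fieldAddress(Ctx, SGV, Idx));
//           OldGV->eraseFromParent();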

Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern __shared__ construct from HIP and variables marked
constant that survive the optimizer.

This is of interest to the OpenMP project because the deviceRTL runtime library
uses CUDA shared variables from functions that cannot be inlined. Trunk LLVM
therefore cannot compile some OpenMP kernels for AMDGPU. In addition to the
unit tests attached, this patch, applied to ROCm LLVM with fixed-abi enabled
and the function pointer hashing scheme deleted, passes the OpenMP suite.

This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.

Diff Detail

Event Timeline

JonChesterfield requested review of this revision.Jan 13 2021, 7:40 PM
Herald added a project: Restricted Project. · View Herald Transcript
madhur13490 added inline comments.Jan 15 2021, 10:03 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
69 ↗(On Diff #316557)

Erroneous characters in the comment.

201 ↗(On Diff #316557)

Can we please split this function into logical blocks and wrap them in private functions? That would make the code more readable.

218 ↗(On Diff #316557)

Such information is useful to know, so please use debugging messages with DEBUG_TYPE. Debugging messages would help us to spot issues.
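
A minimal sketch of the kind of message being asked for, assuming the pass defines a DEBUG_TYPE (the string and the wording are illustrative); output shows up under -debug-only in asserts builds:

#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"

#define DEBUG_TYPE "amdgpu-lower-module-lds"

// Inside the pass, once GV has been chosen for lowering:
LLVM_DEBUG(dbgs() << "Moving " << GV->getName()
                  << " into the module LDS struct\n");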

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43

I think this name can be a bit more mangled; it is easy to have a name collision in the file. One option is to use a mechanism that randomly generates a string, use that in the name, and use the same algorithm when dereferencing. That may be too fancy, but a more mangled name should be used.

Needs some tests to stress different alignment scenarios. Also needs some with these globals used in weird constant initializers.

I also thought the idea was to have a constant memory table with pointers in it, not one giant LDS block

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
66–68 ↗(On Diff #316557)

I think this search isn't quite right and will miss arbitrarily nested constant expressions.

We already have similar code to analyze users in AMDGPUAnnotateKernelFeatures::visitConstantExprsRecursively, which finds constant LDS addrspacecasts anywhere they can appear.

113 ↗(On Diff #316557)

Not sure why you need to bother considering this case, if you just treat it normally it should work

188 ↗(On Diff #316557)

Definitely shouldn't be introducing inline asm, not sure why you are doing this. Also "r" is bad and we shouldn't support it

202 ↗(On Diff #316557)

Probably should move this flag into the pass pipeline in AMDGPUTargetMachine

207 ↗(On Diff #316557)

const &

224 ↗(On Diff #316557)

llvm::sort

233–234 ↗(On Diff #316557)

Should use the type alloc size. You are also ignoring the alignment of the global itself, which may differ from the type alignment
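
A sketch of the distinction being drawn here, with helper names that are mine rather than the patch's: size comes from the DataLayout alloc size, and alignment from the global's own align attribute when present, falling back to the type's ABI alignment.

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/Support/Alignment.h"

using namespace llvm;

// Field size is the allocation size, which includes trailing padding.
static uint64_t fieldAllocSize(const DataLayout &DL, const GlobalVariable &GV) {
  return DL.getTypeAllocSize(GV.getValueType());
}

// An explicit `align N` on the global wins over the type's ABI alignment.
static Align fieldAlign(const DataLayout &DL, const GlobalVariable &GV) {
  return DL.getValueOrABITypeAlignment(GV.getAlign(), GV.getValueType());
}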

269 ↗(On Diff #316557)

typeAllocSize

279–283 ↗(On Diff #316557)

"null" in the IR is just 0. This is only treated as an invalid pointer in address space 0. -1 is used as the invalid pointer value and "null" in addrspace(3) is valid. Ideally this would be a property in the datalayout

294 ↗(On Diff #316557)

Probably should use llvm.amdgcn prefix

llvm/test/CodeGen/AMDGPU/lower-function-lds-inactive.ll
1 ↗(On Diff #316557)

Should run with both new and old PM since you handled both

4–6 ↗(On Diff #316557)

Negative checks aren't particularly helpful. Needs positive checks for what's actually produced

36 ↗(On Diff #316557)

Could also use some stores, intrinsic calls, cmpxchg.

Also some more exotic users that tend to break, such as storing the value's address to itself

llvm/test/CodeGen/AMDGPU/lower-function-lds.ll
51 ↗(On Diff #316557)

We're not really handling spir_kernel anymore, should use amdgpu_kernel

JonChesterfield added a comment.EditedJan 15 2021, 11:43 AM

Great review guys, thanks! With the exception of avoiding inline asm (which I'd like to do, but am not sure how), I'll fix the rest when I'm back in the office.

I also thought the idea was to have a constant memory table with pointers in it, not one giant LDS block

That's the idea behind D91516. This one puts variables at the same address in each kernel and accesses them as cheaply as from a kernel in exchange for wasting LDS in various cases. We've slightly conflated lowering with indirect calls because D91516 doesn't handle them. This is partly simpler because it uses '0' instead of an extra function argument.

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
66–68 ↗(On Diff #316557)

The '4' limits the depth of constant expressions analysed; will change to something using heap memory. I believe I can assume constantexprs are acyclic.

This test will move variables into the struct unless it can show that is unnecessary, which is safe provided replaceAllUsesWith does the right thing. I may be missing cases on what the user can be.
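
A sketch of the heap-backed walk alluded to above, assuming the question is "is this global reachable from an instruction in a non-kernel function through any depth of constant expression?" (names and the isKernel predicate are illustrative, not the committed code):

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

static bool reachedFromNonKernel(const GlobalVariable &GV,
                                 function_ref<bool(const Function &)> isKernel) {
  SmallVector<const User *, 16> Worklist(GV.user_begin(), GV.user_end());
  SmallPtrSet<const User *, 16> Visited;
  while (!Worklist.empty()) {
    const User *U = Worklist.pop_back_val();
    if (!Visited.insert(U).second)
      continue; // constantexprs are acyclic but can share subexpressions
    if (const auto *I = dyn_cast<Instruction>(U)) {
      if (!isKernel(*I->getFunction()))
        return true;
      continue;
    }
    // Constant users (constantexprs, aggregate initializers) nest arbitrarily.
    Worklist.append(U->user_begin(), U->user_end());
  }
  return false;
}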

69 ↗(On Diff #316557)

Will rephrase. Those used by kernels and the llvm.used or llvm.compiler.used lists.

113 ↗(On Diff #316557)

It changes the constant LDS variable to a mutable one. I don't think that matters much, since the variable is undef initialized, so assumptions about it always having the same value are dubious.

I don't think any of the languages let one create a constant undef value, and if one did, it should have no uses and be erased, but I haven't verified that is the case.

Basically I'm not certain what the right thing to do with a probably-useless constant undef value is so this pass ignores them.

188 ↗(On Diff #316557)

This would be a hack. I wanted a construct that looks like a use of the instance (and won't be deleted by IR passes and generates minimal code), so that other passes will accurately account for the amount of LDS used by a kernel. Specifically promoteAlloca but I may have missed some.

An intrinsic that evaporates later would work. I haven't thought of an alternative, will see if a cleaner answer comes to mind.

(aside: what's bad about r in particular? I'm unfamiliar with our inline assembler, perhaps there's an immediate option instead)

202 ↗(On Diff #316557)

Sure, will do.

233–234 ↗(On Diff #316557)

Didn't see getTypeAllocSize, nice! getAlign is a private function that wraps the verbose getValueOrABITypeAlignment.

279–283 ↗(On Diff #316557)

I was worried by Constant::getNullValue() returning zero for addrspace(3) but it does indeed seem to work ok. Drop the comment?

arsenm added inline comments.Jan 15 2021, 11:53 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
66–68 ↗(On Diff #316557)

Probably should just use the same recursive search, even better if they are sharable

188 ↗(On Diff #316557)

Well since the allocation point isn't really fixed yet, whether this size is really correct is questionable. AMDGPUPromoteAlloca currently assumes a worst case placement for padding to avoid going over.

r is "pick any register of any class". We have a hard split between VGPRs and SGPRs, so "r" is unpredictable and not very helpful.n

279–283 ↗(On Diff #316557)

Yes

  • address some review comments
JonChesterfield marked 9 inline comments as done.EditedJan 18 2021, 9:02 AM

Fixed the easy parts. Adding a new pass manager invocation via RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-function-lds < %s | FileCheck %s caught some errors in the plumbing, now fixed.

I'm a little confused by the mixture of legacy and new pass manager. In particular, would like feedback on whether I've successfully expressed that this pass should run before PromoteAllocaToLDS.

hsmhsm added inline comments.Jan 19 2021, 6:19 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

The function AMDGPU::isModuleEntryFunctionCC() returns true for graphics shaders, SPIR (OpenCL?), etc. Is that what we expect here? Isn't it the case that we are concerned here only with the CC CallingConv::AMDGPU_KERNEL?

  • address some review comments
  • Run tests under new pass manager
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

LLC (e.g. AMDGPULegalizerInfo) uses isModuleEntryFunction to detect non-kernel use of LDS, which resolves to a function in Utils/AMDGPUBaseInfo.cpp that returns true for AMDGPU_KERNEL, SPIR_KERNEL and various calling conventions I don't recognise (AMDGPU_VS etc.).

Some opencl I compiled as a sanity check used SPIR_KERNEL as the calling convention.

I don't know whether that's right, only that this pass should use exactly the same predicate as the one guarding allocateLDSGlobal.

llvm/test/CodeGen/AMDGPU/lower-function-lds.ll
51 ↗(On Diff #316557)

An OpenCL compilation emitted spir_kernel (somewhat to my surprise) so I thought I'd go for one of each. See also a comment above about wanting to be consistent with the guards around allocateLDSGlobal

hsmhsm added inline comments.Jan 19 2021, 8:42 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

IMHO, LLC is a generic back-end for OpenCL, OpenMP, HIP, graphics shader languages, etc., so AMDGPULegalizerInfo might be using it in general, but this pass is concerned only with CallingConv::AMDGPU_KERNEL. That said, @arsenm or others on the review list could probably clarify it better.

sebastian-ne added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

If I understand the purpose of isKernelCC correctly, it should return true if the given function is able to allocate LDS (in the sense of “an entry point is allowed to allocate LDS”).
That is exactly what isModuleEntryFunctionCC returns, so this looks right to me.

arsenm added inline comments.Jan 19 2021, 12:08 PM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
215 ↗(On Diff #317575)

Probably should use stable_sort, if the values are anonymous the final compare won't provide ordering
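
For illustration, a stable sort over an assumed container of the discovered globals (container name and sort keys are mine; the point is only that unnamed values, which tie under the predicate, keep a deterministic relative order):

#include "llvm/ADT/STLExtras.h"

// FoundLocalVars: the LDS globals collected earlier; DL: the module DataLayout.
llvm::stable_sort(FoundLocalVars,
                  [&](const GlobalVariable *A, const GlobalVariable *B) {
                    uint64_t AlignA = DL.getValueOrABITypeAlignment(
                                          A->getAlign(), A->getValueType()).value();
                    uint64_t AlignB = DL.getValueOrABITypeAlignment(
                                          B->getAlign(), B->getValueType()).value();
                    if (AlignA != AlignB)
                      return AlignA > AlignB; // higher alignment first reduces padding
                    uint64_t SizeA = DL.getTypeAllocSize(A->getValueType());
                    uint64_t SizeB = DL.getTypeAllocSize(B->getValueType());
                    return SizeA > SizeB;
                  });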

  • address some review comments
  • Run tests under new pass manager
  • Extend testing, increase recur depth req
  • Remove stack depth limit

I think that's all the comments addressed other than objections to inline asm, which still passes zero to an "s"-constrained register.

The inline asm makes the kernel use the newly created LDS structure, so that passes like PromoteAlloca can see that it uses said structure instance. Some alternatives are:

  • Modify PromoteAlloca (and any other passes that use size of LDS) to look for the magic variable. Means spreading knowledge of this transform across other passes.
  • Add an IR intrinsic, SDag and GlobalISel lowering to pseudo instruction, pseudo expansion to no-op. Semantically very similar to the inline asm.
  • Metadata - mark kernels as using +N bytes of LDS beyond what their uses suggest
  • Alternative lowering / transform

There are various optimisations available, e.g. metadata to mark functions as can't-use-lds, propagated, and used to drop the 'use' of the variable from some kernels, indirection to allow putting variables at different offsets across different kernels etc.

  • add assign to self clause to inactive test
llvm/test/CodeGen/AMDGPU/lower-function-lds-inactive.ll
36 ↗(On Diff #316557)

Mixed it up a bit. As far as I can tell, ReplaceAllUsesWith works everywhere other than the compiler.used list, which looks like an oversight. The larger constexpr case has two uses of the same subexpression.

Storing value's address to itself just behaves like any other non-undef initializer, i.e. ignored by the pass. Added to the inactive.ll test set.

arsenm added inline comments.Jan 28 2021, 6:46 PM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗(On Diff #316557)

This is a tricky one. I think checking for a specific intrinsic global variable when allocating the kernel's LDS wouldn't be too bad. However, I did come up with this alternative hack:

@arst = addrspace(3) global [256 x i32] zeroinitializer

declare void @llvm.donothing()

define void @foo() {
  call void @llvm.donothing() [ "dummyuse"([256 x i32] addrspace(3)* @arst)]
  ret void
}

This at least solves the PromoteAlloca case. The use disappears before the actual codegen amount is computed, so that doesn't quite solve everything. I guess an intrinsic that takes a pointer and returns the same value would be slightly better, without changing the existing LDS lowering

102 ↗(On Diff #319995)

No reason to mention llc

142 ↗(On Diff #319995)

I feel like this should be a utility function somewhere else if it really doesn't already exist

218 ↗(On Diff #320008)

const first

291 ↗(On Diff #320008)

Doesn't really have to do with functions anymore. llvm.amdgcn.module.lds.t?

299 ↗(On Diff #320008)

.module?

331–332 ↗(On Diff #320008)

Merge to one if

  • s/function/module/g
  • Replace inline asm with donothing and operand

@arsenm addressed last round of comments. Apologies for missing it on Friday, I don't seem to be getting email from reviews.llvm. I like the donothing hack.

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
142 ↗(On Diff #319995)

Yes, but I don't want to propose it as such since that is likely to slow the patch process down further. Better to land it here and attempt to promote it later.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188

I quite like the donothing alternative to inline asm. It does indeed keep the use alive long enough.

A future change to the pipeline might break that, but it'll do so fairly obviously (all the openmp stuff stops working, for one). I think we go with annotated donothing for now, and implement an intrinsic -> pseudo sequence when/if it becomes necessary. Written a fairly long comment to that effect in the source.

  • Revert unintended test change, 0/undef
aeubanks added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

This should always be added here regardless of the flag, this is just for registering that the pass exists. Rather, the pass should also be added in registerCGSCCOptimizerLateEPCallback() below guarded by the flag.

  • Adjust pass registration
JonChesterfield added inline comments.Feb 3 2021, 9:29 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

Welcome! Thank you very much for the input, there's been more guesswork in the new pass manager parts than I like. Can't seem to find any documentation on how it works. Dropped the test here, added to CGSCC. It's a module pass and the other things there were function passes, but instantiating a new ModulePassManager seems to work fine.

aeubanks added inline comments.Feb 3 2021, 9:44 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

I should definitely write some documentation somewhere.

Adding it to a ModulePassManager you created doesn't do anything since it's not getting added to the overall pipeline. For example, the FunctionPassManager there is added to the CGSCCPassManager.
For the legacy PM it doesn't really make sense to add a module pass at EP_CGSCCOptimizerLate since it'll end up breaking the CGSCC pipeline. Normally it runs the CGSCC passes and the function pipeline on each function in an SCC as it visits the SCCs, but with a module pass in the middle those will get split up. The new PM just makes this whole thing explicit via nesting when adding passes.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

Ah, I missed the PM.addPass at the end. createCGSCCToModulePassAdaptor doesn't appear to exist, and the types of createModuleToPostOrderCGSCCPassAdaptor suggest that's wrong too.

What should I do with this pass then? It's not hugely crucial when it runs, provided it's before PromoteAlloca, which is a function pass in CGSCCOptimizerLate.

aeubanks added inline comments.Feb 3 2021, 10:41 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

Does EP_ModuleOptimizerEarly/registerPipelineEarlySimplificationEPCallback() work?

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

Sure, moving it further forward is fine. What's the difference between registerPipeline and adjustPassManager's addExtension EP_ModuleOptimizerEarly? Do I want both?

aeubanks added inline comments.Feb 3 2021, 1:03 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

adjustPassManager() is for the legacy pass manager, and registerPassBuilderCallbacks() is for the new pass manager
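
To make the distinction concrete, a rough sketch of the two hooks as they looked at the time of this review (the create function and pass class names are illustrative, and this shows the mechanism only; a later update in this thread moves the invocation into the backend pipeline instead):

// Legacy pass manager: adjustPassManager() registers an extension point callback.
Builder.addExtension(
    PassManagerBuilder::EP_ModuleOptimizerEarly,
    [](const PassManagerBuilder &, legacy::PassManagerBase &PM) {
      PM.add(createAMDGPULowerModuleLDSPass());
    });

// New pass manager: registerPassBuilderCallbacks() hooks the PassBuilder directly.
PB.registerPipelineEarlySimplificationEPCallback(
    [](ModulePassManager &MPM, PassBuilder::OptimizationLevel) {
      MPM.addPass(AMDGPULowerModuleLDSPass());
    });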

  • Better pass registration
JonChesterfield added inline comments.Feb 3 2021, 2:13 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516

Ah yes, thank you. Should be OK now.

It will be a good day when we have one pass manager and drop this duplication.

  • update pipeline test, missed locally because it requires asserts
arsenm added inline comments.Feb 4 2021, 7:33 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗(On Diff #316557)

Could also avoid this if the kernel already has a direct reference to the LDS, which it likely does

JonChesterfield added inline comments.Feb 4 2021, 7:36 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗(On Diff #316557)

Sure. We wouldn't gain anything though, specifically no saving in LDS. It also carries the slight risk that the existing direct reference to LDS gets dead code eliminated at an inconvenient time.

JonChesterfield added a comment.EditedFeb 8 2021, 12:17 AM

Ping. OpenMP is still blocked on this

arsenm added a comment.Feb 9 2021, 9:38 AM

Should add some tests where the same LDS appears in multiple functions/kernels

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188

But if there are no pre-existing uses of the LDS in the kernel, this won't end up getting allocated in the kernel

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188

If all uses of LDS are from a kernel, this pass does nothing. Otherwise:

  • every kernel gets a call to llvm.donothing (previously inline asm) that looks like a use of the per-module struct
  • every kernel allocates the size of the per-module struct, regardless of whether the llvm.donothing is present or not

See the constructor AMDGPUMachineFunction::AMDGPUMachineFunction. If the symbol llvm.amdgcn.module.lds is present, allocateLDSGlobal is called on it, before any other calls to allocateLDSGlobal in order to reliably guess that the offset returned will be zero.
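
A condensed sketch of that ordering (the lookup helper is hypothetical, since how the module variable is looked up becomes a bug later in this thread; allocateLDSGlobal is the existing AMDGPUMachineFunction interface):

// GV is the llvm.amdgcn.module.lds struct instance, if the module has one.
// Allocating it before any other LDS global pins it at offset 0.
if (const GlobalVariable *GV = lookupModuleLDS(M)) { // hypothetical lookup helper
  unsigned Offset = allocateLDSGlobal(M.getDataLayout(), *GV);
  assert(Offset == 0 && "module LDS struct expected at the start of the block");
  (void)Offset;
}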

arsenm added inline comments.Feb 10 2021, 9:15 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43

Ah, I missed this part before. However, this isn't the right place for this. I've been trying to fix having code that depends on the function itself in the MachineFunctionInfo constructor. The point this is constructed isn't well defined (i.e. this is not a pass), so depending on whether you are running a MIR pass or something this may not work as expected.

It's a bit hacky, but you could stick this allocation in LowerFormalArguments since we don't really have a better pre-lowering hook

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43

Ah. I was working on the basis that any instance of this class can be used to call allocateLDSGlobal, thus the constructor neatly catches every path. Bad assumption.

If using inline asm, we could put the allocation shortly before the two existing uses of allocateLDSGlobal (from SDag and GlobalISel), as that ensures there will be at least one reference to an LDS global from the kernel. Changing to donothing breaks that, as the call can be removed beforehand, so kernels that don't have any direct LDS uses will miss the handling.

LowerFormalArguments should work, will update to allocate it from there.

aeubanks removed a subscriber: aeubanks.Feb 10 2021, 9:49 AM
  • Move alloc out of machinefunction ctor

Should add some tests where the same LDS appears in multiple functions/kernels

var0 in lower-module-lds.ll does. I'm not particularly worried about the update misfiring as it's just a call to replaceAllUsesWith. Which turned out to miss compiler.used, but seems to cover everything else.

The tests are a bit mixed. E.g. checking that alignment padding is introduced is done via the checks on the generated type. There aren't any IR->ASM tests here; perhaps one should be added.

The most thorough testing done on this was running the OpenMP test suite using a compiler with the existing function pointer workaround deleted, which is how the compiler.used edge case was caught. That also caught the mis-accounting in PromoteAlloca and a misaligned field. I don't think the IR tests have caught anything that the runtime didn't, though they will fail more precisely if things regress.

JonChesterfield marked an inline comment as done.Feb 15 2021, 10:39 AM

Though the FileCheck testing still passes, this change no longer works. Looking at some generated IR, it appears that the pass manager registration change leads to the pass being called repeatedly, which it doesn't tolerate very well. In particular, it is run on individual translation units instead of only once at whole-program link time.

A reasonable fix is probably to make the pass safe to run repeatedly. That's probably useful future proofing, as it means the pass can run to eliminate LDS variables, then if another pass decides to introduce some new ones, this can be run again to clean up.

  • run pass only once, from backend pass manager
JonChesterfield added a comment.EditedFeb 24 2021, 3:28 PM

Working again now. Mailing list revealed that the new pass manager isn't used for backends yet, so the last patch dropped the invocation from the opt pipeline. Left the plumbing in place (so the pass can still be run with the new manager, as in the tests). When the new pass manager is used for the amdgcn backend, we can slot this pass in roughly the same place as it runs now.

With this patch and amdgpu-fixed-function-abi=true, most of the generic OpenMP kernels in the AOMP test suite pass with the function pointer hashing scheme disabled. That isn't quite the same as most generic kernels passing with trunk clang, though ones that don't use printf or malloc would be expected to.

Started on the path to making this safe to run repeatedly, with more LDS introduced in between each step. That makes the rewrite to access at 0 + offset unsafe. Would instead need to emit uses of the newly created struct (i.e. not 0 + offset), and patch allocateLDSGlobal to consider that specific variable to be fine to access from within a non-kernel function, as well as allocating it at zero as we presently do. That would be equivalent to the current scheme, except slightly more obvious what is going on in the IR, in exchange for being a more invasive change to the back end.

That change plus renaming the variable if already present would be correct for multiple invocations. Cleaner would be to also replace the module variable with scalars, SROA fashion, before starting the pass to avoid the nested struct buildup. I'd like to leave those revisions for later as this patch is already a couple of months in.

To clarify, "backend" means the backend codegen pipeline. This is part of the middle-end optimization pipeline, so it should be added there.

To clarify, "backend" means the backend codegen pipeline. This is part of the middle-end optimization pipeline, so it should be added there.

This is not an optimization, this is lowering

Oh I'm sorry, I thought this was in adjustPassManager(). Disregard my comment.

arsenm added inline comments.Mar 3 2021, 6:43 PM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142

Needs a comment

175

Needs a comment

JonChesterfield added inline comments.Mar 7 2021, 3:43 PM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142

A comment saying what? The function and helper do what the name and parameter types claim they'll do, in almost as boring a fashion as possible.

arsenm added inline comments.Mar 8 2021, 11:39 AM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142

On first read it sounds very generic, and not related to the special intrinsic global variables. I had to read the function to see what it actually did

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142

It is generic. If this lands, I'm hoping to move it under ModuleUtils to go alongside appendToUsed which it closely resembles. Probably as one entry point to remove a set/sequence of constants from llvm.used and a different entry point to remove them from llvm.compiler.used.

As it's far from certain whether this will land, I don't want to propose a function for ModuleUtils with no users, as it'll be rightly rejected as dead code.
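
For reference, a minimal sketch of what the removal half of such a utility could look like (my illustration of the shape of a ModuleUtils-style entry point, not the code under review; the llvm.used twin is analogous and elided):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;

static void removeFromCompilerUsed(Module &M,
                                   const SmallPtrSetImpl<GlobalValue *> &ToRemove) {
  GlobalVariable *Used =
      M.getGlobalVariable("llvm.compiler.used", /*AllowInternal=*/true);
  if (!Used)
    return;
  auto *Init = dyn_cast<ConstantArray>(Used->getInitializer());
  if (!Init)
    return;
  SmallVector<GlobalValue *, 8> Keep;
  for (Value *Op : Init->operands()) {
    auto *GV = cast<GlobalValue>(Op->stripPointerCasts());
    if (!ToRemove.count(GV))
      Keep.push_back(GV);
  }
  // Rebuild the array with the survivors, mirroring appendToCompilerUsed.
  Used->eraseFromParent();
  if (!Keep.empty())
    appendToCompilerUsed(M, Keep);
}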

arsenm accepted this revision.Mar 11 2021, 6:04 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142

Committing utilities before uses is not unheard of

This revision is now accepted and ready to land.Mar 11 2021, 6:04 PM
aeubanks removed a subscriber: aeubanks.Mar 11 2021, 6:09 PM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
233

collectUsedGlobalVariables was removed by a patch following D97128, fixing up.
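
A sketch of adapting to the updated interface, assuming the post-D97128 signature fills a SmallVector of GlobalValue pointers rather than a set:

SmallVector<GlobalValue *, 32> Vec;
SmallPtrSet<GlobalValue *, 32> UsedList;
collectUsedGlobalVariables(M, Vec, /*CompilerUsed=*/false);
collectUsedGlobalVariables(M, Vec, /*CompilerUsed=*/true);
UsedList.insert(Vec.begin(), Vec.end());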

  • use new api for collectUsedGlobalVariables
This revision was automatically updated to reflect the committed changes.

Note to self - there is ongoing interest in minimising the LDS usage of applications. This patch allocates the struct in every kernel (see the call to markUsedByKernel, it is applied exactly once to each kernel), in order to support calls to functions that make use of that struct.

This could be refined. Kernels that make no calls don't need to unconditionally allocate this struct. If the kernel itself does use some LDS that was moved into it, that use will remain and suffice to trigger allocation of the struct as normal. More difficult to compute (one for the attributor?), kernels that call no functions that could refer to that struct also don't need to allocate it.

A simplified variant on @hsmhsm's proposal: an LDS variable that is used from an internal function that has not had its address taken could be passed into the function by pointer from the caller, ultimately leaving the &var, i.e. the use of that variable, in the top-level kernel. Access to that variable would be slower than in this patch (an extra dereference, and loss of an argument register to propagate the address down the call tree) but it would move the variable out of the combined struct for a saving of LDS in other kernels. For large variables and scarce LDS that is probably a win.

See also a note further up about maintaining the name of the variable through the IR, instead of using '0' directly, as that would make the IR easier to read. Particularly useful if we end up refining this pass further.

Just ran a bug to ground here. Replacing the inline asm with the donothing intrinsic is prettier, but also doesn't work as is. This is a flaw in the above that was only tested at the IR and application level. It shows up quickly (if opaquely) at the executable unit test scale.

Given:

target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn-amd-amdhsa"

@global_barrier_state = hidden addrspace(3) global i32 undef, align 4

define i32 @rw() #0 {
entry:
  %0 = atomicrmw add i32 addrspace(3)* @global_barrier_state, i32 1 acq_rel, align 4
  ret i32 %0
}

define amdgpu_kernel void @__device_start() {
entry:
  %0 = call i32 @rw()
  ret void
}

attributes #0 = { noinline  }

This transform does exactly what it was intended to (the LDS variable is allocated at zero), but the kernel metadata starts:

.amdhsa_kernel __device_start
  .amdhsa_group_segment_fixed_size 0 ; should be 4, isn't
.end_amdhsa_kernel

If the inline asm is reintroduced, that goes to 4. Similarly, if the test case that reduced to this is modified to allocate 4 bytes more LDS than the metadata asks for, it works again.

I suspect there is something in hardware that rounds LDS allocation up to a boundary, so as long as the kernel looks like it uses some non-zero amount of LDS, the out of bounds read hits in the allocated region.

t-tye added a comment.May 13 2021, 8:23 AM

I suspect there is something in hardware that rounds LDS allocation up to a boundary, so as long as the kernel looks like it uses some non-zero amount of LDS, the out of bounds read hits in the allocated region.

Yes the LDS size is rounded up as described in the GRANULATED_LDS_SIZE field in the compute_pgm_rsrc2 table at:

https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc2-gfx6-gfx10-table

Just ran a bug to ground here. Replacing the inline asm with the donothing intrinsic is prettier, but also doesn't work as is. This is a flaw in the above that was only tested at the IR and application level. It shows up quickly (if opaquely) at the executable unit test scale.

Yes, I thought I realized this before: we're not going to see the use in codegen to allocate it. However I thought this was fixed by using the special intrinsic global variable which we always allocate?

JonChesterfield added a comment.EditedMay 13 2021, 12:21 PM

Nice reference to GRANULATED_LDS_SIZE, thanks!

We unconditionally allocate the module variable from lowerFormalArgumentsKernel, which still looks right to me. My current theory is there's some hook between that and the metadata writer that needs to be poked from the above code and isn't, but I haven't worked through the metadata setup code yet. Superficially it looks like it is keyed off the AMDGPUMachineFunction.h, but in that case it should be working. Going to need a debug build I think.

Looking back I see

The use disappears before the actual codegen amount is computed, so that doesn't quite solve everything

which correlates strongly with this bug, though I didn't make the connection at the time.

Inline asm does keep the use alive long enough to reach the metadata in the binary. An intrinsic would doubtless achieve the same if it was eliminated late enough. Need to find out what late enough is to see how much plumbing that requires.

Worth noting given the recent discussions about LDS usage that this patch puts the module variable in every kernel. If the allocation was pinned to the presence of the intrinsic, or if there was an attribute for no-module-lds-needed-in-this-kernel, that could be eliminated.

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

The loop for (const Argument &Arg : F.args()) {} guards the sdag entry, and the same expression guards the gisel entry.

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

Not seeing how it isn't called with no arguments? It should still be called anyway

JonChesterfield added a comment.EditedMay 19 2021, 2:39 PM

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

Not seeing how it isn't called with no arguments? It should still be called anyway

I expected that too, but debug statements around allocateLDSGlobal didn't fire, and looking at the control flow around the sdag and globalisel paths lowerFormalArguments is called from within a loop for (const Argument &Arg : F.args()) {}. I may be missing something of course (didn't put a lot of time into chasing this), but it definitely looks like lowerFormalArguments doesn't get called when there are no arguments.

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

Not seeing how it isn't called with no arguments? It should still be called anyway

I expected that too, but debug statements around allocateLDSGlobal didn't fire, and looking at the control flow around the sdag and globalisel paths lowerFormalArguments is called from within a loop for (const Argument &Arg : F.args()) {}. I may be missing something of course (didn't put a lot of time into chasing this), but it definitely looks like lowerFormalArguments doesn't get called when there are no arguments.

Hi Jon,

For the 0-argument kernel, did you put a debug statement within SITargetLowering::LowerFormalArguments() and test whether it is hit or not? My experimentation shows that it is indeed hit for 0-arg kernels too, so it is not a problem with the 0-arg kernel.

The problem is within the function AMDGPUMachineFunction::allocateModuleLDSGlobal() that you wrote.

The statement

GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds");

is supposed to be replaced by

GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds", true);

My understanding is: for Module to successfully find an internal global variable in the symbol table, we need to explicitly tell it to do so by passing "true" as an additional arg to getGlobalVariable(). Otherwise, the internal symbol won't be found.
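
For reference, the relevant Module accessors (paraphrased from llvm/IR/Module.h); getNamedGlobal is the always-find-it spelling this thread eventually settles on:

// GlobalVariable *Module::getGlobalVariable(StringRef Name, bool AllowInternal = false);
// GlobalVariable *Module::getNamedGlobal(StringRef Name); // == getGlobalVariable(Name, true)

// Either of these finds the internal-linkage struct instance:
GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds",
                                          /*AllowInternal=*/true);
GlobalVariable *GV2 = M->getNamedGlobal("llvm.amdgcn.module.lds");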


Further, you should perhaps also add one lit test like the one below: call llc and make sure that @llvm.amdgcn.module.lds is allocated at address 0.

@llvm.amdgcn.module.lds = internal unnamed_addr addrspace(3) global [16 x i8] undef, align 16

define amdgpu_kernel void @kern() {
  %llvm.amdgcn.module.lds.bc = bitcast [16 x i8] addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*
   store i8 6, i8 addrspace(3)* %llvm.amdgcn.module.lds.bc, align 16

  ret void
}

CMD: llc -march=amdgcn -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 --amdhsa-code-object-version=4 lds-allocation2.ll -o tmp.s

assembly generated:

v_mov_b32_e32 v0, 0
v_mov_b32_e32 v1, 6
ds_write_b8 v0, v1
s_endpgm

We should probably CHECK for the instruction pattern v_mov_b32_e32 v0, 0.

hsmhsm added inline comments.May 20 2021, 5:36 AM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
179

Every GlobalVariable should be a Constant (ref: https://llvm.org/doxygen/classllvm_1_1Constant.html). Then why do we need dyn_cast<> and an if conditional check here? Can't we directly cast<> to Constant?

hsmhsm added inline comments.May 20 2021, 5:43 AM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
303

Will the logic within the above *for* loop ensure that the first *non-padding* (real) member of the struct is accessed at the same address as the struct instance?

Hi Jon,

For the 0-argument kernel, did you put a debug statement within SITargetLowering::LowerFormalArguments() and test whether it is hit or not? My experimentation shows that it is indeed hit for 0-arg kernels too, so it is not a problem with the 0-arg kernel.

Could be. I'm very limited in the time I can spend on this, debugging was a few spare minutes.

The problem is within the function AMDGPUMachineFunction::allocateModuleLDSGlobal() that you wrote.

The statement

GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds");

is supposed to be replaced by

GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds", true);

API docs agree. This will be a problem in AMDGPULowerModuleLDS::removeFromUsedList as well; the getGlobalVariable at the entry to the function will be ignoring internal variables, so it also needs the ", true".

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
179

We don't need dyn_cast here, cast is fine

303

it'll be at zero, so, yes

JonChesterfield added a comment.EditedDec 12 2021, 12:11 PM

I can't find any evidence that the bug discussed in the comments here was fixed. Checking quickly now, it looks like LowerFormalArguments is called correctly, getGlobalVariable will ignore internal variables, and the module LDS is created with internal linkage. I think that's still broken, will add it to the todo list for this week.

The repro above that used to emit .amdhsa_group_segment_fixed_size 0 now emits .amdhsa_group_segment_fixed_size 4 as it should; I'm unclear where the behaviour changed. Will debug through.

edit: Was fixed in passing by https://reviews.llvm.org/D102882 by replacing getGlobalVariable with getNamedGlobal, will check in the above repro as a regression test.