Page MenuHomePhabricator

[amdgpu] Implement lower function LDS pass
Needs ReviewPublic

Authored by JonChesterfield on Jan 13 2021, 7:40 PM.

Details

Summary

[amdgpu] Implement lower function LDS pass

Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.

Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern shared construct from hip and variables marked
constant which survive the optimizer.

This is of interest to the openmp project because the deviceRTL runtime library
uses cuda shared variables from functions that cannot be inlined. Trunk llvm
therefore cannot compile some openmp kernels for amdgpu. In addition to the
unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled
and the function pointer hashing scheme deleted passes the openmp suite.

This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
JonChesterfield marked 9 inline comments as done.EditedJan 18 2021, 9:02 AM

Fixed the easy parts. Adding a call to the new pass manager by RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-function-lds < %s | FileCheck %s caught some errors in the plumbing, now fixed.

I'm a little confused by the mixture of legacy and new pass manager. In particular, would like feedback on whether I've successfully expressed that this pass should run before PromoteAllocaToLDS.

hsmhsm added inline comments.Jan 19 2021, 6:19 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

The function AMDGPU::isModuleEntryFunctionCC() rerturns true for graphics, shaders, SPIR (OpeCL?), etc. Is it what we expect here? Is not it that we are concerned here only with the CC - CallingConv::AMDGPU_KERNEL?

  • address some review comments
  • Run tests under new pass manager
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

LLC (e.g. AMDGPULegalizerInfo) uses isModuleEntryFunction to detect non-kernel use of LDS, which resolves to a function in Utils/AMDGPUBaseInfo.cpp that returns true for AMDGPU_KERNEL, SPIR_KERNEL and various calling conventions I don't recognise. AMDGPU_VS etc.

Some opencl I compiled as a sanity check used SPIR_KERNEL as the calling convention.

I don't know whether that's right, only that this pass should use exactly the same predicate as the one guarding allocateLDSGlobal.

llvm/test/CodeGen/AMDGPU/lower-function-lds.ll
51 ↗(On Diff #316557)

An OpenCL compilation emitted spir_kernel (somewhat to my surprise) so I thought I'd go for one of each. See also a comment above about wanting to be consistent with the guards around allocateLDSGlobal

hsmhsm added inline comments.Jan 19 2021, 8:42 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

IMHO, LLC is a generic back-end for OpenCL, OpenMP, HIP, Graphic Shader language, etc. So, AMDGPULegalizerInfo might be generally using it. but, this pass is concerned with only CallingConv::AMDGPU_KERNEL. That said, probably @arsenm or others in the review list could better clarify it.

sebastian-ne added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗(On Diff #317373)

If I understand the purpose of isKernelCC correctly, it should return true if the given function is able to allocate LDS (in the sense of “an entry point is allowed to allocate LDS”).
That is exactly what isModuleEntryFunctionCC returns, so this looks right to me.

arsenm added inline comments.Jan 19 2021, 12:08 PM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
215 ↗(On Diff #317575)

Probably should use stable_sort, if the values are anonymous the final compare won't provide ordering

  • address some review comments
  • Run tests under new pass manager
  • Extend testing, increase recur depth req
  • Remove stack depth limit

I think that's all the comments addressed other than objections to inline asm, which still passes zero to a "s" constrained register.

The inline asm makes the kernel use the newly created LDS structure, so that passes like PromoteAlloca can see that it uses said structure instance. Some alternatives are:

  • Modify PromoteAlloca (and any other passes that use size of LDS) to look for the magic variable. Means spreading knowledge of this transform across other passes.
  • Add an IR intrinsic, SDag and GlobalISel lowering to pseudo instruction, pseudo expansion to no-op. Semantically very similar to the inline asm.
  • Metadata - mark kernels as using +N bytes of LDS beyond what their uses suggest
  • Alternative lowering / transform

There are various optimisations available, e.g. metadata to mark functions as can't-use-lds, propagated, and used to drop the 'use' of the variable from some kernels, indirection to allow putting variables at different offsets across different kernels etc.

  • add assign to self clause to inactive test
llvm/test/CodeGen/AMDGPU/lower-function-lds-inactive.ll
36 ↗(On Diff #316557)

Mixed it up a bit. As far as I can tell, ReplaceAllUsesWith works everywhere other than the compiler.used list, which looks like an oversight. The larger constexpr case has two uses of the same subexpression.

Storing value's address to itself just behaves like any other non-undef initializer, i.e. ignored by the pass. Added to the inactive.ll test set.

arsenm added inline comments.Jan 28 2021, 6:46 PM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗(On Diff #316557)

This is a tricky one. I think checking for a specific intrinsic global variable when allocating the kernel's LDS wouldn't be too bad. However, I did come up with this alternative hack:

@arst = addrspace(3) global [256 x i32] zeroinitializer

declare void @llvm.donothing()

define void @foo() {
  call void @llvm.donothing() [ "dummyuse"([256 x i32] addrspace(3)* @arst)]
  ret void
}

This at least solves the PromoteAlloca case. The use disappears for the actual codegen amount so that doesn't quite solve everything. I guess an intrinsic that takes a pointer and returns the same value would be slightly better without changing the existing LDS lowering

102 ↗(On Diff #319995)

No reason to mention llc

142 ↗(On Diff #319995)

I feel like this should be a utility function somewhere else if it really doesn't already exist

218 ↗(On Diff #320008)

const first

291 ↗(On Diff #320008)

Doesn't really have to do with functions anymore. llvm.amdgcn.module.lds.t?

299 ↗(On Diff #320008)

.module?

331–332 ↗(On Diff #320008)

Merge to one if

  • s/function/module/g
  • Replace inline asm with donothing and operand

@arsenm addressed last round of comments. Apologies for missing it on Friday, I don't seem to be getting email from reviews.llvm. I like the donothing hack.

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
142 ↗(On Diff #319995)

Yes, but I don't want to propose it as such since that is likely to slow the patch process down further. Better to land it here and attempt to promote it later.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188

I quite like the donothing alternative to inline asm. It does indeed keep the use alive long enough.

A future change to the pipeline might break that, but it'll do so fairly obviously (all the openmp stuff stops working, for one). I think we go with annotated donothing for now, and implement an intrinsic -> pseudo sequence when/if it becomes necessary. Written a fairly long comment to that effect in the source.

  • Revert unintended test change, 0/undef
aeubanks added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

This should always be added here regardless of the flag, this is just for registering that the pass exists. Rather, the pass should also be added in registerCGSCCOptimizerLateEPCallback() below guarded by the flag.

  • Adjust pass registration
JonChesterfield added inline comments.Wed, Feb 3, 9:29 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

Welcome! Thank you very much for the input, there's been more guesswork in the new pass manager parts than I like. Can't seem to find any documentation on how it works. Dropped the test here, added to CGSCC. It's a module pass and the other things there were function passes, but instantiating a new ModulePassManager seems to work fine.

aeubanks added inline comments.Wed, Feb 3, 9:44 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

I should definitely write some documentation somewhere.

Adding it to a ModulePassManager you created doesn't do anything since it's not getting added to the overall pipeline. For example, the FunctionPassManager there is added to the CGSCCPassManager.
For the legacy PM it doesn't really make sense to add a module pass at EP_CGSCCOptimizerLate since it'll end up breaking the CGSCC pipeline. Normally it runs the the CGSCC passes and the function pipeline on each function in an SCC as it visits the SCCs, but with a module pass in the middle those will get split up. The new PM just makes this whole thing explicit via nesting when adding passes.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

Ah, I missed the PM.addPass at the end. createCGSCCToModulePassAdaptor doesn't appear to exist, and the types of createModuleToPostOrderCGSCCPassAdaptor suggest that's wrong too.

What should I do with this pass then? It's not hugely crucial when it runs, provided it's before PromoteAlloca, which is a function pass in CGSCCOptimizerLate.

aeubanks added inline comments.Wed, Feb 3, 10:41 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

Does EP_ModuleOptimizerEarly/registerPipelineEarlySimplificationEPCallback() work?

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

Sure, moving it further forward is fine. What's the difference between registerPipeline and adjustPassManager's addExtension EP_ModuleOptimizerEarly? Do I want both?

aeubanks added inline comments.Wed, Feb 3, 1:03 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

adjustPassManager() is for the legacy pass manager, and registerPassBuilderCallbacks() is for the new pass manager

  • Better pass registration
JonChesterfield added inline comments.Wed, Feb 3, 2:13 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
515

Ah yes, thank you. Should be OK now.

It will be a good day when we have one pass manager and drop this duplication.

  • update pipeline test, missed locally because it requires asserts
arsenm added inline comments.Thu, Feb 4, 7:33 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗(On Diff #316557)

Could also avoid this if the kernel already has a direct reference to the LDS, which it likely does

JonChesterfield added inline comments.Thu, Feb 4, 7:36 AM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗(On Diff #316557)

Sure. We wouldn't gain anything though, specifically no saving in LDS. It also carries the slight risk that the existing direct reference to LDS gets dead code eliminated at an inconvenient time.

JonChesterfield added a comment.EditedMon, Feb 8, 12:17 AM

Ping. Openmp is still blocked on this

arsenm added a comment.Tue, Feb 9, 9:38 AM

Should add some tests where the same LDS appears in multiple functions/kernels

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188

But if there are no pre-existing uses of the LDS in the kernel, this won't end up getting allocated in the kernel

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188

If all uses of LDS are from a kernel, this pass does nothing. Otherwise:

  • every kernel gets a call to llvm.donothing (previously inline asm) that looks like a use of the per-module struct
  • every kernel allocates the size of the per-module struct, regardless of whether the llvm.donothing is present or not

See the constructor AMDGPUMachineFunction::AMDGPUMachineFunction. If the symbol llvm.amdgcn.module.lds is present, allocateLDSGlobal is called on it, before any other calls to allocateLDSGlobal in order to reliably guess that the offset returned will be zero.

arsenm added inline comments.Wed, Feb 10, 9:15 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43

Ah, I missed this part before. However, this isn't the right place for this. I've been trying to fix having code that depends on the function itself in the MachineFunctionInfo constructor. The point this is constructed isn't well defined (i.e. this is not a pass), so depending on whether you are running a MIR pass or something this may not work as expected.

It's a bit hacky, but you could stick this allocation in LowerFormalArguments since we don't really have a better pre-lowering hook

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43

Ah. I was working on the basis that any instance of this class can be used to call allocateLDSGlobal, thus the constructor neatly catches every path. Bad assumption.

If using inline asm, we could put the allocation shortly before the two existing uses of allocateLDSGlobal (from SDag and GlobalISel), as that ensures there will be at least one reference to an LDS global from the kernel. Changing to donothing breaks that, as the call can be removed beforehand, so kernels that don't have any direct LDS uses will miss the handling.

LowerFormalArguments should work, will update to allocate it from there.

aeubanks removed a subscriber: aeubanks.Wed, Feb 10, 9:49 AM
  • Move alloc out of machinefunction ctor

Should add some tests where the same LDS appears in multiple functions/kernels

var0 in lower-module-lds.ll does. I'm not particularly worried about the update misfiring as it's just a call to replaceAllUsesWith. Which turned out to miss compiler.used, but seems to cover everything else.

The tests are a bit mixed. E.g. checking the alignment padding is introduced is in the checks on the generated type. There aren't any IR->ASM tests here, perhaps one should be added.

The most thorough testing done on this was running the openmp test suite using a compiler with the existing function pointer workaround deleted, which is how the compiler.used edge case was caught. That also caught the mis-accounting in promotealloca and a misaligned field. I don't think the IR tests have caught anything that the runtime didn't, though they will fail more precisely if things regress.

JonChesterfield marked an inline comment as done.Mon, Feb 15, 10:39 AM

Though the filecheck testing still passes, this change no longer works. Looking at some generated IR, it appears that the pass manager registration change leads to the pass being called repeatedly, which it doesn't tolerate very well. In particular, it is run on individual translation units, instead of only once at whole program link time.

A reasonable fix is probably to make the pass safe to run repeatedly. That's probably useful future proofing, as it means the pass can run to eliminate LDS variables, then if another pass decides to introduce some new ones, this can be run again to clean up.

  • run pass only once, from backend pass manager

Working again now. Mailing list revealed that the new pass manager isn't used for backends yet, so the last patch dropped the invocation from the opt pipeline. Left the plumbing in place (so the pass can still be run with the new manager, as in the tests). When the new pass manager is used for the amdgcn backend, we can slot this pass in roughly the same place as it runs now.

With this patch and amdgpu-fixed-function-abi=true, most of the generic openmp kernels in the aomp test suite pass with the function pointer hashing scheme disabled. That isn't quite the same as most generic kernels passing with trunk clang, though ones that don't use printf or malloc would be expected to.

Started on the path to making this safe to run repeatedly, with more LDS introduced in between each step. That makes the rewrite to access at 0 + offset unsafe. Would instead need to emit uses of the module directly, and patch allocateLDSGlobal to consider that specific variable to be fine to access from within a non-kernel function, as well as allocating it at zero as we presently do. That would be equivalent to the current scheme, except slightly more obvious what is going on in the IR, in exchange for being a more invasive change to the back end.

That change plus renaming the variable if already present would be correct for multiple invocations. Cleaner would be to also replace the module variable with scalars, SROA fashion, before starting the pass to avoid the nested struct buildup. I'd like to leave those revisions for later as this patch is already a couple of months in.

To clarify, "backend" means the backend codegen pipeline. This is part of the middle-end optimization pipeline, so it should be added there.

To clarify, "backend" means the backend codegen pipeline. This is part of the middle-end optimization pipeline, so it should be added there.

This is not an optimization, this is lowering

Oh I'm sorry, I thought this was in adjustPassManager(). Disregard my comment.