This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Target/AMDGPU/AMDGPU.h
llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
llvm/lib/Target/AMDGPU/CMakeLists.txt
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
llvm/test/CodeGen/AMDGPU/GlobalISel/lds-global-non-entry-func.ll
llvm/test/CodeGen/AMDGPU/addrspacecast-initializer-unsupported.ll
llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll
llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll
llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll
llvm/test/CodeGen/AMDGPU/lower-module-lds-indirect.ll
llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll
llvm/test/CodeGen/AMDGPU/lower-module-lds.ll
llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll

Differential D94648

[amdgpu] Implement lower function LDS pass
ClosedPublic

Authored by JonChesterfield on Jan 13 2021, 7:40 PM.

Details

Reviewers

hsmhsm
scchan
b-sumner
madhur13490
yaxunl
t-tye
msearles
acmeman925
arsenm
rampitec
jdoerfert
ronlieb
AlexVlx

Commits

rG13e49dcee48f: [amdgpu] Implement lower function LDS pass

Summary

[amdgpu] Implement lower function LDS pass

Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.

Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern shared construct from hip and variables marked
constant which survive the optimizer.

This is of interest to the openmp project because the deviceRTL runtime library
uses cuda shared variables from functions that cannot be inlined. Trunk llvm
therefore cannot compile some openmp kernels for amdgpu. In addition to the
unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled
and the function pointer hashing scheme deleted passes the openmp suite.

This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.
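
As a rough illustration of the layout step described in the summary, the sketch below packs a set of variables into one struct and records the offset each use would be rewritten to. It is a minimal standalone model, not the pass itself: LdsVar, Layout and layoutModuleStruct are hypothetical names, and sorting by descending alignment is an assumption made here to keep padding small.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an LDS global variable; not an LLVM type.
struct LdsVar {
  std::string name;
  uint64_t allocSize; // bytes, in the sense of a type's alloc size
  uint64_t align;     // alignment of the global itself, a power of two
};

struct Field {
  std::string name;
  uint64_t offset; // byte offset inside the combined struct
};

struct Layout {
  std::vector<Field> fields;
  uint64_t size = 0;
  uint64_t align = 1;
};

// Round value up to the next multiple of a power-of-two alignment.
inline uint64_t alignTo(uint64_t value, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0 && "power of two expected");
  return (value + align - 1) & ~(align - 1);
}

// Assumption: order by descending alignment. stable_sort keeps the input
// order for ties, so variables without a usable name still get a
// deterministic position.
inline Layout layoutModuleStruct(std::vector<LdsVar> vars) {
  std::stable_sort(vars.begin(), vars.end(),
                   [](const LdsVar &a, const LdsVar &b) {
                     return a.align > b.align;
                   });
  Layout out;
  for (const LdsVar &v : vars) {
    uint64_t offset = alignTo(out.size, v.align); // insert padding if needed
    out.fields.push_back({v.name, offset});
    out.size = offset + v.allocSize;
    out.align = std::max(out.align, v.align);
  }
  out.size = alignTo(out.size, out.align); // tail padding
  return out;
}
```

With inputs of (size, alignment) (1, 1), (8, 8) and (4, 4), this model places the 8 byte variable at offset 0, the 4 byte one at 8 and the 1 byte one at 12, for a 16 byte struct. Every kernel then pays for all 16 bytes, which is the "more LDS than strictly necessary" trade-off the summary mentions.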

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

JonChesterfield created this revision. Jan 13 2021, 7:40 PM

Herald added subscribers: kerbowa, jfb, mgrang and 7 others. Jan 13 2021, 7:40 PM

JonChesterfield requested review of this revision. Jan 13 2021, 7:40 PM

Herald added a reviewer: jdoerfert. Jan 13 2021, 7:40 PM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, sstefan1, wdng.

JonChesterfield added reviewers: ronlieb, AlexVlx. Jan 13 2021, 7:42 PM

Harbormaster completed remote builds in B85114: Diff 316557. Jan 13 2021, 8:41 PM

madhur13490 added inline comments. Jan 15 2021, 10:03 AM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
69 ↗	(On Diff #316557)	Erroneous characters in the comment.
201 ↗	(On Diff #316557)	Can we please split this function into logical blocks and wrap them in private functions? That would make the code more readable.
218 ↗	(On Diff #316557)	Such information is useful to know, so please use debugging messages with DEBUG_TYPE. Debugging messages would help us to spot issues.
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43	I think this name could be a bit more mangled. It is easy to have a lier in the file. Perhaps use a mechanism that randomly generates a string for the name, and use the same algorithm when dereferencing. That may be too fancy, but a somewhat more mangled name should be used.

Needs some test to stress different alignment scenarios. Also need some with these globals used in some weird constant initializers.

I also thought the idea was to have a constant memory table with pointers in it, not one giant LDS block

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
66–68 ↗	(On Diff #316557)	I think this search isn't quite right and will miss arbitrarily nested constant expressions. We already have similar code you need to analyze users in AMDGPUAnnotateKernelFeatures::visitConstantExprsRecursively to find constant LDS addrspacecasts anywhere they can appear.
113 ↗	(On Diff #316557)	Not sure why you need to bother considering this case, if you just treat it normally it should work
188 ↗	(On Diff #316557)	Definitely shouldn't be introducing inline asm, not sure why you are doing this. Also "r" is bad and we shouldn't support it
202 ↗	(On Diff #316557)	Probably should move this flag into the pass pipeline in AMDGPUTargetMachine
207 ↗	(On Diff #316557)	const &
224 ↗	(On Diff #316557)	llvm::sort
233–234 ↗	(On Diff #316557)	Should use the type alloc size. You are also ignoring the alignment of the global itself, which may differ from the type alignment
269 ↗	(On Diff #316557)	typeAllocSize
279–283 ↗	(On Diff #316557)	"null" in the IR is just 0. This is only treated as an invalid pointer in address space 0. -1 is used as the invalid pointer value and "null" in addrspace(3) is valid. Ideally this would be a property in the datalayout
294 ↗	(On Diff #316557)	Probably should use llvm.amdgcn prefix
llvm/test/CodeGen/AMDGPU/lower-function-lds-inactive.ll
1 ↗	(On Diff #316557)	Should run with both new and old PM since you handled both
4–6 ↗	(On Diff #316557)	Negative checks aren't particularly helpful. Needs positive checks for what's actually produced
36 ↗	(On Diff #316557)	Could also use some stores, intrinsic calls, cmpxchg. Also some more exotic users that tend to break, such as storing the value's address to itself
llvm/test/CodeGen/AMDGPU/lower-function-lds.ll
51 ↗	(On Diff #316557)	We're not really handling spir_kernel anymore, should use amdgpu_kernel

Great review guys, thanks! With the exception of avoiding inline asm (which I'd like to, but am not sure what to do about) I'll fix the rest when I'm back in the office.

I also thought the idea was to have a constant memory table with pointers in it, not one giant LDS block

That's the idea behind D91516. This one puts variables at the same address in each kernel and accesses them as cheaply as from a kernel in exchange for wasting LDS in various cases. We've slightly conflated lowering with indirect calls because D91516 doesn't handle them. This is partly simpler because it uses '0' instead of an extra function argument.

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
66–68 ↗	(On Diff #316557)	The '4' limits the depth of constant expressions analysed; will change to something using heap memory. I believe I can assume constantexpr are acyclic. This test will move variables into the struct unless it can show that is unnecessary, which is safe provided replaceAllUsesWith does the right thing. I may be missing cases on what the user can be.
69 ↗	(On Diff #316557)	Will rephrase. Those used by kernels and the llvm.used or llvm.compiler.used lists.
113 ↗	(On Diff #316557)	It changes the constant LDS variable to a mutable one. I don't think that matters much, since the variable is undef initialized, so assumptions about it always having the same value are dubious. I don't think any of the languages let one create a constant undef value, and if one did, it should have no uses and be erased, but I haven't verified that is the case. Basically I'm not certain what the right thing to do with a probably-useless constant undef value is so this pass ignores them.
188 ↗	(On Diff #316557)	This would be a hack. I wanted a construct that looks like a use of the instance (and won't be deleted by IR passes and generates minimal code), so that other passes will accurately account for the amount of LDS used by a kernel. Specifically promoteAlloca but I may have missed some. An intrinsic that evaporates later would work. I haven't thought of an alternative, will see if a cleaner answer comes to mind. (aside: what's bad about r in particular? I'm unfamiliar with our inline assembler, perhaps there's an immediate option instead)
202 ↗	(On Diff #316557)	Sure, will do.
233–234 ↗	(On Diff #316557)	Didn't see getTypeAllocSize, nice! getAlign is a private function that wraps the verbose getValueOrABITypeAlignment.
279–283 ↗	(On Diff #316557)	I was worried by Constant::getNullValue() returning zero for addrspace(3) but it does indeed seem to work ok. Drop the comment?

arsenm added inline comments. Jan 15 2021, 11:53 AM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
66–68 ↗	(On Diff #316557)	Probably should just use the same recursive search, even better if they are sharable
188 ↗	(On Diff #316557)	Well since the allocation point isn't really fixed yet, whether this size is really correct is questionable. AMDGPUPromoteAlloca currently assumes a worst case placement for padding to avoid going over. r is "pick any register of any class". We have a hard split between VGPRs and SGPRs, so "r" is unpredictable and not very helpful.
279–283 ↗	(On Diff #316557)	Yes
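
The search discussed in the 66–68 thread above can be sketched as a worklist walk that follows users through constant expressions to any depth, using heap storage rather than a fixed depth limit. This is a simplified standalone model under stated assumptions: Node, NodeKind and usedFromNonKernel are hypothetical names rather than LLVM classes, and a use from another global's initializer is ignored here purely to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical miniature of an IR use graph; not LLVM's Value/User classes.
enum class NodeKind { Global, ConstantExpr, InstInKernel, InstInFunction };

struct Node {
  NodeKind kind;
  std::vector<std::size_t> users; // indices of the nodes that use this one
};

// Returns true if `root` is used, through any depth of nested constant
// expressions, by an instruction in a non-kernel function.
inline bool usedFromNonKernel(const std::vector<Node> &graph,
                              std::size_t root) {
  assert(root < graph.size());
  std::vector<std::size_t> worklist{root};
  std::unordered_set<std::size_t> visited;
  while (!worklist.empty()) {
    std::size_t current = worklist.back();
    worklist.pop_back();
    if (!visited.insert(current).second)
      continue; // shared subexpression, already walked
    for (std::size_t user : graph[current].users) {
      switch (graph[user].kind) {
      case NodeKind::InstInFunction:
        return true;
      case NodeKind::ConstantExpr:
        worklist.push_back(user); // keep walking outwards
        break;
      case NodeKind::InstInKernel:
      case NodeKind::Global:
        break; // simplification: these users do not count in this model
      }
    }
  }
  return false;
}
```

The visited set means a subexpression with two users is only expanded once, and the worklist lives on the heap, so nesting deeper than four levels is handled the same way as shallow nesting.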

address some review comments

Fixed the easy parts. Adding a call to the new pass manager by RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-function-lds < %s | FileCheck %s caught some errors in the plumbing, now fixed.

I'm a little confused by the mixture of legacy and new pass manager. In particular, would like feedback on whether I've successfully expressed that this pass should run before PromoteAllocaToLDS.

Harbormaster completed remote builds in B85609: Diff 317373. Jan 18 2021, 9:33 AM

JonChesterfield mentioned this in D94961: [OpenMP] Add OpenMP offloading toolchain for AMDGPU. Jan 19 2021, 4:12 AM

hsmhsm added inline comments. Jan 19 2021, 6:19 AM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗	(On Diff #317373)	The function AMDGPU::isModuleEntryFunctionCC() returns true for graphics, shaders, SPIR (OpenCL?), etc. Is that what we expect here? Aren't we concerned here only with the calling convention CallingConv::AMDGPU_KERNEL?

address some review comments
Run tests under new pass manager

JonChesterfield added inline comments. Jan 19 2021, 8:27 AM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗	(On Diff #317373)	LLC (e.g. AMDGPULegalizerInfo) uses isModuleEntryFunction to detect non-kernel use of LDS, which resolves to a function in Utils/AMDGPUBaseInfo.cpp that returns true for AMDGPU_KERNEL, SPIR_KERNEL and various calling conventions I don't recognise. AMDGPU_VS etc. Some opencl I compiled as a sanity check used SPIR_KERNEL as the calling convention. I don't know whether that's right, only that this pass should use exactly the same predicate as the one guarding allocateLDSGlobal.
llvm/test/CodeGen/AMDGPU/lower-function-lds.ll
51 ↗	(On Diff #316557)	An OpenCL compilation emitted spir_kernel (somewhat to my surprise) so I thought I'd go for one of each. See also a comment above about wanting to be consistent with the guards around allocateLDSGlobal

hsmhsm added inline comments. Jan 19 2021, 8:42 AM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗	(On Diff #317373)	IMHO, LLC is a generic back-end for OpenCL, OpenMP, HIP, graphics shader languages, etc., so AMDGPULegalizerInfo might be using it generally. But this pass is concerned only with CallingConv::AMDGPU_KERNEL. That said, probably @arsenm or others in the review list could better clarify it.

Harbormaster completed remote builds in B85721: Diff 317575. Jan 19 2021, 9:23 AM

sebastian-ne added a subscriber: sebastian-ne. Jan 19 2021, 11:28 AM

sebastian-ne added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
54 ↗	(On Diff #317373)	If I understand the purpose of `isKernelCC` correctly, it should return true if the given function is able to allocate LDS (in the sense of “an entry point is allowed to allocate LDS”). That is exactly what `isModuleEntryFunctionCC` returns, so this looks right to me.

arsenm added inline comments. Jan 19 2021, 12:08 PM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
215 ↗	(On Diff #317575)	Probably should use stable_sort, if the values are anonymous the final compare won't provide ordering

address some review comments
Run tests under new pass manager
Extend testing, increase recur depth req

Herald added a subscriber: nikic. Jan 28 2021, 1:45 PM

Harbormaster completed remote builds in B87068: Diff 319954. Jan 28 2021, 2:26 PM

Remove stack depth limit

I think that's all the comments addressed other than objections to inline asm, which still passes zero to a "s" constrained register.

The inline asm makes the kernel use the newly created LDS structure, so that passes like PromoteAlloca can see that it uses said structure instance. Some alternatives are:

Modify PromoteAlloca (and any other passes that use size of LDS) to look for the magic variable. Means spreading knowledge of this transform across other passes.
Add an IR intrinsic, SDag and GlobalISel lowering to pseudo instruction, pseudo expansion to no-op. Semantically very similar to the inline asm.
Metadata - mark kernels as using +N bytes of LDS beyond what their uses suggest
Alternative lowering / transform

There are various optimisations available, e.g. metadata to mark functions as can't-use-lds, propagated, and used to drop the 'use' of the variable from some kernels, indirection to allow putting variables at different offsets across different kernels etc.

add assign to self clause to inactive test

JonChesterfield added inline comments. Jan 28 2021, 5:17 PM

llvm/test/CodeGen/AMDGPU/lower-function-lds-inactive.ll
36 ↗	(On Diff #316557)	Mixed it up a bit. As far as I can tell, ReplaceAllUsesWith works everywhere other than the compiler.used list, which looks like an oversight. The larger constexpr case has two uses of the same subexpression. Storing value's address to itself just behaves like any other non-undef initializer, i.e. ignored by the pass. Added to the inactive.ll test set.

Harbormaster completed remote builds in B87092: Diff 319995. Jan 28 2021, 5:46 PM

arsenm added inline comments. Jan 28 2021, 6:46 PM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗	(On Diff #316557)	This is a tricky one. I think checking for a specific intrinsic global variable when allocating the kernel's LDS wouldn't be too bad. However, I did come up with this alternative hack:

@arst = addrspace(3) global [256 x i32] zeroinitializer

declare void @llvm.donothing()

define void @foo() {
  call void @llvm.donothing() [ "dummyuse"([256 x i32] addrspace(3)* @arst) ]
  ret void
}

This at least solves the PromoteAlloca case. The use disappears for the actual codegen amount so that doesn't quite solve everything. I guess an intrinsic that takes a pointer and returns the same value would be slightly better without changing the existing LDS lowering
102 ↗	(On Diff #319995)	No reason to mention llc
142 ↗	(On Diff #319995)	I feel like this should be a utility function somewhere else if it really doesn't already exist
218 ↗	(On Diff #320008)	const first
291 ↗	(On Diff #320008)	Doesn't really have to do with functions anymore. llvm.amdgcn.module.lds.t?
299 ↗	(On Diff #320008)	.module?
331–332 ↗	(On Diff #320008)	Merge to one if

Harbormaster completed remote builds in B87099: Diff 320008. Jan 28 2021, 6:51 PM

s/function/module/g

Harbormaster completed remote builds in B87547: Diff 320828. Feb 2 2021, 10:38 AM

Replace inline asm with donothing and operand

@arsenm addressed last round of comments. Apologies for missing it on Friday, I don't seem to be getting email from reviews.llvm. I like the donothing hack.

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
142 ↗	(On Diff #319995)	Yes, but I don't want to propose it as such since that is likely to slow the patch process down further. Better to land it here and attempt to promote it later.
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188	I quite like the donothing alternative to inline asm. It does indeed keep the use alive long enough. A future change to the pipeline might break that, but it'll do so fairly obviously (all the openmp stuff stops working, for one). I think we go with annotated donothing for now, and implement an intrinsic -> pseudo sequence when/if it becomes necessary. Written a fairly long comment to that effect in the source.

Harbormaster completed remote builds in B87557: Diff 320851. Feb 2 2021, 11:24 AM

Revert unintended test change, 0/undef

Harbormaster completed remote builds in B87576: Diff 320896. Feb 2 2021, 2:02 PM

aeubanks added a subscriber: aeubanks. Feb 2 2021, 3:04 PM

aeubanks added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	This should always be added here regardless of the flag, this is just for registering that the pass exists. Rather, the pass should also be added in `registerCGSCCOptimizerLateEPCallback()` below guarded by the flag.

Adjust pass registration

JonChesterfield added inline comments. Feb 3 2021, 9:29 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	Welcome! Thank you very much for the input, there's been more guesswork in the new pass manager parts than I like. Can't seem to find any documentation on how it works. Dropped the test here, added to CGSCC. It's a module pass and the other things there were function passes, but instantiating a new ModulePassManager seems to work fine.

aeubanks added inline comments. Feb 3 2021, 9:44 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	I should definitely write some documentation somewhere. Adding it to a `ModulePassManager` you created doesn't do anything since it's not getting added to the overall pipeline. For example, the `FunctionPassManager` there is added to the `CGSCCPassManager`. For the legacy PM it doesn't really make sense to add a module pass at `EP_CGSCCOptimizerLate` since it'll end up breaking the CGSCC pipeline. Normally it runs the CGSCC passes and the function pipeline on each function in an SCC as it visits the SCCs, but with a module pass in the middle those will get split up. The new PM just makes this whole thing explicit via nesting when adding passes.

JonChesterfield added inline comments. Feb 3 2021, 10:07 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	Ah, I missed the PM.addPass at the end. createCGSCCToModulePassAdaptor doesn't appear to exist, and the types of createModuleToPostOrderCGSCCPassAdaptor suggest that's wrong too. What should I do with this pass then? It's not hugely crucial when it runs, provided it's before PromoteAlloca, which is a function pass in CGSCCOptimizerLate.

Harbormaster completed remote builds in B87722: Diff 321125. Feb 3 2021, 10:34 AM

aeubanks added inline comments. Feb 3 2021, 10:41 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	Does `EP_ModuleOptimizerEarly`/`registerPipelineEarlySimplificationEPCallback()` work?

JonChesterfield added inline comments. Feb 3 2021, 12:40 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	Sure, moving it further forward is fine. What's the difference between registerPipeline and adjustPassManager's addExtension EP_ModuleOptimizerEarly? Do I want both?

aeubanks added inline comments. Feb 3 2021, 1:03 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	`adjustPassManager()` is for the legacy pass manager, and `registerPassBuilderCallbacks()` is for the new pass manager

Better pass registration

JonChesterfield added inline comments. Feb 3 2021, 2:13 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
516	Ah yes, thank you. Should be OK now. It will be a good day when we have one pass manager and drop this duplication.

Harbormaster completed remote builds in B87781: Diff 321226. Feb 3 2021, 4:27 PM

update pipeline test, missed locally because it requires asserts

Harbormaster completed remote builds in B87871: Diff 321376. Feb 4 2021, 5:24 AM

arsenm added inline comments. Feb 4 2021, 7:33 AM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗	(On Diff #316557)	Could also avoid this if the kernel already has a direct reference to the LDS, which it likely does

JonChesterfield added inline comments. Feb 4 2021, 7:36 AM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLDSPass.cpp
188 ↗	(On Diff #316557)	Sure. We wouldn't gain anything though, specifically no saving in LDS. It also carries the slight risk that the existing direct reference to LDS gets dead code eliminated at an inconvenient time.

Ping. Openmp is still blocked on this

sameerds added a subscriber: sameerds. Feb 8 2021, 8:21 AM

rebase

Should add some tests where the same LDS appears in multiple functions/kernels

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188	But if there are no pre-existing uses of the LDS in the kernel, this won't end up getting allocated in the kernel

Harbormaster completed remote builds in B88477: Diff 322416. Feb 9 2021, 10:05 AM

JonChesterfield added inline comments. Feb 10 2021, 9:11 AM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
188	If all uses of LDS are from a kernel, this pass does nothing. Otherwise:
every kernel gets a call to llvm.donothing (previously inline asm) that looks like a use of the per-module struct
every kernel allocates the size of the per-module struct, regardless of whether the llvm.donothing is present or not
See the constructor AMDGPUMachineFunction::AMDGPUMachineFunction. If the symbol llvm.amdgcn.module.lds is present, allocateLDSGlobal is called on it, before any other calls to allocateLDSGlobal in order to reliably guess that the offset returned will be zero.
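
The ordering argument in the comment above can be modelled with a small bump allocator: whichever variable is requested first lands at offset zero. This is only a sketch under assumptions. LdsAllocator is a hypothetical class written for this note and is not the real allocateLDSGlobal, whose bookkeeping may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical bump allocator modelling per-kernel LDS assignment.
class LdsAllocator {
public:
  // Returns the offset of `name`, allocating it on the first request.
  uint32_t allocate(const std::string &name, uint32_t size, uint32_t align) {
    assert(align != 0 && (align & (align - 1)) == 0 && "power of two expected");
    auto it = offsets.find(name);
    if (it != offsets.end())
      return it->second; // already placed, offset is stable
    uint32_t offset = (used + align - 1) & ~(align - 1);
    used = offset + size;
    offsets.emplace(name, offset);
    return offset;
  }

  uint32_t bytesUsed() const { return used; }

private:
  std::map<std::string, uint32_t> offsets;
  uint32_t used = 0;
};
```

In this model, requesting "llvm.amdgcn.module.lds" before anything else always returns 0, so uses rewritten to a fixed offset from zero stay valid. Requesting it after a kernel-local variable returns a non-zero offset, which is the situation the allocation order is meant to rule out.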

arsenm added inline comments. Feb 10 2021, 9:15 AM

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43	Ah, I missed this part before. However, this isn't the right place for this. I've been trying to fix having code that depends on the function itself in the MachineFunctionInfo constructor. The point this is constructed isn't well defined (i.e. this is not a pass), so depending on whether you are running a MIR pass or something this may not work as expected. It's a bit hacky, but you could stick this allocation in LowerFormalArguments since we don't really have a better pre-lowering hook

JonChesterfield added inline comments. Feb 10 2021, 9:47 AM

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
43	Ah. I was working on the basis that any instance of this class can be used to call allocateLDSGlobal, thus the constructor neatly catches every path. Bad assumption. If using inline asm, we could put the allocation shortly before the two existing uses of allocateLDSGlobal (from SDag and GlobalISel), as that ensures there will be at least one reference to an LDS global from the kernel. Changing to donothing breaks that, as the call can be removed beforehand, so kernels that don't have any direct LDS uses will miss the handling. LowerFormalArguments should work, will update to allocate it from there.

aeubanks removed a subscriber: aeubanks. Feb 10 2021, 9:49 AM

Move alloc out of machinefunction ctor

In D94648#2551731, @arsenm wrote:

Should add some tests where the same LDS appears in multiple functions/kernels

var0 in lower-module-lds.ll does. I'm not particularly worried about the update misfiring as it's just a call to replaceAllUsesWith. Which turned out to miss compiler.used, but seems to cover everything else.

The tests are a bit mixed. E.g. checking the alignment padding is introduced is in the checks on the generated type. There aren't any IR->ASM tests here, perhaps one should be added.

The most thorough testing done on this was running the openmp test suite using a compiler with the existing function pointer workaround deleted, which is how the compiler.used edge case was caught. That also caught the mis-accounting in promotealloca and a misaligned field. I don't think the IR tests have caught anything that the runtime didn't, though they will fail more precisely if things regress.

JonChesterfield marked an inline comment as done. Feb 15 2021, 10:39 AM

Harbormaster completed remote builds in B89260: Diff 323791. Feb 15 2021, 12:05 PM

Though the filecheck testing still passes, this change no longer works. Looking at some generated IR, it appears that the pass manager registration change leads to the pass being called repeatedly, which it doesn't tolerate very well. In particular, it is run on individual translation units, instead of only once at whole program link time.

A reasonable fix is probably to make the pass safe to run repeatedly. That's probably useful future proofing, as it means the pass can run to eliminate LDS variables, then if another pass decides to introduce some new ones, this can be run again to clean up.

run pass only once, from backend pass manager

Working again now. Mailing list revealed that the new pass manager isn't used for backends yet, so the last patch dropped the invocation from the opt pipeline. Left the plumbing in place (so the pass can still be run with the new manager, as in the tests). When the new pass manager is used for the amdgcn backend, we can slot this pass in roughly the same place as it runs now.

With this patch and amdgpu-fixed-function-abi=true, most of the generic openmp kernels in the aomp test suite pass with the function pointer hashing scheme disabled. That isn't quite the same as most generic kernels passing with trunk clang, though ones that don't use printf or malloc would be expected to.

Started on the path to making this safe to run repeatedly, with more LDS introduced in between each step. That makes the rewrite to access at 0 + offset unsafe. Would instead need to emit uses of the newly created struct (i.e. not 0 + offset), and patch allocateLDSGlobal to consider that specific variable to be fine to access from within a non-kernel function, as well as allocating it at zero as we presently do. That would be equivalent to the current scheme, except slightly more obvious what is going on in the IR, in exchange for being a more invasive change to the back end.

That change plus renaming the variable if already present would be correct for multiple invocations. Cleaner would be to also replace the module variable with scalars, SROA fashion, before starting the pass to avoid the nested struct buildup. I'd like to leave those revisions for later as this patch is already a couple of months in.

To clarify, "backend" means the backend codegen pipeline. This is part of the middle-end optimization pipeline, so it should be added there.

Harbormaster completed remote builds in B90685: Diff 326204. Feb 24 2021, 4:28 PM

In D94648#2586256, @aeubanks wrote:

To clarify, "backend" means the backend codegen pipeline. This is part of the middle-end optimization pipeline, so it should be added there.

This is not an optimization, this is lowering

Oh I'm sorry, I thought this was in adjustPassManager(). Disregard my comment.

arsenm added inline comments. Mar 3 2021, 6:43 PM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142	Needs a comment
175	Needs a comment

JonChesterfield added inline comments. Mar 7 2021, 3:43 PM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142	A comment saying what? The function and helper does what the name and parameter types claim it'll do in almost as boring a fashion as possible.

arsenm added inline comments. Mar 8 2021, 11:39 AM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142	On first read it sounds very generic, and not related to the special intrinsic global variables. I had to read the function to see what it actually did

JonChesterfield added inline comments. Mar 11 2021, 9:28 AM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142	It is generic. If this lands, I'm hoping to move it under ModuleUtils to go alongside appendToUsed which it closely resembles. Probably as one entry point to remove a set/sequence of constants from llvm.used and a different entry point to remove them from llvm.compiler.used. As it's far from certain whether this will land, I don't want to propose a function for ModuleUtils with no users, as it'll be rightly rejected as dead code.

arsenm accepted this revision. Mar 11 2021, 6:04 PM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
142	Committing utilities before uses is not unheard of

This revision is now accepted and ready to land. Mar 11 2021, 6:04 PM

aeubanks removed a subscriber: aeubanks. Mar 11 2021, 6:09 PM

rebase

Harbormaster completed remote builds in B93805: Diff 330642. Mar 15 2021, 7:02 AM

JonChesterfield added inline comments. Mar 15 2021, 7:18 AM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
233	collectUsedGlobalVariables was removed by a patch following D97128, fixing up.

use new api for collectUsedGlobalVariables

Harbormaster completed remote builds in B93810: Diff 330655. Mar 15 2021, 8:05 AM

Closed by commit rG13e49dcee48f: [amdgpu] Implement lower function LDS pass (authored by JonChesterfield). Mar 15 2021, 8:24 AM

This revision was automatically updated to reflect the committed changes.

JonChesterfield added a commit: rG13e49dcee48f: [amdgpu] Implement lower function LDS pass.

Note to self - there is ongoing interest in minimising the LDS usage of applications. This patch allocates the struct in every kernel (see the call to markUsedByKernel, it is applied exactly once to each kernel), in order to support calls to functions that make use of that struct.

This could be refined. Kernels that make no calls don't need to unconditionally allocate this struct. If the kernel itself does use some LDS that was moved into it, that use will remain and suffice to trigger allocation of the struct as normal. More difficult to compute (one for the attributor?), kernels that call no functions that could refer to that struct also don't need to allocate it.
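
The refinement described in the paragraph above amounts to a reachability query on the call graph. The sketch below is a hypothetical illustration of that idea, not something this patch implements: Func and needsModuleLds are invented names, and treating any unknown callee (for example an indirect call) as needing the struct is a conservative assumption added here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical call-graph node; not an LLVM data structure.
struct Func {
  bool isKernel;
  bool usesModuleLds;    // body refers to the module struct directly
  bool hasUnknownCallee; // e.g. an indirect call, so assume the worst
  std::vector<std::size_t> callees; // indices of directly called functions
};

// A kernel needs the module struct if it, or anything it can reach through
// calls, refers to the struct or calls something that cannot be analysed.
inline bool needsModuleLds(const std::vector<Func> &cg, std::size_t kernel) {
  assert(kernel < cg.size() && cg[kernel].isKernel);
  std::vector<bool> seen(cg.size(), false);
  std::vector<std::size_t> stack{kernel};
  while (!stack.empty()) {
    std::size_t f = stack.back();
    stack.pop_back();
    if (seen[f])
      continue; // also makes recursion in the call graph terminate
    seen[f] = true;
    if (cg[f].usesModuleLds || cg[f].hasUnknownCallee)
      return true;
    for (std::size_t callee : cg[f].callees)
      stack.push_back(callee);
  }
  return false;
}
```

A kernel with no calls and no direct use returns false, matching the first case in the note. A kernel whose callees never touch the struct also returns false, which is the case the note describes as harder to compute.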

A simplified variant on @hsmhsm's proposal: an LDS variable that is used from an internal function that has not had its address taken could be passed into the function by pointer from the caller, ultimately leaving the &var, i.e. the use of that variable, in the top level kernel. Access to that variable would be slower than in this patch - an extra dereference, and loss of an argument register to propagate the address down the call tree - but it would move the variable out of the combined struct for a saving of LDS in other kernels. For large variables and scarce LDS that is probably a win.

See also a note further up about maintaining the name of the variable through the IR, instead of using '0' directly, as that would make the IR easier to read. Particularly useful if we end up refining this pass further.

Just run a bug to ground here. Replacing the inline asm with the donothing intrinsic is prettier, but also doesn't work as is. This is a flaw in the above only tested at the IR and application level. This shows up quickly (if opaquely) at the executable unit test scale.

Given:

target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn-amd-amdhsa"

@global_barrier_state = hidden addrspace(3) global i32 undef, align 4

define i32 @rw() #0 {
entry:
  %0 = atomicrmw add i32 addrspace(3)* @global_barrier_state, i32 1 acq_rel, align 4
  ret i32 %0
}

define amdgpu_kernel void @__device_start() {
entry:
  %0 = call i32 @rw()
  ret void
}

attributes #0 = { noinline  }

This transform does exactly what it was intended to, the LDS variable allocated at zero, but the kernel metadata starts:

	.amdhsa_kernel __device_start
		.amdhsa_group_segment_fixed_size 0 ; should be 4, isn't
	.end_amdhsa_kernel

If the inline asm is reintroduced, that goes to 4. Similarly, if the test case that reduced to this is modified to allocate 4 bytes more LDS than the metadata asks for, it works again.

I suspect there is something in hardware that rounds LDS allocation up to a boundary, so as long as the kernel looks like it uses some non-zero amount of LDS, the out of bounds read hits in the allocated region.

I suspect there is something in hardware that rounds LDS allocation up to a boundary, so as long as the kernel looks like it uses some non-zero amount of LDS, the out of bounds read hits in the allocated region.

Yes the LDS size is rounded up as described in the GRANULATED_LDS_SIZE field in the compute_pgm_rsrc2 table at:

https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc2-gfx6-gfx10-table
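
The rounding explanation above can be sketched as follows. This is an illustration only: the 512-byte granule used in the usage note is an assumed value (the real granularity is device specific, per the linked table), and roundUpLDS and accessIsBacked are hypothetical helper names, not LLVM or hardware APIs.

```cpp
#include <cassert>
#include <cstdint>

// Round a requested LDS byte count up to a whole number of granules.
// GranuleBytes is a parameter because the real granularity depends on the
// device; any concrete value passed in is an assumption of the caller.
uint64_t roundUpLDS(uint64_t RequestedBytes, uint64_t GranuleBytes) {
  assert(GranuleBytes != 0 && "granule must be non-zero");
  return (RequestedBytes + GranuleBytes - 1) / GranuleBytes * GranuleBytes;
}

// True if an access of AccessBytes at Offset falls inside the rounded-up
// allocation, even if it is out of bounds for the size the metadata requested.
bool accessIsBacked(uint64_t RequestedBytes, uint64_t GranuleBytes,
                    uint64_t Offset, uint64_t AccessBytes) {
  return Offset + AccessBytes <= roundUpLDS(RequestedBytes, GranuleBytes);
}
```

With an assumed 512-byte granule, a request of 0 bytes stays at 0, so a 4-byte access at offset 0 is not backed, while a request of 4 bytes rounds up to 512 and also backs an access at offset 4. That would be consistent with the observation that any non-zero group segment size hides the missing allocation.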

In D94648#2756696, @JonChesterfield wrote:

Just run a bug to ground here. Replacing the inline asm with the donothing intrinsic is prettier, but also doesn't work as is. This is a flaw in the above only tested at the IR and application level. This shows up quickly (if opaquely) at the executable unit test scale.

Yes I thought I realized this before, where we're not going to see the use in codegen to allocate it. However I thought this was fixed by using the special intrinsic global variable which we always allocate?

Nice reference to GRANULATED_LDS_SIZE, thanks!

We unconditionally allocate the module variable from lowerFormalArgumentsKernel, which still looks right to me. My current theory is there's some hook between that and the metadata writer that needs to be poked from the above code and isn't, but I haven't worked through the metadata setup code yet. Superficially it looks like it is keyed off the AMDGPUMachineFunction.h, but in that case it should be working. Going to need a debug build I think.

Looking back I see

The use disappears for the actual codegen amount so that doesn't quite solve everything

which correlates strongly with this bug, though I didn't make the connection at the time.

Inline asm does keep the use alive long enough to reach the metadata in the binary. An intrinsic would doubtless achieve the same if it was eliminated late enough. Need to find out what late enough is to see how much plumbing that requires.

Worth noting given the recent discussions about LDS usage that this patch puts the module variable in every kernel. If the allocation was pinned to the presence of the intrinsic, or if there was an attribute for no-module-lds-needed-in-this-kernel, that could be eliminated.

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

for (const Argument &Arg : F.args()) { guards the sdag entry, and the same expression guards the gisel entry

In D94648#2757793, @JonChesterfield wrote:

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

Not seeing how it isn't called with no arguments? It should still be called anyway

In D94648#2758386, @arsenm wrote:

In D94648#2757793, @JonChesterfield wrote:

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

Not seeing how it isn't called with no arguments? It should still be called anyway

I expected that too, but debug statements around allocateLDSGlobal didn't fire, and looking at the control flow around the sdag and globalisel paths lowerFormalArguments is called from within a loop for (const Argument &Arg : F.args()) {}. I may be missing something of course (didn't put a lot of time into chasing this), but it definitely looks like lowerFormalArguments doesn't get called when there are no arguments.

In D94648#2769780, @JonChesterfield wrote:

In D94648#2758386, @arsenm wrote:

In D94648#2757793, @JonChesterfield wrote:

edit: lowerFormalArguments is not called if there are no formal arguments to the kernel. Test case I started from does pass arguments to the kernel, but they were unused and eliminated.

Not seeing how it isn't called with no arguments? It should still be called anyway

I expected that too, but debug statements around allocateLDSGlobal didn't fire, and looking at the control flow around the sdag and globalisel paths lowerFormalArguments is called from within a loop for (const Argument &Arg : F.args()) {}. I may be missing something of course (didn't put a lot of time into chasing this), but it definitely looks like lowerFormalArguments doesn't get called when there are no arguments.

Hi Jon,

For the 0 argument kernel, did you put a debug statement within SITargetLowering::LowerFormalArguments() and test whether it is hit? My experimentation shows that it is indeed hit for 0 arg kernels too, so the problem is not with 0 arg kernels.

Problem is within the function - AMDGPUMachineFunction::allocateModuleLDSGlobal() that you wrote.

The statement

GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds");

is supposed to be replaced by

GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds", true);

My understanding is that for Module to find an internal global variable in the symbol table, we need to explicitly tell it to do so by passing "true" as an additional argument to getGlobalVariable(). Otherwise, the internal symbol won't be found.
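
A minimal stand-alone model of that lookup behaviour, for illustration. ToyModule and ToyGlobal are hypothetical stand-ins for llvm::Module and llvm::GlobalVariable, and the sketch only mirrors the behaviour described in this comment (a symbol with local linkage is skipped unless the second argument is true); it is not the LLVM implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for llvm::GlobalVariable.
struct ToyGlobal {
  std::string Name;
  bool HasLocalLinkage; // true for internal or private linkage
};

// Hypothetical stand-in for llvm::Module and its symbol table.
class ToyModule {
  std::map<std::string, ToyGlobal> Symbols;

public:
  void add(const std::string &Name, bool HasLocalLinkage) {
    Symbols[Name] = ToyGlobal{Name, HasLocalLinkage};
  }

  // Returns nullptr when the symbol is missing, or when it has local linkage
  // and the caller did not opt in with AllowInternal.
  const ToyGlobal *getGlobalVariable(const std::string &Name,
                                     bool AllowInternal = false) const {
    assert(!Name.empty() && "lookup requires a name");
    auto It = Symbols.find(Name);
    if (It == Symbols.end())
      return nullptr;
    if (!AllowInternal && It->second.HasLocalLinkage)
      return nullptr;
    return &It->second;
  }
};
```

In this model, after add("llvm.amdgcn.module.lds", true), the one-argument lookup returns nullptr and the lookup with true returns the variable, which matches the symptom of the module LDS variable silently not being allocated.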

Further, perhaps, you also need to add one lit test like below, call llc and make sure that @llvm.amdgcn.module.lds is allocated at address 0.

@llvm.amdgcn.module.lds = internal unnamed_addr addrspace(3) global [16 x i8] undef, align 16

define amdgpu_kernel void @kern() {
  %llvm.amdgcn.module.lds.bc = bitcast [16 x i8] addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*
   store i8 6, i8 addrspace(3)* %llvm.amdgcn.module.lds.bc, align 16

  ret void
}

CMD: llc -march=amdgcn -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 --amdhsa-code-object-version=4 lds-allocation2.ll -o tmp.s

assembly generated:

v_mov_b32_e32 v0, 0
v_mov_b32_e32 v1, 6
ds_write_b8 v0, v1
s_endpgm

We should probably CHECK for the instruction pattern - v_mov_b32_e32 v0, 0

hsmhsm added inline comments. May 20 2021, 5:36 AM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
179	Every GlobalVariable is a Constant (ref - https://llvm.org/doxygen/classllvm_1_1Constant.html). So why do we need dyn_cast<> and an if conditional check here? Can we not cast<> directly to Constant?

hsmhsm added inline comments. May 20 2021, 5:43 AM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
303	Will the logic within the above for loop ensure that the first non-padding (real) member of the struct is accessed at the same address as the struct instance?

In D94648#2770088, @hsmhsm wrote:

Hi Jon,

For the 0 argument kernel, did you put a debug statement within SITargetLowering::LowerFormalArguments() and test whether it is hit? My experimentation shows that it is indeed hit for 0 arg kernels too, so the problem is not with 0 arg kernels.

Could be. I'm very limited in the time I can spend on this, debugging was a few spare minutes.

Problem is within the function - AMDGPUMachineFunction::allocateModuleLDSGlobal() that you wrote.

The statement
GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds");
is supposed to be replaced by
GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds", true);

API docs agree. This will be a problem in AMDGPULowerModuleLDS::removeFromUsedList as well, the getGlobalVariable at the entry to the function will be ignoring internal variables, so also needs the ,true.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
179	We don't need dyn_cast here, cast is fine
303	it'll be at zero, so, yes
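
A stand-alone sketch of the layout logic under discussion, for illustration: sort by alignment (descending), then size (descending), then name, and insert padding so each member starts at a multiple of its alignment. Var, Placed and layoutVars are hypothetical names that model only the offset arithmetic, not the pass itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one LDS variable: name, alignment, size in bytes.
struct Var {
  std::string Name;
  uint64_t Align;
  uint64_t Size;
};

// Resulting byte offset of a variable within the combined struct.
struct Placed {
  std::string Name;
  uint64_t Offset;
};

std::vector<Placed> layoutVars(std::vector<Var> Vars) {
  // Largest alignment first minimises padding; ties fall back to size, then
  // name, so the order is deterministic.
  std::stable_sort(Vars.begin(), Vars.end(), [](const Var &L, const Var &R) {
    if (L.Align != R.Align)
      return L.Align > R.Align;
    if (L.Size != R.Size)
      return L.Size > R.Size;
    return L.Name < R.Name;
  });

  std::vector<Placed> Out;
  uint64_t Offset = 0;
  for (const Var &V : Vars) {
    assert(V.Align != 0 && "alignment must be non-zero");
    // Pad up to the next multiple of the alignment when needed.
    if (uint64_t Rem = Offset % V.Align)
      Offset += V.Align - Rem;
    Out.push_back({V.Name, Offset});
    Offset += V.Size;
  }
  return Out;
}
```

Because the running offset starts at zero and zero is a multiple of every alignment, the first member never needs padding in front of it, which is why it shares its address with the struct instance. For example, members (align 16, size 16), (align 8, size 12) and (align 8, size 8) land at offsets 0, 16 and 32, with 4 bytes of padding before the last one.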

I can't find any evidence that the bug discussed in the comments here was fixed. Checking quickly now, it looks like LowerFormalArguments is called correctly, getGlobalVariable will ignore internal variables and the module lds is created with internal linkage. I think that's still broken, will add it to the todo list for this week.

The repro above that used to emit .amdhsa_group_segment_fixed_size 0 now emits .amdhsa_group_segment_fixed_size 4, as it should. I'm unclear where the behaviour changed. Will debug through.

edit: Was fixed in passing by https://reviews.llvm.org/D102882 by replacing getGlobalVariable with getNamedGlobal, will check in the above repro as a regression test.

Herald added a subscriber: foad. Dec 12 2021, 12:11 PM

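Revision Contents

Path                                                                          Size
llvm/lib/Target/AMDGPU/AMDGPU.h                                               8 lines
llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp                                 4 lines
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp                           380 lines
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h                                1 line
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp                              12 lines
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp                                9 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp                                15 lines
llvm/lib/Target/AMDGPU/CMakeLists.txt                                         1 line
llvm/lib/Target/AMDGPU/SIISelLowering.cpp                                     2 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/lds-global-non-entry-func.ll              4 lines
llvm/test/CodeGen/AMDGPU/addrspacecast-initializer-unsupported.ll             2 lines
llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll                         4 lines
llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll                     47 lines
llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll                         68 lines
llvm/test/CodeGen/AMDGPU/lower-module-lds-indirect.ll                         39 lines
llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll                        37 lines
llvm/test/CodeGen/AMDGPU/lower-module-lds.ll                                  56 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll            2 lines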

Diff 330669

llvm/lib/Target/AMDGPU/AMDGPU.h

[... 65 lines hidden ...]
 FunctionPass *createAMDGPUSimplifyLibCallsPass(const TargetMachine *);
 FunctionPass *createAMDGPUUseNativeCallsPass();
 FunctionPass *createAMDGPUCodeGenPreparePass();
 FunctionPass *createAMDGPULateCodeGenPreparePass();
 FunctionPass *createAMDGPUMachineCFGStructurizerPass();
 FunctionPass *createAMDGPUPropagateAttributesEarlyPass(const TargetMachine *);
 ModulePass *createAMDGPUPropagateAttributesLatePass(const TargetMachine *);
 FunctionPass *createAMDGPURewriteOutArgumentsPass();
+ModulePass *createAMDGPULowerModuleLDSPass();
 FunctionPass *createSIModeRegisterPass();

 struct AMDGPUSimplifyLibCallsPass : PassInfoMixin<AMDGPUSimplifyLibCallsPass> {
   AMDGPUSimplifyLibCallsPass(TargetMachine &TM) : TM(TM) {}
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);

 private:
   TargetMachine &TM;
[... 58 lines hidden ...] struct AMDGPUPropagateAttributesLatePass
     : PassInfoMixin<AMDGPUPropagateAttributesLatePass> {
   AMDGPUPropagateAttributesLatePass(TargetMachine &TM) : TM(TM) {}
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);

 private:
   TargetMachine &TM;
 };

+void initializeAMDGPULowerModuleLDSPass(PassRegistry &);
+extern char &AMDGPULowerModuleLDSID;

+struct AMDGPULowerModuleLDSPass : PassInfoMixin<AMDGPULowerModuleLDSPass> {
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};

 void initializeAMDGPURewriteOutArgumentsPass(PassRegistry &);
 extern char &AMDGPURewriteOutArgumentsID;

 void initializeGCNDPPCombinePass(PassRegistry &);
 extern char &GCNDPPCombineID;

 void initializeR600ClauseMergePassPass(PassRegistry &);
 extern char &R600ClauseMergePassID;
[... 263 lines hidden ...]

llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp

[... 494 lines hidden ...] bool AMDGPUCallLowering::lowerFormalArgumentsKernel(
     MachineIRBuilder &B, const Function &F,
     ArrayRef<ArrayRef<Register>> VRegs) const {
   MachineFunction &MF = B.getMF();
   const GCNSubtarget *Subtarget = &MF.getSubtarget<GCNSubtarget>();
   MachineRegisterInfo &MRI = MF.getRegInfo();
   SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
   const SIRegisterInfo *TRI = Subtarget->getRegisterInfo();
   const SITargetLowering &TLI = *getTLI<SITargetLowering>();

   const DataLayout &DL = F.getParent()->getDataLayout();

+  Info->allocateModuleLDSGlobal(F.getParent());

   SmallVector<CCValAssign, 16> ArgLocs;
   CCState CCInfo(F.getCallingConv(), F.isVarArg(), MF, ArgLocs, F.getContext());

   allocateHSAUserSGPRs(CCInfo, B, MF, TRI, Info);

   unsigned i = 0;
   const Align KernArgBaseAlign(16);
   const unsigned BaseOffset = Subtarget->getExplicitKernelArgOffset(F);
[... 72 lines hidden ...] bool AMDGPUCallLowering::lowerFormalArguments(
   MachineFunction &MF = B.getMF();
   MachineBasicBlock &MBB = B.getMBB();
   MachineRegisterInfo &MRI = MF.getRegInfo();
   SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
   const GCNSubtarget &Subtarget = MF.getSubtarget<GCNSubtarget>();
   const SIRegisterInfo *TRI = Subtarget.getRegisterInfo();
   const DataLayout &DL = F.getParent()->getDataLayout();

+  Info->allocateModuleLDSGlobal(F.getParent());

   SmallVector<CCValAssign, 16> ArgLocs;
   CCState CCInfo(CC, F.isVarArg(), MF, ArgLocs, F.getContext());

   if (!IsEntryFunc) {
     Register ReturnAddrReg = TRI->getReturnAddressReg(MF);
     Register LiveInReturn = MF.addLiveIn(ReturnAddrReg,
                                          &AMDGPU::SGPR_64RegClass);
[... 471 lines hidden ...]

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp

This file was added.

				//===-- AMDGPULowerModuleLDSPass.cpp ------------------------------- C++ --=//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This pass eliminates LDS uses from non-kernel functions.
				//
				// The strategy is to create a new struct with a field for each LDS variable
				// and allocate that struct at the same address for every kernel. Uses of the
				// original LDS variables are then replaced with compile time offsets from that
				// known address. AMDGPUMachineFunction allocates the LDS global.
				//
				// Local variables with constant annotation or non-undef initializer are passed
				// through unchanged for simplication or error diagnostics in later passes.
				//
				// To reduce the memory overhead variables that are only used by kernels are
				// excluded from this transform. The analysis to determine whether a variable
				// is only used by a kernel is cheap and conservative so this may allocate
				// a variable in every kernel when it was not strictly necessary to do so.
				//
				// A possible future refinement is to specialise the structure per-kernel, so
				// that fields can be elided based on more expensive analysis.
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "Utils/AMDGPUBaseInfo.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/IR/Constants.h"
				#include "llvm/IR/DerivedTypes.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/InlineAsm.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Pass.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Transforms/Utils/ModuleUtils.h"
				#include <algorithm>
				#include <vector>

				#define DEBUG_TYPE "amdgpu-lower-module-lds"

				using namespace llvm;

				namespace {

				class AMDGPULowerModuleLDS : public ModulePass {

				static bool isKernelCC(Function *Func) {
				return AMDGPU::isModuleEntryFunctionCC(Func->getCallingConv());
				}

				static Align getAlign(DataLayout const &DL, const GlobalVariable *GV) {
				return DL.getValueOrABITypeAlignment(GV->getPointerAlignment(DL),
				GV->getValueType());
				}

				static bool
				userRequiresLowering(const SmallPtrSetImpl<GlobalValue *> &UsedList,
				User *InitialUser) {
				// Any LDS variable can be lowered by moving into the created struct
				// Each variable so lowered is allocated in every kernel, so variables
				// whose users are all known to be safe to lower without the transform
				// are left unchanged.
				SmallPtrSet<User *, 8> Visited;
				SmallVector<User *, 16> Stack;
				Stack.push_back(InitialUser);

				while (!Stack.empty()) {
				User *V = Stack.pop_back_val();
				Visited.insert(V);

				if (auto *G = dyn_cast<GlobalValue>(V->stripPointerCasts())) {
				if (UsedList.contains(G)) {
				continue;
				}
				}

				if (auto *I = dyn_cast<Instruction>(V)) {
				if (isKernelCC(I->getFunction())) {
				continue;
				}
				}

				if (auto *E = dyn_cast<ConstantExpr>(V)) {
				for (Value::user_iterator EU = E->user_begin(); EU != E->user_end();
				++EU) {
				if (Visited.insert(*EU).second) {
				Stack.push_back(*EU);
				}
				}
				continue;
				}

				// Unknown user, conservatively lower the variable
				return true;
				}

				return false;
				}

				static std::vector<GlobalVariable *>
				findVariablesToLower(Module &M,
				const SmallPtrSetImpl<GlobalValue *> &UsedList) {
				std::vector<llvm::GlobalVariable *> LocalVars;
				for (auto &GV : M.globals()) {
				if (GV.getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS) {
				continue;
				}
				if (!GV.hasInitializer()) {
				// addrspace(3) without initializer implies cuda/hip extern __shared__
				// the semantics for such a variable appears to be that all extern
				// __shared__ variables alias one another, in which case this transform
				// is not required
				continue;
				}
				if (!isa<UndefValue>(GV.getInitializer())) {
				// Initializers are unimplemented for local address space.
				// Leave such variables in place for consistent error reporting.
				continue;
				}
				if (GV.isConstant()) {
				// A constant undef variable can't be written to, and any load is
				// undef, so it should be eliminated by the optimizer. It could be
				// dropped by the back end if not. This pass skips over it.
				continue;
				}
				if (std::none_of(GV.user_begin(), GV.user_end(), [&](User *U) {
				return userRequiresLowering(UsedList, U);
				})) {
				continue;
				}
				LocalVars.push_back(&GV);
				}
				return LocalVars;
				}

				static void removeFromUsedList(Module &M, StringRef Name,
				SmallPtrSetImpl<Constant *> &ToRemove) {
				arsenm (Not Done): Needs a comment
				JonChesterfield (Author, Done): A comment saying what? The function and helper do what the name and parameter types claim they'll do in almost as boring a fashion as possible.
				arsenm (Not Done): On first read it sounds very generic, and not related to the special intrinsic global variables. I had to read the function to see what it actually did
				JonChesterfield (Author, Done): It is generic. If this lands, I'm hoping to move it under ModuleUtils to go alongside appendToUsed which it closely resembles. Probably as one entry point to remove a set/sequence of constants from llvm.used and a different entry point to remove them from llvm.compiler.used. As it's far from certain whether this will land, I don't want to propose a function for ModuleUtils with no users, as it'll be rightly rejected as dead code.
				arsenm (Not Done): Committing utilities before uses is not unheard of
				GlobalVariable *GV = M.getGlobalVariable(Name);
				if (!GV || ToRemove.empty()) {
				return;
				}

				SmallVector<Constant *, 16> Init;
				auto *CA = cast<ConstantArray>(GV->getInitializer());
				for (auto &Op : CA->operands()) {
				// ModuleUtils::appendToUsed only inserts Constants
				Constant *C = cast<Constant>(Op);
				if (!ToRemove.contains(C->stripPointerCasts())) {
				Init.push_back(C);
				}
				}

				if (Init.size() == CA->getNumOperands()) {
				return; // none to remove
				}

				GV->eraseFromParent();

				if (!Init.empty()) {
				ArrayType *ATy =
				ArrayType::get(Type::getInt8PtrTy(M.getContext()), Init.size());
				GV =
				new llvm::GlobalVariable(M, ATy, false, GlobalValue::AppendingLinkage,
				ConstantArray::get(ATy, Init), Name);
				GV->setSection("llvm.metadata");
				}
				}

				static void
				removeFromUsedLists(Module &M,
				arsenm (Not Done): Needs a comment
				const std::vector<GlobalVariable *> &LocalVars) {
				SmallPtrSet<Constant *, 32> LocalVarsSet;
				for (size_t I = 0; I < LocalVars.size(); I++) {
				if (Constant *C = dyn_cast<Constant>(LocalVars[I]->stripPointerCasts())) {
				hsmhsm (Not Done): Every GlobalVariable is a Constant (ref - https://llvm.org/doxygen/classllvm_1_1Constant.html). So why do we need dyn_cast<> and an if conditional check here? Can we not cast<> directly to Constant?
				JonChesterfield (Author, Done): We don't need dyn_cast here, cast is fine
				LocalVarsSet.insert(C);
				}
				}
				removeFromUsedList(M, "llvm.used", LocalVarsSet);
				removeFromUsedList(M, "llvm.compiler.used", LocalVarsSet);
				}

				static void markUsedByKernel(IRBuilder<> &Builder, Function *Func,
				GlobalVariable *SGV) {
				JonChesterfield (Author, Done): I quite like the donothing alternative to inline asm. It does indeed keep the use alive long enough. A future change to the pipeline might break that, but it'll do so fairly obviously (all the openmp stuff stops working, for one). I think we go with annotated donothing for now, and implement an intrinsic -> pseudo sequence when/if it becomes necessary. Written a fairly long comment to that effect in the source.
				arsenm (Not Done): But if there are no pre-existing uses of the LDS in the kernel, this won't end up getting allocated in the kernel
				JonChesterfield (Author, Done): If all uses of LDS are from a kernel, this pass does nothing. Otherwise:
				- every kernel gets a call to llvm.donothing (previously inline asm) that looks like a use of the per-module struct
				- every kernel allocates the size of the per-module struct, regardless of whether the llvm.donothing is present or not
				See the constructor AMDGPUMachineFunction::AMDGPUMachineFunction. If the symbol llvm.amdgcn.module.lds is present, allocateLDSGlobal is called on it, before any other calls to allocateLDSGlobal in order to reliably guess that the offset returned will be zero.
				// The llvm.amdgcn.module.lds instance is implicitly used by all kernels
				// that might call a function which accesses a field within it. This is
				// presently approximated to 'all kernels' if there are any such functions
				// in the module. This implicit use is reified as an explicit use here so
				// that later passes, specifically PromoteAlloca, account for the required
				// memory without any knowledge of this transform.

				// An operand bundle on llvm.donothing works because the call instruction
				// survives until after the last pass that needs to account for LDS. It is
				// better than inline asm as the latter survives until the end of codegen. A
				// totally robust solution would be a function with the same semantics as
				// llvm.donothing that takes a pointer to the instance and is lowered to a
				// no-op after LDS is allocated, but that is not presently necessary.

				LLVMContext &Ctx = Func->getContext();

				Builder.SetInsertPoint(Func->getEntryBlock().getFirstNonPHI());

				FunctionType *FTy = FunctionType::get(Type::getVoidTy(Ctx), {});

				Function *Decl =
				Intrinsic::getDeclaration(Func->getParent(), Intrinsic::donothing, {});

				Value *UseInstance[1] = {Builder.CreateInBoundsGEP(
				SGV->getValueType(), SGV, ConstantInt::get(Type::getInt32Ty(Ctx), 0))};

				Builder.CreateCall(FTy, Decl, {},
				{OperandBundleDefT<Value *>("ExplicitUse", UseInstance)},
				"");
				}

				static SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M) {
				SmallPtrSet<GlobalValue *, 32> UsedList;

				SmallVector<GlobalValue *, 32> TmpVec;
				collectUsedGlobalVariables(M, TmpVec, true);
				UsedList.insert(TmpVec.begin(), TmpVec.end());

				TmpVec.clear();
				collectUsedGlobalVariables(M, TmpVec, false);
				UsedList.insert(TmpVec.begin(), TmpVec.end());

				return UsedList;
				}

				JonChesterfield (Author, Done): collectUsedGlobalVariables was removed by a patch following D97128, fixing up.
				public:
				static char ID;

				AMDGPULowerModuleLDS() : ModulePass(ID) {
				initializeAMDGPULowerModuleLDSPass(*PassRegistry::getPassRegistry());
				}

				bool runOnModule(Module &M) override {
				LLVMContext &Ctx = M.getContext();
				const DataLayout &DL = M.getDataLayout();
				SmallPtrSet<GlobalValue *, 32> UsedList = getUsedList(M);

				// Find variables to move into new struct instance
				std::vector<GlobalVariable *> FoundLocalVars =
				findVariablesToLower(M, UsedList);

				if (FoundLocalVars.empty()) {
				// No variables to rewrite, no changes made.
				return false;
				}

				// Sort by alignment, descending, to minimise padding.
				// On ties, sort by size, descending, then by name, lexicographical.
				llvm::stable_sort(
				FoundLocalVars,
				[&](const GlobalVariable *LHS, const GlobalVariable *RHS) -> bool {
				Align ALHS = getAlign(DL, LHS);
				Align ARHS = getAlign(DL, RHS);
				if (ALHS != ARHS) {
				return ALHS > ARHS;
				}

				TypeSize SLHS = DL.getTypeAllocSize(LHS->getValueType());
				TypeSize SRHS = DL.getTypeAllocSize(RHS->getValueType());
				if (SLHS != SRHS) {
				return SLHS > SRHS;
				}

				// By variable name on tie for predictable order in test cases.
				return LHS->getName() < RHS->getName();
				});

				std::vector<GlobalVariable *> LocalVars;
				LocalVars.reserve(FoundLocalVars.size()); // will be at least this large
				{
				// This usually won't need to insert any padding, perhaps avoid the alloc
				uint64_t CurrentOffset = 0;
				for (size_t I = 0; I < FoundLocalVars.size(); I++) {
				GlobalVariable *FGV = FoundLocalVars[I];
				Align DataAlign = getAlign(DL, FGV);

				uint64_t DataAlignV = DataAlign.value();
				if (uint64_t Rem = CurrentOffset % DataAlignV) {
				uint64_t Padding = DataAlignV - Rem;

				// Append an array of padding bytes to meet alignment requested
				// Note (o + (a - (o % a)) ) % a == 0
				// (offset + Padding ) % align == 0

				Type *ATy = ArrayType::get(Type::getInt8Ty(Ctx), Padding);
				LocalVars.push_back(new GlobalVariable(
				M, ATy, false, GlobalValue::InternalLinkage, UndefValue::get(ATy),
				"", nullptr, GlobalValue::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS,
				false));
				CurrentOffset += Padding;
				}

				LocalVars.push_back(FGV);
				CurrentOffset += DL.getTypeAllocSize(FGV->getValueType());
				}
				hsmhsm (Not Done): Will the logic within the above for loop ensure that the first non-padding (real) member of the struct is accessed at the same address as the struct instance?
				JonChesterfield (Author, Done): it'll be at zero, so, yes
				}

				std::vector<Type *> LocalVarTypes;
				LocalVarTypes.reserve(LocalVars.size());
				std::transform(
				LocalVars.cbegin(), LocalVars.cend(), std::back_inserter(LocalVarTypes),
				[](const GlobalVariable *V) -> Type * { return V->getValueType(); });

				StructType *LDSTy = StructType::create(
				Ctx, LocalVarTypes, llvm::StringRef("llvm.amdgcn.module.lds.t"));

				Align MaxAlign = getAlign(DL, LocalVars[0]); // was sorted on alignment
				Constant *InstanceAddress = Constant::getIntegerValue(
				PointerType::get(LDSTy, AMDGPUAS::LOCAL_ADDRESS), APInt(32, 0));

				GlobalVariable *SGV = new GlobalVariable(
				M, LDSTy, false, GlobalValue::InternalLinkage, UndefValue::get(LDSTy),
				"llvm.amdgcn.module.lds", nullptr, GlobalValue::NotThreadLocal,
				AMDGPUAS::LOCAL_ADDRESS, false);
				SGV->setAlignment(MaxAlign);
				appendToCompilerUsed(
				M, {static_cast<GlobalValue *>(
				ConstantExpr::getPointerBitCastOrAddrSpaceCast(
				cast<Constant>(SGV), Type::getInt8PtrTy(Ctx)))});

				// The verifier rejects used lists containing an inttoptr of a constant
				// so remove the variables from these lists before replaceAllUsesWith
				removeFromUsedLists(M, LocalVars);

				// Replace uses of ith variable with a constantexpr to the ith field of the
				// instance that will be allocated by AMDGPUMachineFunction
				Type *I32 = Type::getInt32Ty(Ctx);
				for (size_t I = 0; I < LocalVars.size(); I++) {
				GlobalVariable *GV = LocalVars[I];
				Constant *GEPIdx[] = {ConstantInt::get(I32, 0), ConstantInt::get(I32, I)};
				GV->replaceAllUsesWith(
				ConstantExpr::getGetElementPtr(LDSTy, InstanceAddress, GEPIdx));
				GV->eraseFromParent();
				}

				// Mark kernels with asm that reads the address of the allocated structure
				// This is not necessary for lowering. This lets other passes, specifically
				// PromoteAlloca, accurately calculate how much LDS will be used by the
				// kernel after lowering.
				{
				IRBuilder<> Builder(Ctx);
				SmallPtrSet<Function *, 32> Kernels;
				for (auto &I : M.functions()) {
				Function *Func = &I;
				if (isKernelCC(Func) && !Kernels.contains(Func)) {
				markUsedByKernel(Builder, Func, SGV);
				Kernels.insert(Func);
				}
				}
				}
				return true;
				}
				};

				} // namespace
				char AMDGPULowerModuleLDS::ID = 0;

				char &llvm::AMDGPULowerModuleLDSID = AMDGPULowerModuleLDS::ID;

				INITIALIZE_PASS(AMDGPULowerModuleLDS, DEBUG_TYPE,
				"Lower uses of LDS variables from non-kernel functions", false,
				false)

				ModulePass *llvm::createAMDGPULowerModuleLDSPass() {
				return new AMDGPULowerModuleLDS();
				}

				PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M,
				ModuleAnalysisManager &) {
				return AMDGPULowerModuleLDS().runOnModule(M) ? PreservedAnalyses::none()
				: PreservedAnalyses::all();
				}
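
The ordering and padding logic in the pass above can be illustrated without any LLVM types. The following is a minimal sketch, not the pass itself: `Var`, `Field`, and `layoutFields` are hypothetical names introduced only for this illustration, and alignment and size are given directly in bytes instead of being queried from a `DataLayout`. It sorts by descending alignment, then descending size, then name, and inserts a padding field whenever the running offset is misaligned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an LDS global: only the properties the
// layout logic inspects.
struct Var {
  std::string Name;
  uint64_t Align; // bytes, non-zero
  uint64_t Size;  // alloc size in bytes
};

// One field of the resulting struct; Name is empty for padding.
struct Field {
  std::string Name;
  uint64_t Offset;
  uint64_t Size;
};

std::vector<Field> layoutFields(std::vector<Var> Vars) {
  // Same ordering as the comparator in the pass: alignment descending,
  // then size descending, then name for a predictable order.
  std::sort(Vars.begin(), Vars.end(), [](const Var &L, const Var &R) {
    if (L.Align != R.Align)
      return L.Align > R.Align;
    if (L.Size != R.Size)
      return L.Size > R.Size;
    return L.Name < R.Name;
  });

  std::vector<Field> Fields;
  uint64_t Offset = 0;
  for (const Var &V : Vars) {
    assert(V.Align != 0 && "alignment must be non-zero");
    if (uint64_t Rem = Offset % V.Align) {
      // (Offset + Padding) % Align == 0 after this adjustment.
      uint64_t Padding = V.Align - Rem;
      Fields.push_back({"", Offset, Padding});
      Offset += Padding;
    }
    Fields.push_back({V.Name, Offset, V.Size});
    Offset += V.Size;
  }
  return Fields;
}
```

Because the most strictly aligned variable sorts first and the offset starts at zero, the first field is never preceded by padding, which is the property asked about in the inline comment. Padding only appears when a variable's size is not a multiple of the next variable's alignment, for example two 4-byte variables that each request 8-byte alignment.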

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h

[...]	bool isMemoryBound() const {
return MemoryBound;		return MemoryBound;
}		}

bool needsWaveLimiter() const {		bool needsWaveLimiter() const {
return WaveLimiter;		return WaveLimiter;
}		}

unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalVariable &GV);		unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalVariable &GV);
		void allocateModuleLDSGlobal(const Module *M);

Align getDynLDSAlign() const { return DynLDSAlign; }		Align getDynLDSAlign() const { return DynLDSAlign; }

void setDynLDSAlign(const DataLayout &DL, const GlobalVariable &GV);		void setDynLDSAlign(const DataLayout &DL, const GlobalVariable &GV);
};		};

}		}
#endif		#endif

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp

[...]	AMDGPUMachineFunction::AMDGPUMachineFunction(const MachineFunction &MF)
WaveLimiter = WaveLimitAttr.isStringAttribute() &&		WaveLimiter = WaveLimitAttr.isStringAttribute() &&
WaveLimitAttr.getValueAsString() == "true";		WaveLimitAttr.getValueAsString() == "true";

CallingConv::ID CC = F.getCallingConv();		CallingConv::ID CC = F.getCallingConv();
if (CC == CallingConv::AMDGPU_KERNEL || CC == CallingConv::SPIR_KERNEL)		if (CC == CallingConv::AMDGPU_KERNEL || CC == CallingConv::SPIR_KERNEL)
ExplicitKernArgSize = ST.getExplicitKernArgSize(F, MaxKernArgAlign);		ExplicitKernArgSize = ST.getExplicitKernArgSize(F, MaxKernArgAlign);
}		}

unsigned AMDGPUMachineFunction::allocateLDSGlobal(const DataLayout &DL,		unsigned AMDGPUMachineFunction::allocateLDSGlobal(const DataLayout &DL,
		madhur13490: I think this name can be a bit more mangled. It is easy to have a lier in the file. Probably use a mechanism to randomly generate a string, use that as the name, and use the same random algorithm when dereferencing. That is too fancy, but a slightly more mangled name should be used.
		arsenm: Ah, I missed this part before. However, this isn't the right place for this. I've been trying to fix having code that depends on the function itself in the MachineFunctionInfo constructor. The point at which this is constructed isn't well defined (i.e. this is not a pass), so depending on whether you are running a MIR pass or something, this may not work as expected. It's a bit hacky, but you could stick this allocation in LowerFormalArguments since we don't really have a better pre-lowering hook.
		JonChesterfield: Ah. I was working on the basis that any instance of this class can be used to call allocateLDSGlobal, thus the constructor neatly catches every path. Bad assumption. If using inline asm, we could put the allocation shortly before the two existing uses of allocateLDSGlobal (from SDag and GlobalISel), as that ensures there will be at least one reference to an LDS global from the kernel. Changing to donothing breaks that, as the call can be removed beforehand, so kernels that don't have any direct LDS uses will miss the handling. LowerFormalArguments should work; will update to allocate it from there.
const GlobalVariable &GV) {		const GlobalVariable &GV) {
auto Entry = LocalMemoryObjects.insert(std::make_pair(&GV, 0));		auto Entry = LocalMemoryObjects.insert(std::make_pair(&GV, 0));
if (!Entry.second)		if (!Entry.second)
return Entry.first->second;		return Entry.first->second;

Align Alignment =		Align Alignment =
DL.getValueOrABITypeAlignment(GV.getAlign(), GV.getValueType());		DL.getValueOrABITypeAlignment(GV.getAlign(), GV.getValueType());

/// TODO: We should sort these to minimize wasted space due to alignment		/// TODO: We should sort these to minimize wasted space due to alignment
/// padding. Currently the padding is decided by the first encountered use		/// padding. Currently the padding is decided by the first encountered use
/// during lowering.		/// during lowering.
unsigned Offset = StaticLDSSize = alignTo(StaticLDSSize, Alignment);		unsigned Offset = StaticLDSSize = alignTo(StaticLDSSize, Alignment);

Entry.first->second = Offset;		Entry.first->second = Offset;
StaticLDSSize += DL.getTypeAllocSize(GV.getValueType());		StaticLDSSize += DL.getTypeAllocSize(GV.getValueType());

// Update the LDS size considering the padding to align the dynamic shared		// Update the LDS size considering the padding to align the dynamic shared
// memory.		// memory.
LDSSize = alignTo(StaticLDSSize, DynLDSAlign);		LDSSize = alignTo(StaticLDSSize, DynLDSAlign);

return Offset;		return Offset;
}		}

		void AMDGPUMachineFunction::allocateModuleLDSGlobal(const Module *M) {
		if (isModuleEntryFunction()) {
		GlobalVariable *GV = M->getGlobalVariable("llvm.amdgcn.module.lds");
		if (GV) {
		unsigned Offset = allocateLDSGlobal(M->getDataLayout(), *GV);
		(void)Offset;
		assert(Offset == 0 &&
		"Module LDS expected to be allocated before other LDS");
		}
		}
		}

void AMDGPUMachineFunction::setDynLDSAlign(const DataLayout &DL,		void AMDGPUMachineFunction::setDynLDSAlign(const DataLayout &DL,
const GlobalVariable &GV) {		const GlobalVariable &GV) {
assert(DL.getTypeAllocSize(GV.getValueType()).isZero());		assert(DL.getTypeAllocSize(GV.getValueType()).isZero());

Align Alignment =		Align Alignment =
DL.getValueOrABITypeAlignment(GV.getAlign(), GV.getValueType());		DL.getValueOrABITypeAlignment(GV.getAlign(), GV.getValueType());
if (Alignment <= DynLDSAlign)		if (Alignment <= DynLDSAlign)
return;		return;

LDSSize = alignTo(StaticLDSSize, Alignment);		LDSSize = alignTo(StaticLDSSize, Alignment);
DynLDSAlign = Alignment;		DynLDSAlign = Alignment;
}		}
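
The assertion in `allocateModuleLDSGlobal` relies on how `allocateLDSGlobal` hands out offsets: each new variable is placed at the current static size rounded up to its alignment, so whichever variable is allocated first lands at offset 0. A minimal sketch of that bookkeeping follows, under stated assumptions: `LDSAllocator` and `alignToPow2` are hypothetical names for this illustration, variables are keyed by name instead of by `GlobalVariable` pointer, and dynamic LDS alignment is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Round Value up to the next multiple of Align (a power of two).
uint64_t alignToPow2(uint64_t Value, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Value + Align - 1) & ~(Align - 1);
}

// Hypothetical, simplified model of the per-kernel static LDS bookkeeping.
struct LDSAllocator {
  std::map<std::string, uint64_t> Offsets;
  uint64_t StaticSize = 0;

  // Returns the offset of Name, allocating it on first use.
  uint64_t allocate(const std::string &Name, uint64_t Align, uint64_t Size) {
    auto It = Offsets.find(Name);
    if (It != Offsets.end())
      return It->second; // already allocated, reuse the offset

    uint64_t Offset = alignToPow2(StaticSize, Align);
    Offsets[Name] = Offset;
    StaticSize = Offset + Size;
    return Offset;
  }
};
```

In this model, allocating the module struct before anything else yields offset 0, which matches the constant-expression addresses the IR pass computes relative to a null base. Allocating any other LDS variable first would shift the struct and break that assumption, which is why the diff calls `allocateModuleLDSGlobal` at the start of `LowerFormalArguments`.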

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

[...]	public:
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace

char AMDGPUPromoteAlloca::ID = 0;		char AMDGPUPromoteAlloca::ID = 0;
char AMDGPUPromoteAllocaToVector::ID = 0;		char AMDGPUPromoteAllocaToVector::ID = 0;

INITIALIZE_PASS(AMDGPUPromoteAlloca, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(AMDGPUPromoteAlloca, DEBUG_TYPE,
		"AMDGPU promote alloca to vector or LDS", false, false)
		// Move LDS uses from functions to kernels before promote alloca for accurate
		// estimation of LDS available
		INITIALIZE_PASS_DEPENDENCY(AMDGPULowerModuleLDS)
		INITIALIZE_PASS_END(AMDGPUPromoteAlloca, DEBUG_TYPE,
"AMDGPU promote alloca to vector or LDS", false, false)		"AMDGPU promote alloca to vector or LDS", false, false)

INITIALIZE_PASS(AMDGPUPromoteAllocaToVector, DEBUG_TYPE "-to-vector",		INITIALIZE_PASS(AMDGPUPromoteAllocaToVector, DEBUG_TYPE "-to-vector",
"AMDGPU promote alloca to vector", false, false)		"AMDGPU promote alloca to vector", false, false)

char &llvm::AMDGPUPromoteAllocaID = AMDGPUPromoteAlloca::ID;		char &llvm::AMDGPUPromoteAllocaID = AMDGPUPromoteAlloca::ID;
char &llvm::AMDGPUPromoteAllocaToVectorID = AMDGPUPromoteAllocaToVector::ID;		char &llvm::AMDGPUPromoteAllocaToVectorID = AMDGPUPromoteAllocaToVector::ID;

bool AMDGPUPromoteAlloca::runOnFunction(Function &F) {		bool AMDGPUPromoteAlloca::runOnFunction(Function &F) {
[...]

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[...]	static cl::opt<bool> EnableScalarIRPasses(
cl::init(true),		cl::init(true),
cl::Hidden);		cl::Hidden);

static cl::opt<bool> EnableStructurizerWorkarounds(		static cl::opt<bool> EnableStructurizerWorkarounds(
"amdgpu-enable-structurizer-workarounds",		"amdgpu-enable-structurizer-workarounds",
cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),		cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),
cl::Hidden);		cl::Hidden);

		static cl::opt<bool>
		DisableLowerModuleLDS("amdgpu-disable-lower-module-lds", cl::Hidden,
		cl::desc("Disable lower module lds pass"),
		cl::init(false));

extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {		extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
// Register the target		// Register the target
RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());		RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());
RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());		RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());

PassRegistry *PR = PassRegistry::getPassRegistry();		PassRegistry *PR = PassRegistry::getPassRegistry();
initializeR600ClauseMergePassPass(*PR);		initializeR600ClauseMergePassPass(*PR);
initializeR600ControlFlowFinalizerPass(*PR);		initializeR600ControlFlowFinalizerPass(*PR);
[...]
initializeAMDGPUPreLegalizerCombinerPass(*PR);		initializeAMDGPUPreLegalizerCombinerPass(*PR);
initializeAMDGPURegBankCombinerPass(*PR);		initializeAMDGPURegBankCombinerPass(*PR);
initializeAMDGPUPromoteAllocaPass(*PR);		initializeAMDGPUPromoteAllocaPass(*PR);
initializeAMDGPUPromoteAllocaToVectorPass(*PR);		initializeAMDGPUPromoteAllocaToVectorPass(*PR);
initializeAMDGPUCodeGenPreparePass(*PR);		initializeAMDGPUCodeGenPreparePass(*PR);
initializeAMDGPULateCodeGenPreparePass(*PR);		initializeAMDGPULateCodeGenPreparePass(*PR);
initializeAMDGPUPropagateAttributesEarlyPass(*PR);		initializeAMDGPUPropagateAttributesEarlyPass(*PR);
initializeAMDGPUPropagateAttributesLatePass(*PR);		initializeAMDGPUPropagateAttributesLatePass(*PR);
		initializeAMDGPULowerModuleLDSPass(*PR);
initializeAMDGPURewriteOutArgumentsPass(*PR);		initializeAMDGPURewriteOutArgumentsPass(*PR);
initializeAMDGPUUnifyMetadataPass(*PR);		initializeAMDGPUUnifyMetadataPass(*PR);
initializeSIAnnotateControlFlowPass(*PR);		initializeSIAnnotateControlFlowPass(*PR);
initializeSIInsertHardClausesPass(*PR);		initializeSIInsertHardClausesPass(*PR);
initializeSIInsertWaitcntsPass(*PR);		initializeSIInsertWaitcntsPass(*PR);
initializeSIModeRegisterPass(*PR);		initializeSIModeRegisterPass(*PR);
initializeSIWholeQuadModePass(*PR);		initializeSIWholeQuadModePass(*PR);
initializeSILowerControlFlowPass(*PR);		initializeSILowerControlFlowPass(*PR);
[...]	PB.registerPipelineParsingCallback(
if (PassName == "amdgpu-printf-runtime-binding") {		if (PassName == "amdgpu-printf-runtime-binding") {
PM.addPass(AMDGPUPrintfRuntimeBindingPass());		PM.addPass(AMDGPUPrintfRuntimeBindingPass());
return true;		return true;
}		}
if (PassName == "amdgpu-always-inline") {		if (PassName == "amdgpu-always-inline") {
PM.addPass(AMDGPUAlwaysInlinePass());		PM.addPass(AMDGPUAlwaysInlinePass());
return true;		return true;
}		}
		if (PassName == "amdgpu-lower-module-lds") {
		PM.addPass(AMDGPULowerModuleLDSPass());
		aeubanks: This should always be added here regardless of the flag; this is just for registering that the pass exists. Rather, the pass should also be added in `registerCGSCCOptimizerLateEPCallback()` below, guarded by the flag.
		JonChesterfield: Welcome! Thank you very much for the input, there's been more guesswork in the new pass manager parts than I like. Can't seem to find any documentation on how it works. Dropped the test here, added to CGSCC. It's a module pass and the other things there were function passes, but instantiating a new ModulePassManager seems to work fine.
		aeubanks: I should definitely write some documentation somewhere. Adding it to a `ModulePassManager` you created doesn't do anything since it's not getting added to the overall pipeline. For example, the `FunctionPassManager` there is added to the `CGSCCPassManager`. For the legacy PM it doesn't really make sense to add a module pass at `EP_CGSCCOptimizerLate` since it'll end up breaking the CGSCC pipeline. Normally it runs the CGSCC passes and the function pipeline on each function in an SCC as it visits the SCCs, but with a module pass in the middle those will get split up. The new PM just makes this whole thing explicit via nesting when adding passes.
		JonChesterfield: Ah, I missed the PM.addPass at the end. createCGSCCToModulePassAdaptor doesn't appear to exist, and the types of createModuleToPostOrderCGSCCPassAdaptor suggest that's wrong too. What should I do with this pass then? It's not hugely crucial when it runs, provided it's before PromoteAlloca, which is a function pass in CGSCCOptimizerLate.
		aeubanks: Does `EP_ModuleOptimizerEarly`/`registerPipelineEarlySimplificationEPCallback()` work?
		JonChesterfield: Sure, moving it further forward is fine. What's the difference between registerPipeline and adjustPassManager's addExtension EP_ModuleOptimizerEarly? Do I want both?
		aeubanks: `adjustPassManager()` is for the legacy pass manager, and `registerPassBuilderCallbacks()` is for the new pass manager.
		JonChesterfield: Ah yes, thank you. Should be OK now. It will be a good day when we have one pass manager and drop this duplication.
		return true;
		}
return false;		return false;
});		});
PB.registerPipelineParsingCallback(		PB.registerPipelineParsingCallback(
[this](StringRef PassName, FunctionPassManager &PM,		[this](StringRef PassName, FunctionPassManager &PM,
ArrayRef<PassBuilder::PipelineElement>) {		ArrayRef<PassBuilder::PipelineElement>) {
if (PassName == "amdgpu-simplifylib") {		if (PassName == "amdgpu-simplifylib") {
PM.addPass(AMDGPUSimplifyLibCallsPass(*this));		PM.addPass(AMDGPUSimplifyLibCallsPass(*this));
return true;		return true;
[...]
if (PassName == "amdgpu-lower-kernel-attributes") {		if (PassName == "amdgpu-lower-kernel-attributes") {
PM.addPass(AMDGPULowerKernelAttributesPass());		PM.addPass(AMDGPULowerKernelAttributesPass());
return true;		return true;
}		}
if (PassName == "amdgpu-propagate-attributes-early") {		if (PassName == "amdgpu-propagate-attributes-early") {
PM.addPass(AMDGPUPropagateAttributesEarlyPass(*this));		PM.addPass(AMDGPUPropagateAttributesEarlyPass(*this));
return true;		return true;
}		}

return false;		return false;
});		});

PB.registerAnalysisRegistrationCallback([](FunctionAnalysisManager &FAM) {		PB.registerAnalysisRegistrationCallback([](FunctionAnalysisManager &FAM) {
FAM.registerPass([&] { return AMDGPUAA(); });		FAM.registerPass([&] { return AMDGPUAA(); });
});		});

PB.registerParseAACallback([](StringRef AAName, AAManager &AAM) {		PB.registerParseAACallback([](StringRef AAName, AAManager &AAM) {
[...]	void AMDGPUPassConfig::addIRPasses() {

// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.		// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.
if (TM.getTargetTriple().getArch() == Triple::r600)		if (TM.getTargetTriple().getArch() == Triple::r600)
addPass(createR600OpenCLImageTypeLoweringPass());		addPass(createR600OpenCLImageTypeLoweringPass());

// Replace OpenCL enqueued block function pointers with global variables.		// Replace OpenCL enqueued block function pointers with global variables.
addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());		addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());

		// Can increase LDS used by kernel so runs before PromoteAlloca
		if (!DisableLowerModuleLDS)
		addPass(createAMDGPULowerModuleLDSPass());

if (TM.getOptLevel() > CodeGenOpt::None) {		if (TM.getOptLevel() > CodeGenOpt::None) {
addPass(createInferAddressSpacesPass());		addPass(createInferAddressSpacesPass());
addPass(createAMDGPUPromoteAlloca());		addPass(createAMDGPUPromoteAlloca());

if (EnableSROA)		if (EnableSROA)
addPass(createSROAPass());		addPass(createSROAPass());

if (EnableScalarIRPasses)		if (EnableScalarIRPasses)
[...]

llvm/lib/Target/AMDGPU/CMakeLists.txt

[...]	add_llvm_target(AMDGPUCodeGen
AMDGPUGlobalISelUtils.cpp		AMDGPUGlobalISelUtils.cpp
AMDGPULateCodeGenPrepare.cpp		AMDGPULateCodeGenPrepare.cpp
AMDGPULegalizerInfo.cpp		AMDGPULegalizerInfo.cpp
AMDGPULibCalls.cpp		AMDGPULibCalls.cpp
AMDGPULibFunc.cpp		AMDGPULibFunc.cpp
AMDGPULowerIntrinsics.cpp		AMDGPULowerIntrinsics.cpp
AMDGPULowerKernelArguments.cpp		AMDGPULowerKernelArguments.cpp
AMDGPULowerKernelAttributes.cpp		AMDGPULowerKernelAttributes.cpp
		AMDGPULowerModuleLDSPass.cpp
AMDGPUMachineCFGStructurizer.cpp		AMDGPUMachineCFGStructurizer.cpp
AMDGPUMachineFunction.cpp		AMDGPUMachineFunction.cpp
AMDGPUMachineModuleInfo.cpp		AMDGPUMachineModuleInfo.cpp
AMDGPUMacroFusion.cpp		AMDGPUMacroFusion.cpp
AMDGPUMCInstLower.cpp		AMDGPUMCInstLower.cpp
AMDGPUMIRFormatter.cpp		AMDGPUMIRFormatter.cpp
AMDGPUOpenCLEnqueuedBlockLowering.cpp		AMDGPUOpenCLEnqueuedBlockLowering.cpp
AMDGPUPostLegalizerCombiner.cpp		AMDGPUPostLegalizerCombiner.cpp
[...]

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

[...]	SDValue SITargetLowering::LowerFormalArguments(

if (Subtarget->isAmdHsaOS() && AMDGPU::isGraphics(CallConv)) {		if (Subtarget->isAmdHsaOS() && AMDGPU::isGraphics(CallConv)) {
DiagnosticInfoUnsupported NoGraphicsHSA(		DiagnosticInfoUnsupported NoGraphicsHSA(
Fn, "unsupported non-compute shaders with HSA", DL.getDebugLoc());		Fn, "unsupported non-compute shaders with HSA", DL.getDebugLoc());
DAG.getContext()->diagnose(NoGraphicsHSA);		DAG.getContext()->diagnose(NoGraphicsHSA);
return DAG.getEntryNode();		return DAG.getEntryNode();
}		}

		Info->allocateModuleLDSGlobal(Fn.getParent());

SmallVector<ISD::InputArg, 16> Splits;		SmallVector<ISD::InputArg, 16> Splits;
SmallVector<CCValAssign, 16> ArgLocs;		SmallVector<CCValAssign, 16> ArgLocs;
BitVector Skipped(Ins.size());		BitVector Skipped(Ins.size());
CCState CCInfo(CallConv, isVarArg, DAG.getMachineFunction(), ArgLocs,		CCState CCInfo(CallConv, isVarArg, DAG.getMachineFunction(), ArgLocs,
*DAG.getContext());		*DAG.getContext());

bool IsGraphics = AMDGPU::isGraphics(CallConv);		bool IsGraphics = AMDGPU::isGraphics(CallConv);
bool IsKernel = AMDGPU::isKernel(CallConv);		bool IsKernel = AMDGPU::isKernel(CallConv);
[...]

llvm/test/CodeGen/AMDGPU/GlobalISel/lds-global-non-entry-func.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -o - %s 2> %t | FileCheck --check-prefix=GFX8 %s			; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -amdgpu-disable-lower-module-lds=true -o - %s 2> %t | FileCheck --check-prefix=GFX8 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - %s 2> %t | FileCheck --check-prefix=GFX9 %s			; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-disable-lower-module-lds=true -o - %s 2> %t | FileCheck --check-prefix=GFX9 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	@lds = internal addrspace(3) global float undef, align 4			@lds = internal addrspace(3) global float undef, align 4

	; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function			; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function
	define void @func_use_lds_global() {			define void @func_use_lds_global() {
	; GFX8-LABEL: func_use_lds_global:			; GFX8-LABEL: func_use_lds_global:
	; GFX8: ; %bb.0:			; GFX8: ; %bb.0:
	[...]

llvm/test/CodeGen/AMDGPU/addrspacecast-initializer-unsupported.ll

	; RUN: not --crash llc -march=amdgcn -verify-machineinstrs < %s 2>&1 | FileCheck -check-prefix=ERROR %s			; RUN: not --crash llc -march=amdgcn -verify-machineinstrs -amdgpu-disable-lower-module-lds=true < %s 2>&1 | FileCheck -check-prefix=ERROR %s

	; ERROR: LLVM ERROR: Unsupported expression in static initializer: addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*)			; ERROR: LLVM ERROR: Unsupported expression in static initializer: addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*)

	@lds.arr = unnamed_addr addrspace(3) global [256 x i32] undef, align 4			@lds.arr = unnamed_addr addrspace(3) global [256 x i32] undef, align 4

	@gv_flatptr_from_lds = unnamed_addr addrspace(2) global i32 addrspace(4)* getelementptr ([256 x i32], [256 x i32] addrspace(4)* addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*), i64 0, i64 8), align 4			@gv_flatptr_from_lds = unnamed_addr addrspace(2) global i32 addrspace(4)* getelementptr ([256 x i32], [256 x i32] addrspace(4)* addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*), i64 0, i64 8), align 4

llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -o - %s 2> %t | FileCheck -check-prefixes=GCN,GFX8 %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -o - -amdgpu-disable-lower-module-lds=true %s 2> %t | FileCheck -check-prefixes=GCN,GFX8 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - %s 2> %t | FileCheck -check-prefixes=GCN,GFX9 %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - -amdgpu-disable-lower-module-lds=true %s 2> %t | FileCheck -check-prefixes=GCN,GFX9 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	@lds = internal addrspace(3) global float undef, align 4			@lds = internal addrspace(3) global float undef, align 4

	; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function			; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function
	define void @func_use_lds_global() {			define void @func_use_lds_global() {
	; GFX8-LABEL: func_use_lds_global:			; GFX8-LABEL: func_use_lds_global:
	; GFX8: ; %bb.0:			; GFX8: ; %bb.0:
	[...]

llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

				; CHECK: %llvm.amdgcn.module.lds.t = type { float, float }

				@func = addrspace(3) global float undef, align 4

				; @kern is only used from a kernel so it is left unchanged
				; CHECK: @kern = addrspace(3) global float undef, align 4
				@kern = addrspace(3) global float undef, align 4

				; @func is only used from a non-kernel function so is rewritten
				; CHECK-NOT: @func
				; @both is used from a non-kernel function so is rewritten
				; CHECK-NOT: @both
				; sorted both < func, so @both at null and @func at 4
				@both = addrspace(3) global float undef, align 4

				; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 4

				; CHECK-LABEL: @get_func()
				; CHECK: %0 = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 1) to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 1) to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				define i32 @get_func() local_unnamed_addr #0 {
				entry:
				%0 = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @func to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @func to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret i32 %0
				}

				; CHECK-LABEL: @set_func(i32 %x)
				; CHECK: store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* null to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* null to i32*) to i64)) to i32*), align 4
				define void @set_func(i32 %x) local_unnamed_addr #1 {
				entry:
				store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @both to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @both to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

				; CHECK-LABEL: @timestwo()
				; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				; CHECK: %ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* null to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				; CHECK: %mul = mul i32 %ld, 2
				; CHECK: store i32 %mul, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* null to i32*) to i64)) to i32*), align 4
				define amdgpu_kernel void @timestwo() {
				%ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @both to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				%mul = mul i32 %ld, 2
				store i32 %mul, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @both to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

				; Variables that are not lowered by this pass are left unchanged
				; CHECK-NOT: asm
				; CHECK-NOT: llvm.amdgcn.module.lds
				; CHECK-NOT: llvm.amdgcn.module.lds.t

				; var1, var2 would be transformed were they used from a non-kernel function
				; CHECK: @var1 = addrspace(3) global i32 undef
				; CHECK: @var2 = addrspace(3) global float undef
				@var1 = addrspace(3) global i32 undef
				@var2 = addrspace(3) global float undef

				; constant variables are left to the optimizer / error diagnostics
				; CHECK: @const_undef = addrspace(3) constant i32 undef
				; CHECK: @const_with_init = addrspace(3) constant i64 8
				@const_undef = addrspace(3) constant i32 undef
				@const_with_init = addrspace(3) constant i64 8

				; External and constant are both left to the optimizer / error diagnostics
				; CHECK: @extern = external addrspace(3) global i32
				@extern = external addrspace(3) global i32

				; Use of an addrspace(3) variable with an initializer is skipped,
				; so as to preserve the unimplemented error from llc
				; CHECK: @with_init = addrspace(3) global i64 0
				@with_init = addrspace(3) global i64 0

				; Only local addrspace variables are transformed
				; CHECK: @addr4 = addrspace(4) global i64 undef
				@addr4 = addrspace(4) global i64 undef

				; Assign to self is treated as any other initializer, i.e. ignored by this pass
				; CHECK: @toself = addrspace(3) global float addrspace(3)* bitcast (float addrspace(3)* addrspace(3)* @toself to float addrspace(3)*), align 8
				@toself = addrspace(3) global float addrspace(3)* bitcast (float addrspace(3)* addrspace(3)* @toself to float addrspace(3)*), align 8

				; Use by .used lists doesn't trigger lowering
				; CHECK: @llvm.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (i32 addrspace(3)* @var1 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				@llvm.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (i32 addrspace(3)* @var1 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"

				; CHECK: @llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @var2 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				@llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @var2 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"

				; Access from a function would cause lowering for non-excluded cases
				; CHECK-LABEL: @use_variables()
				; CHECK: %c0 = load i32, i32 addrspace(3)* @const_undef, align 4
				; CHECK: %c1 = load i64, i64 addrspace(3)* @const_with_init, align 4
				; CHECK: %v0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 seq_cst
				; CHECK: %v1 = cmpxchg i32 addrspace(3)* @extern, i32 4, i32 %c0 acq_rel monotonic
				; CHECK: %v2 = atomicrmw add i64 addrspace(4)* @addr4, i64 %c1 monotonic
				define void @use_variables() {
				%c0 = load i32, i32 addrspace(3)* @const_undef, align 4
				%c1 = load i64, i64 addrspace(3)* @const_with_init, align 4
				%v0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 seq_cst
				%v1 = cmpxchg i32 addrspace(3)* @extern, i32 4, i32 %c0 acq_rel monotonic
				%v2 = atomicrmw add i64 addrspace(4)* @addr4, i64 %c1 monotonic
				ret void
				}

				; Use by kernel doesn't trigger lowering
				; CHECK-LABEL: @kern_use()
				; CHECK: %inc = atomicrmw add i32 addrspace(3)* @var1, i32 1 monotonic
				define amdgpu_kernel void @kern_use() {
				%inc = atomicrmw add i32 addrspace(3)* @var1, i32 1 monotonic
				call void @use_variables()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lower-module-lds-indirect.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

				; CHECK: %llvm.amdgcn.module.lds.t = type { double, float }

				; CHECK: @function_indirect = addrspace(1) global float* addrspacecast (float addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 1) to float*), align 8

				; CHECK: @kernel_indirect = addrspace(1) global double* addrspacecast (double addrspace(3)* null to double*), align 8

				; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 8

				@function_target = addrspace(3) global float undef, align 4
				@function_indirect = addrspace(1) global float* addrspacecast (float addrspace(3)* @function_target to float*), align 8

				@kernel_target = addrspace(3) global double undef, align 8
				@kernel_indirect = addrspace(1) global double* addrspacecast (double addrspace(3)* @kernel_target to double*), align 8

				; CHECK-LABEL: @function(float %x)
				; CHECK: %0 = load float*, float* addrspace(1)* @function_indirect, align 8
				define void @function(float %x) local_unnamed_addr #5 {
				entry:
				%0 = load float*, float* addrspace(1)* @function_indirect, align 8
				store float %x, float* %0, align 4
				ret void
				}

				; CHECK-LABEL: @kernel(double %x)
				; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				; CHECK: %0 = load double*, double* addrspace(1)* @kernel_indirect, align 8
				define amdgpu_kernel void @kernel(double %x) local_unnamed_addr #5 {
				entry:
				%0 = load double*, double* addrspace(1)* @kernel_indirect, align 8
				store double %x, double* %0, align 8
				ret void
				}

llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

				; Check new struct is added to compiler.used and that the replaced variable is removed

				; CHECK: %llvm.amdgcn.module.lds.t = type { float }
				; CHECK: @ignored = addrspace(1) global i64 0
				; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 8

				; CHECK-NOT: @tolower

				@tolower = addrspace(3) global float undef, align 8

				; A variable that is unchanged by pass
				@ignored = addrspace(1) global i64 0


				; @ignored still in list, @tolower removed, llvm.amdgcn.module.lds appended
				; Start with one value to replace and one to ignore in the .use list

				; @ignored still in list, @tolower removed
				; CHECK: @llvm.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored to i8 addrspace(1)*) to i8*)], section "llvm.metadata"

				@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @tolower to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored to i8 addrspace(1)*) to i8*)], section "llvm.metadata"

				; @ignored still in list, @tolower removed, llvm.amdgcn.module.lds appended
				; CHECK: @llvm.compiler.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored to i8 addrspace(1)*) to i8*), i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata"

				@llvm.compiler.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @tolower to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored to i8 addrspace(1)*) to i8*)], section "llvm.metadata"

				; CHECK-LABEL: @func()
				; CHECK: %dec = atomicrmw fsub float addrspace(3)* null, float 1.0
				define void @func() {
				%dec = atomicrmw fsub float addrspace(3)* @tolower, float 1.0 monotonic
				%unused0 = atomicrmw add i64 addrspace(1)* @ignored, i64 1 monotonic
				ret void
				}

llvm/test/CodeGen/AMDGPU/lower-module-lds.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

				; Padding to meet alignment, so references to @var1 replaced with gep ptr, 0, 2
				; No i64 as addrspace(3) types with initializers are ignored. Likewise no addrspace(4).
				; CHECK: %llvm.amdgcn.module.lds.t = type { float, [4 x i8], i32 }

				; Variables removed by pass
				; CHECK-NOT: @var0
				; CHECK-NOT: @var1

				@var0 = addrspace(3) global float undef, align 8
				@var1 = addrspace(3) global i32 undef, align 8

				@ptr = addrspace(1) global i32 addrspace(3)* @var1, align 4

				; A variable that is unchanged by pass
				; CHECK: @with_init = addrspace(3) global i64 0
				@with_init = addrspace(3) global i64 0

				; Instance of new type, aligned to max of element alignment
				; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 8

				; Use in func rewritten to access struct at address zero, which prints as null
				; CHECK-LABEL: @func()
				; CHECK: %dec = atomicrmw fsub float addrspace(3)* null, float 1.0
				; CHECK: %val0 = load i32, i32 addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 2), align 4
				; CHECK: %val1 = add i32 %val0, 4
				; CHECK: store i32 %val1, i32 addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 2), align 4
				; CHECK: %unused0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 monotonic
				define void @func() {
				%dec = atomicrmw fsub float addrspace(3)* @var0, float 1.0 monotonic
				%val0 = load i32, i32 addrspace(3)* @var1, align 4
				%val1 = add i32 %val0, 4
				store i32 %val1, i32 addrspace(3)* @var1, align 4
				%unused0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 monotonic
				ret void
				}

				; This kernel calls a function that uses LDS so needs the block
				; CHECK-LABEL: @kern_call()
				; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				; CHECK: call void @func()
				; CHECK: %dec = atomicrmw fsub float addrspace(3)* null, float 2.0
				define amdgpu_kernel void @kern_call() {
				call void @func()
				%dec = atomicrmw fsub float addrspace(3)* @var0, float 2.0 monotonic
				ret void
				}

				; This kernel makes no calls, so it does not strictly need the LDS block;
				; the pass as tested here still marks the block as used in every kernel
				; CHECK-LABEL: @kern_empty()
				; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				define spir_kernel void @kern_empty() {
				ret void
				}
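
The `{ float, [4 x i8], i32 }` layout checked at the top of this test can be reproduced by hand. The following is a minimal sketch of that arithmetic, not the code in AMDGPULowerModuleLDSPass.cpp: the names `LdsVar`, `Field`, `alignTo` and `layoutLds` are hypothetical, and it assumes variables are placed in declaration order with explicit padding inserted before any field whose offset is not already a multiple of its alignment.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one LDS variable: name, size and alignment in bytes.
struct LdsVar {
  std::string name;
  uint64_t size;
  uint64_t align;
};

// One field of the packed struct; padding fields carry an empty name.
struct Field {
  std::string name;
  uint64_t offset;
  uint64_t size;
};

// Round offset up to the next multiple of align (align must be a power of two).
inline uint64_t alignTo(uint64_t offset, uint64_t align) {
  return (offset + align - 1) & ~(align - 1);
}

// Assumption: variables are laid out in the order given, with a padding
// field emitted whenever the running offset is not suitably aligned.
inline std::vector<Field> layoutLds(const std::vector<LdsVar> &vars) {
  std::vector<Field> fields;
  uint64_t offset = 0;
  for (const LdsVar &v : vars) {
    uint64_t aligned = alignTo(offset, v.align);
    if (aligned != offset)
      fields.push_back({"", offset, aligned - offset}); // e.g. [4 x i8]
    fields.push_back({v.name, aligned, v.size});
    offset = aligned + v.size;
  }
  return fields;
}
```

For `@var0` (float, align 8) and `@var1` (i32, align 8), `layoutLds({{"var0", 4, 8}, {"var1", 4, 8}})` yields three fields: `var0` at offset 0, four bytes of padding at offset 4, and `var1` at offset 8. That places `@var1` at field index 2, which is why its uses are rewritten to `getelementptr ... i32 0, i32 2`.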

llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll

	; RUN: opt -S -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s
-	; RUN: llc -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck -check-prefix=ASM %s
+	; RUN: llc -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-disable-lower-module-lds=true < %s | FileCheck -check-prefix=ASM %s

	target datalayout = "A5"

	@all_lds = internal unnamed_addr addrspace(3) global [16384 x i32] undef, align 4
	@some_lds = internal unnamed_addr addrspace(3) global [32 x i32] undef, align 4

	@initializer_user_some = addrspace(1) global i32 ptrtoint ([32 x i32] addrspace(3)* @some_lds to i32), align 4
	@initializer_user_all = addrspace(1) global i32 ptrtoint ([16384 x i32] addrspace(3)* @all_lds to i32), align 4

Diff 330669
