This is an archive of the discontinued LLVM Phabricator instance.

Differential D91516

[AMDGPU] Replace uses of LDS globals within non-kernel functions by pointers.
AbandonedPublic

Authored by hsmhsm on Nov 15 2020, 11:09 PM.

Details

Reviewers

b-sumner
t-tye
arsenm
yaxunl
jdoerfert
madhur13490
sameerds
rampitec
JonChesterfield

Summary

One of the memory types supported within the AMD GPU memory hierarchy is
shared memory, also called Local Data Share, or LDS for short. LDS is the
second fastest memory in the hierarchy (the register file being the
fastest). Being fast also means that LDS is comparatively costly, and hence
a limited resource.

Being globally scoped, an LDS variable is accessible within both kernel and
non-kernel functions. However, two different kernel execution paths, say
those starting from kernels K1 and K2, cannot access the same instance of an
LDS variable, say L: K1 and K2 must each own their own instance of L. This
poses some challenges, especially when lowering LDS variables used within
non-kernel functions.

The pass "Lower Module LDS" therefore lowers the LDS globals by packing them
within a struct type and creating an instance of that struct type within
every kernel at address zero. Though the pass makes some effort to minimize
unnecessary LDS allocation, it is limited by the fundamental assumptions upon
which it is implemented.

The current pass acts as an aid to the pass "Lower Module LDS", with the
intention of minimizing unnecessary LDS allocation as much as possible.

The main idea behind the current pass is:

(1) To identify the LDS globals used within non-kernel function scope and
global scope.
(2) To push the use of all the above identified LDS globals to kernel
function scope by initializing their addresses to newly created LDS
global pointer variables (within kernel functions).
(3) To replace the uses of original LDS globals within non-kernel functions
by their pointer counterparts.
(4) This way, the transformation makes sure that the pass "Lower Module LDS"
packs only pointer variables within the struct type, and hence significantly
reduces unnecessary LDS allocation, especially when the original LDS
globals are big arrays (as this is the common LDS use case).
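
To make the intended saving concrete, here is a minimal C++ sketch that models per-kernel LDS usage before and after the transformation. It is not taken from the patch: `LdsGlobal`, `bytesPerKernelPacked`, `bytesPerKernelWithPointers` and the 2-byte pointer size are illustrative assumptions, and alignment padding is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one LDS global: a name and its size in bytes.
struct LdsGlobal {
  std::string name;
  std::uint32_t sizeBytes;
};

// Assumed size of one LDS pointer variable, in bytes (an assumption of this
// sketch; padding and alignment are ignored).
constexpr std::uint32_t kPointerBytes = 2;

// Without the transformation: every kernel gets a struct instance holding
// every LDS global that is used from non-kernel functions.
std::uint32_t bytesPerKernelPacked(const std::vector<LdsGlobal> &globals) {
  std::uint32_t total = 0;
  for (const LdsGlobal &g : globals) {
    assert(total + g.sizeBytes >= total && "size overflow");
    total += g.sizeBytes;
  }
  return total;
}

// With the transformation: every kernel holds one pointer per such global,
// and pays for the global itself only if the kernel can actually reach it.
std::uint32_t
bytesPerKernelWithPointers(const std::vector<LdsGlobal> &globals,
                           const std::set<std::string> &reachable) {
  std::uint32_t total = 0;
  for (const LdsGlobal &g : globals) {
    total += kPointerBytes;
    if (reachable.count(g.name) != 0)
      total += g.sizeBytes;
  }
  return total;
}
```

Under these assumptions, a kernel that reaches none of the big arrays pays only for the pointers, while a kernel that reaches one array pays for that array plus the pointers.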

NOTE: The pass "Lower Module LDS" now has a tight dependency on the current pass, and the current pass should always be run before the pass "Lower Module LDS". Running the pass "Lower Module LDS" alone may lead to surprising results.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	3,470 ms	x64 debian > libFuzzer.libFuzzer::entropic-scale-per-exec-time.test

Event Timeline

JonChesterfield mentioned this in D94648: [amdgpu] Implement lower function LDS pass.Jan 15 2021, 11:47 AM

[0] Started re-implementing from scratch again.
[1] Added a new pass, namely, amdgpu-lower-function-local-lds.
[2] Implemented required initial plumbing work for both old and new pass managers.
[3] An option, namely, amdgpu-enable-function-local-lds-lowering is added, when passed, it enables the pass.

hsmhsm retitled this revision from [AMDGPU] Support for device scope shared variables to [AMDGPU][WIP] Lower Function Local LDS Variables..Jan 19 2021, 2:27 AM

hsmhsm edited the summary of this revision. (Show Details)

Started to implement the feature from scratch again. Previous experience tells me that a single very big patch is problematic and confusing for a meaningful review process. Hence, this time I am planning to submit small patches from time to time that can be reasonably reviewed. This first patch implements the following.

[1] Add new pass, namely, amdgpu-lower-function-local-lds.
[2] Implement required initial plumbing work for both old and new pass managers.
[3] Add an option, namely, amdgpu-enable-function-local-lds-lowering, when passed, it enables the pass.

Harbormaster completed remote builds in B85686: Diff 317500.Jan 19 2021, 3:29 AM

You can just use the done checkbox, you don't need to comment on each point

llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
1510 ↗	(On Diff #313659)	You should only try to preserve things that are important, otherwise you are adding cost and complexity for no benefit
1512 ↗	(On Diff #313659)	The IR is language independent and none of the constructs here are tied to a language
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
38	This isn't a user exposed flag, and there shouldn't be a need for users to set one.

hsmhsm marked 3 inline comments as done.Jan 19 2021, 7:31 PM

Based on Matt's comment on the previous patch, some changes are made to the handling of the guarding flag amdgpu-enable-function-local-lds-lowering.

Though Matt is against the use of any guarding flag for this pass, I personally feel the need for it for the two reasons below.

(1) The presence of this pass means we should disable the forceful inlining done within the pass amdgpu-always-inline. Otherwise, this pass does not make any sense at all. It is a better idea to disable this forceful inlining via a flag.

(2) In case of any emergency issue within this pass, the customer should have a handy way to disable the pass and temporarily move on until a fix is available.

So amdgpu-enable-function-local-lds-lowering is a hidden flag, and it is enabled by default. It works as below:

(1) The default behavior is to run the pass, as shown below.

Old pass manager:

mahesha@brego:[tmp]$ hipcc main.cpp
Running the pass - LowerFunctionLocalLDS
mahesha@brego:[tmp]$

New pass manager:

mahesha@brego:[tmp]$ hipcc -fexperimental-new-pass-manager main.cpp
Running the pass - LowerFunctionLocalLDS
mahesha@brego:[tmp]$

(2) The pass will not run when it is explicitly turned off as shown below.

Old pass manager:

mahesha@brego:[tmp]$ hipcc -mllvm --amdgpu-enable-function-local-lds-lowering=false main.cpp
mahesha@brego:[tmp]$

New pass manager:

mahesha@brego:[tmp]$ hipcc -fexperimental-new-pass-manager  -mllvm --amdgpu-enable-function-local-lds-lowering=false main.cpp
mahesha@brego:[tmp]$

hsmhsm edited the summary of this revision. (Show Details)Jan 19 2021, 9:44 PM

Harbormaster completed remote builds in B85823: Diff 317763.Jan 19 2021, 9:49 PM

Build all the required data structures which will be later used to lower function local LDS. Below are the data structures being built.

[1] Kernel Set - Holds all the kernels in the module
[2] Function Local LDS Set - Holds all the function local LDS from all functions
[3] Function Address Taken Set - Holds all the functions whose address is taken within the module

[4] LDS to Function Map - Maps each function local LDS to a function within which the LDS is defined
[5] Function to LDS Map - Reverse of above map, which maps each function F to a SET of LDS which are defined within F

[6] Kernel to Callee Map - Maps each kernel K to a SET of functions which define LDS and for which there exists a call graph path from K.
[7] Kernel to LDS - Maps each kernel K to a set of function local LDS which are supposed to be lowered w.r.t K.

Data structures [1], [2], and [3] are built by iterating over the globals and functions defined within the module.
Data structures [4] and [5] are built using BOTTOM-UP based on the use list of function local LDS.
Data structure [6] is built using TOP-DOWN via call graph traversal.
Data structure [7] is built using the result of above BOTTOM-UP and TOP-DOWN constructed data structures.
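
As an illustration of how data structure [7] could be derived from the others, here is a minimal C++ sketch using plain standard containers rather than the LLVM classes the patch actually uses. The names `CallGraph`, `FunctionToLds`, `reachableFrom` and `buildKernelToLds` are hypothetical and are not claimed to match the patch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using Name = std::string;
// Hypothetical stand-ins for the real structures:
using CallGraph = std::map<Name, std::vector<Name>>;  // caller -> callees
using FunctionToLds = std::map<Name, std::set<Name>>; // function -> LDS it uses

// TOP-DOWN step: collect every function reachable from `kernel`.
std::set<Name> reachableFrom(const CallGraph &cg, const Name &kernel) {
  std::set<Name> seen;
  std::vector<Name> worklist{kernel};
  while (!worklist.empty()) {
    Name f = worklist.back();
    worklist.pop_back();
    if (!seen.insert(f).second)
      continue; // already visited
    auto it = cg.find(f);
    if (it == cg.end())
      continue; // leaf function, no recorded callees
    for (const Name &callee : it->second)
      worklist.push_back(callee);
  }
  seen.erase(kernel); // only callees are of interest here
  return seen;
}

// Combine the reachable set with the (BOTTOM-UP built) function-to-LDS map
// to get, per kernel, the LDS globals to lower with respect to that kernel.
std::map<Name, std::set<Name>> buildKernelToLds(const std::set<Name> &kernels,
                                                const CallGraph &cg,
                                                const FunctionToLds &funcToLds) {
  std::map<Name, std::set<Name>> kernelToLds;
  for (const Name &k : kernels) {
    assert(!k.empty() && "kernel name must be non-empty");
    std::set<Name> &lds = kernelToLds[k];
    for (const Name &f : reachableFrom(cg, k)) {
      auto it = funcToLds.find(f);
      if (it != funcToLds.end())
        lds.insert(it->second.begin(), it->second.end());
    }
  }
  return kernelToLds;
}
```

The sketch only handles direct call edges that are present in the map; indirect calls would need extra handling that is out of scope here.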

hsmhsm added reviewers: madhur13490, sameerds.Jan 22 2021, 8:03 PM

Add missing "static" keyword to a function isKernel().

Added a FIXME comment.

Harbormaster completed remote builds in B86390: Diff 318724.Jan 22 2021, 8:55 PM

Harbormaster completed remote builds in B86389: Diff 318723.

Add missing explicit keyword for constructor.

hsmhsm edited the summary of this revision. (Show Details)Jan 22 2021, 9:46 PM

hsmhsm edited the summary of this revision. (Show Details)

Fix a few spelling mistakes in comments.

Fix comments.

Harbormaster completed remote builds in B86394: Diff 318730.Jan 22 2021, 10:24 PM

Harbormaster completed remote builds in B86395: Diff 318731.Jan 22 2021, 10:34 PM

Harbormaster completed remote builds in B86396: Diff 318732.Jan 22 2021, 10:38 PM

Harbormaster completed remote builds in B86397: Diff 318733.Jan 22 2021, 10:49 PM

Harbormaster completed remote builds in B86391: Diff 318726.Jan 22 2021, 10:57 PM

Re-arrange code for more readability.

Fixed clang-tidy warnings.

Harbormaster completed remote builds in B86399: Diff 318736.Jan 22 2021, 11:51 PM

rampitec removed a reviewer: rampitec.Jan 23 2021, 12:02 AM

Harbormaster completed remote builds in B86400: Diff 318738.Jan 23 2021, 12:32 AM

Code re-organization.

Corrected a few comments.

Harbormaster completed remote builds in B86501: Diff 318898.Jan 24 2021, 9:12 PM

Harbormaster completed remote builds in B86502: Diff 318899.Jan 24 2021, 9:48 PM

Fixed one of the FIXME comments which is associated with indirect calls.

Harbormaster completed remote builds in B86505: Diff 318904.Jan 24 2021, 11:27 PM

Make use of llvm append_range() api.

Harbormaster completed remote builds in B86555: Diff 318985.Jan 25 2021, 7:07 AM

Improvements to code in a few places.

hsmhsm edited the summary of this revision. (Show Details)Jan 25 2021, 9:01 AM

hsmhsm edited the summary of this revision. (Show Details)

Harbormaster completed remote builds in B86580: Diff 319026.Jan 25 2021, 9:45 AM

All tests are now missing

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
32 ↗	(On Diff #319026)	This function is pointless, just directly use isModuleEntryFunctionCC
41–44 ↗	(On Diff #319026)	This function is useless. Assert strings also don't need to end in \n
56 ↗	(On Diff #319026)	cast<>, don't dyn_cast and assert
66 ↗	(On Diff #319026)	Pointless comment
68 ↗	(On Diff #319026)	Pointless comment
74–75 ↗	(On Diff #319026)	Pointless comment
82 ↗	(On Diff #319026)	Extra private
92 ↗	(On Diff #319026)	Extra private
93 ↗	(On Diff #319026)	Typo unhanlded
111 ↗	(On Diff #319026)	.contains
142 ↗	(On Diff #319026)	Extra private
233 ↗	(On Diff #319026)	Copy of set unnecessary
320 ↗	(On Diff #319026)	Don't need all these newlines in assert strings
376–379 ↗	(On Diff #319026)	I think you're overcomplicating the CallGraph usage by ignoring most of what it gives you. You should be able to just iterate directly through the CallGraph to get functions reachable from the parent
410–411 ↗	(On Diff #319026)	This concept doesn't quite work for the IR. The same global can appear in multiple functions
449 ↗	(On Diff #319026)	isa<>, no \n
479 ↗	(On Diff #319026)	Should not be checking the function name. Should just skip all declarations
492–495 ↗	(On Diff #319026)	Return !Kernels.empty()
502 ↗	(On Diff #319026)	Probably should skip declarations. Also not sure about the linkage check
506–509 ↗	(On Diff #319026)	Return !empty()
555–558 ↗	(On Diff #319026)	Return !empty()
561–571 ↗	(On Diff #319026)	Don't understand the point of this stub function

Fixed review comments (by Matt).

hsmhsm marked 22 inline comments as done.Jan 26 2021, 9:12 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
111 ↗	(On Diff #319026)	The data structures, ValueMap<>, SmallPtrSet<> do not have member function - `.contains()`. W.r.t std::set<>, this member function is supported in C++20.
376–379 ↗	(On Diff #319026)	As far as I understand it, llvm `CallGraph` infrastructure does not provide any facility as such. Implementer needs to explicitly iterate the callees of the caller.
410–411 ↗	(On Diff #319026)	My understanding is that - scope of the shared variable is function/statement block scope. It is not available to access outside this scope. It is just that we implement it as global, just like how the local static variables are implemented in C/C++? Can you give an example of the use-case that you are claiming?
502 ↗	(On Diff #319026)	The linkage test is required to ignore the `dynamic shared variables` like the one defined as `extern __shared__ int dy_sm[];` where size of `dy_sm` is not known at compile time, but is passed as one of the kernel execution configuration parameters at run time.
561–571 ↗	(On Diff #319026)	This is a driver function, it looks like a stub now, since implementation is not complete yet. Once this patch is accepted, next step is to (1) define kernel specific LDS layouts (2) create 2D offset table and (3) add new implicit argument.

Harbormaster completed remote builds in B86724: Diff 319316.Jan 26 2021, 9:40 AM

arsenm added inline comments.Jan 26 2021, 7:34 PM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
111 ↗	(On Diff #319026)	There is an llvm::is_contained. Also why use std::set? You randomly switch set types around here
125–128 ↗	(On Diff #319316)	Should just inline this function
182 ↗	(On Diff #319316)	Should just inline this function

arsenm added inline comments.Jan 26 2021, 7:34 PM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
410–411 ↗	(On Diff #319026)	The IR has absolutely no concept of these scopes. The global variables have global scope and no restriction on where their uses can appear. Whether or not this directly corresponds to a direct language feature is unimportant. Some IPO transforms can push global variable references into other functions. The example is just two functions that refer to the same variable:
	@lds = ...
	define void @func0() {
	  store i32 0, i32* @lds
	  ret void
	}
	define void @func1() {
	  store i32 0, i32* @lds
	  ret void
	}
502 ↗	(On Diff #319026)	That's a function of it having 0 size, not the linkage
561–571 ↗	(On Diff #319026)	If it's going to be split, I'd rather see the full stack for the review
168 ↗	(On Diff #319316)	Should just inline this function
205 ↗	(On Diff #319316)	Copy here, just directly use this in the for loop

arsenm added inline comments.Jan 26 2021, 7:45 PM

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
376–379 ↗	(On Diff #319026)	The CallGraph as a whole gives you the functions reachable from each other. I don't think you need to do a stack walk to find the callees. You don't need to care about which functions specifically call which, just that they are all connected

Pushing renamed code (not for review).

Harbormaster completed remote builds in B87117: Diff 320037.Jan 28 2021, 8:51 PM

Fixed Matt's comments.

hsmhsm retitled this revision from [AMDGPU][WIP] Lower Function Local LDS Variables. to [AMDGPU][WIP] Lower LDS Global Variables..Jan 29 2021, 3:11 AM

Harbormaster completed remote builds in B87143: Diff 320090.Jan 29 2021, 3:52 AM

hsmhsm edited the summary of this revision. (Show Details)Feb 2 2021, 7:59 PM

Save current work.

Harbormaster completed remote builds in B88393: Diff 322270.Feb 8 2021, 8:07 PM

Save current work.

Harbormaster completed remote builds in B88949: Diff 323247.Feb 12 2021, 3:05 AM

Could we get some tests *and* a commit message that explains what this is supposed to do.

arsenm added inline comments.Feb 12 2021, 8:36 AM

llvm/lib/Target/AMDGPU/AMDGPULowerLDSGlobal.cpp
311 ↗	(On Diff #323247)	You should only need to do the use replacement, you aren't changing the types of the instructions so cloning/hacking on them shouldn't be needed
446–448 ↗	(On Diff #323247)	Don't need this, the IR would have failed the verifier to get here
467–468 ↗	(On Diff #323247)	Should not const_cast
517–518 ↗	(On Diff #323247)	Too much auto for me
553 ↗	(On Diff #323247)	Should use lowercase, period separator naming convention with an llvm.amdgcn prefix
565 ↗	(On Diff #323247)	Should use lowercase, period separator naming convention with an llvm.amdgcn prefix
608 ↗	(On Diff #323247)	This still needs to add in alignment padding
668 ↗	(On Diff #323247)	Braces, Also can use range loop
742 ↗	(On Diff #323247)	I don't see why you need to build your own stack. The call graph already found the reachable functions for you
776 ↗	(On Diff #323247)	I think trying to handle callees is left for a later patch. Additionally, I think this should be the CallGraph analysis's responsibility to deal with
854–857 ↗	(On Diff #323247)	The callgraph should already give this to you. Iterating the call graph should give you all of the functions you care about. You don't actually need to worry about which functions call which, since you need to touch every function in the SCC
986–987 ↗	(On Diff #323247)	I would swap the order of these checks
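
For readers unfamiliar with the alignment padding mentioned in the comment on line 608, here is a minimal C++ sketch of how member offsets are usually computed when packing variables into one block. The names `Member`, `alignUp` and `layoutOffsets` are hypothetical, and the sketch does not claim to reproduce the patch; in LLVM itself the `alignTo` helper serves the same purpose as `alignUp` here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one variable to be laid out.
struct Member {
  std::uint64_t size;  // size in bytes
  std::uint64_t align; // required alignment in bytes, a power of two
};

// Round `offset` up to the next multiple of `align`.
std::uint64_t alignUp(std::uint64_t offset, std::uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0 &&
         "alignment must be a power of two");
  return (offset + align - 1) & ~(align - 1);
}

// Assign each member an offset, inserting padding where the running offset
// does not satisfy the member's alignment. `totalSize` receives the end of
// the last member (no tail padding is added in this sketch).
std::vector<std::uint64_t> layoutOffsets(const std::vector<Member> &members,
                                         std::uint64_t &totalSize) {
  std::vector<std::uint64_t> offsets;
  std::uint64_t offset = 0;
  for (const Member &m : members) {
    offset = alignUp(offset, m.align);
    offsets.push_back(offset);
    offset += m.size;
  }
  totalSize = offset;
  return offsets;
}
```

Without the `alignUp` step, a 4-byte-aligned member placed after a 1-byte member would land at offset 1 instead of 4.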

In D91516#2560020, @jdoerfert wrote:

Could we get some tests *and* a commit message that explains what this is supposed to do.

This is WIP, will add test and commit messages at the end before final review.

Address Matt's comments.

hsmhsm marked 7 inline comments as done.Feb 12 2021, 8:38 PM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULowerLDSGlobal.cpp
311 ↗	(On Diff #323247)	Added FIXME comment, will see how to fix it.
608 ↗	(On Diff #323247)	Added FIXME comment, will see how to fix it.
742 ↗	(On Diff #323247)	Added FIXME comment, will see how to fix it.
776 ↗	(On Diff #323247)	Added FIXME comment, will see how to fix it.
854–857 ↗	(On Diff #323247)	Added FIXME comment, will see how to fix it.

Harbormaster completed remote builds in B89084: Diff 323514.Feb 12 2021, 9:14 PM

Assert that both caller and callee appear in same translation unit.

Harbormaster completed remote builds in B89087: Diff 323518.Feb 12 2021, 10:48 PM

Address clang-tidy warnings.

Remove over created auto variables.

Harbormaster completed remote builds in B89096: Diff 323532.Feb 13 2021, 1:07 AM

Harbormaster completed remote builds in B89095: Diff 323530.Feb 13 2021, 1:16 AM

In D91516#2561383, @hsmhsm wrote:

In D91516#2560020, @jdoerfert wrote:

Could we get some tests *and* a commit message that explains what this is supposed to do.

This is WIP, will add test and commit messages at the end before final review.

I generally would recommend against that but I guess you can use phabricator this way.
However, once people can figure out what this is actually supposed to do, they might effectively restart the entire review process if the design is questioned.
I say this because I have a hunch, or maybe a hope, about the intent of this patch. If it would be that, I'd very much like this to be a generic, non-AMDGPU pass. I might be wrong about what this does, and that is what I'd like to figure out rather sooner than later.

This revision now requires changes to proceed.Feb 13 2021, 5:10 PM

In D91516#2562027, @jdoerfert wrote:

In D91516#2561383, @hsmhsm wrote:

In D91516#2560020, @jdoerfert wrote:

Could we get some tests *and* a commit message that explains what this is supposed to do.

This is WIP, will add test and commit messages at the end before final review.

I generally would recommend against that but I guess you can use phabricator this way.
However, once people can figure out what this is actually supposed to do, they might effectively restart the entire review process if the design is questioned.
I say this because I have a hunch, or maybe a hope, about the intent of this patch. If it would be that, I'd very much like this to be a generic, non-AMDGPU pass. I might be wrong about what this does, and that is what I'd like to figure out rather sooner than later.

OK, let's wait for the complete patch then. I will only push the complete patch next time. But it may take some time, since there are some major hurdles to overcome.

Implemented a new approach based on initializing LDS globals to pointers.

Herald added a subscriber: jfb. · View Herald TranscriptMar 22 2021, 11:32 PM

hsmhsm retitled this revision from [AMDGPU][WIP] Lower LDS Global Variables. to [AMDGPU] Replace uses of LDS globals within non-kernel functions by pointers..Mar 22 2021, 11:33 PM

hsmhsm edited the summary of this revision. (Show Details)

hsmhsm edited the summary of this revision. (Show Details)Mar 22 2021, 11:35 PM

hsmhsm added a reviewer: rampitec.Mar 22 2021, 11:41 PM

hsmhsm mentioned this in D98865: [AMDGPU] Disable forceful inline of non-kernel functions which use LDS..Mar 22 2021, 11:52 PM

This is much more complicated than I expected. Is the large amount of comments largely from a previous patch doing different things that has been hammered into this one?

@jdoerfert the transform I think this is intended to do is:

find a large shared variable used from a function
add a new void*, also in shared, pointing to it
initialize that void* only in kernels that can call functions that use the large variable
replace all uses with

That means, on amdgcn, the large variable only costs LDS space in kernels that definitely use it. I don't know how cuda lowers shared accesses from functions, it could plausibly benefit from the same transform.

Harbormaster completed remote builds in B95174: Diff 332541.Mar 23 2021, 12:46 AM

I can't work out which LDS variables you intend to replace with pointers from the code. Could you spell out what the condition under which you intend to replace one is?

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
177	Why not isa<GlobalVariable> / function needs a different name
204	Functions define hasAddressTaken, but also I don't think this pass needs to distinguish between direct and indirect calls
224	This I haven't read yet, but it looks like far too much state. Expected a set of LDS globals called 'toReplaceWithPointer' or similar instead of all the maps
384	Why do we want to replace constexpr with instructions? This comment contradicts the implementation
llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.cpp
20	Perhaps name the new files after LDS to make it clearer that they're used for LDS lowering an optimisation, not necessarily general purpose. Also move the functions out in a separate commit, without changes to their implementation, as that improves the signal/noise of the functional change.
59	e.g. I recognise this as newly introduced by the comment, but in phab it's hard to distinguish from things that haven't changed
llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.h
16	Include list should be limited to those that are used by the header, with the ones used by the source included there

JonChesterfield added a reviewer: JonChesterfield.Mar 23 2021, 3:43 AM

In D91516#2643733, @JonChesterfield wrote:

This is much more complicated than I expected.

We need to *really* discuss, what is complicated here and what is violated here from the internal email discussions.

Is the large amount of comments largely from a previous patch doing different things that has been hammered into this one?

No, nothing is hammered from the previous patch. The current patch implements what was planned via internal email discussion.

@jdoerfert the transform I think this is intended to do is:

find a large shared variable used from a function

add a new void*, also in shared, pointing to it

initialize that void* only in kernels that can call functions that use the large variable

replace all uses with

No, the intended implementation plan which is implemented here is as follows.

(1) Identify the LDS globals (whether large or small) which are used within non-kernel function scope and in global scope.
(2) Create new LDS globals of i16 type corresponding to every LDS global identified above. The i16 typed LDS globals act as pointers to the corresponding original LDS globals.
(3) Push the *use* of the above identified LDS globals to kernels by adding instructions within the kernels which initialize the respective pointers with the addresses of the original LDS globals. This makes sure that per-kernel LDS allocation for these LDS globals happens correctly.
(4) Within non-kernel functions, replace the *use* of original LDS globals by their respective pointers.
(5) Keep the global scope uses of original LDS globals unchanged. They should now work automatically: since a use of each original LDS global (the pointer initialization) is also present within all kernels, the global scope uses are semantically correct because of the per-kernel LDS allocation for these LDS globals.

That means, on amdgcn, the large variable only costs LDS space in kernels that definitely use it. I don't know how cuda lowers shared accesses from functions, it could plausibly benefit from the same transform.

Let's not bother about how CUDA handles it, since there are a lot of differences here, and focus only on AMDGCN.

In D91516#2644163, @JonChesterfield wrote:

I can't work out which LDS variables you intend to replace with pointers from the code. Could you spell out what the condition under which you intend to replace one is?

All those LDS globals which are used within non-kernel functions and within global scope require pointer initialization within kernels.

hsmhsm added inline comments.Mar 29 2021, 1:13 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
177	Because LDS would be nested within the const expr within global scope use.
204	I am not getting this comment, probably we can discuss it offline.
224	These maps are required for the logic where we really need to restrict the LDS set for kernel based on kernel execution paths.
384	Again not clear about what you intended here - let's take it offline.
llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.cpp
20	will think about it.
59	not sure what you mean here. Let's discuss offline.
llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.h
16	agree.

Hi Jon,

I have replied to some of your review comments; a few other comments require internal discussion for a quick and unambiguous conclusion. I am expecting a response from you.

In D91516#2655086, @hsmhsm wrote:

In D91516#2643733, @JonChesterfield wrote:

This is much more complicated than I expected.
Is the large amount of comments largely from a previous patch doing different things that has been hammered into this one?

No, nothing is hammered from the previous patch. The current patch implements what was planned via internal email discussion.

This review dates from November 16 last year and contains hundreds of review comments against code that may or may not still be in the latest revision, this being diff #34 at time of commenting. If I'm following along successfully, the design has changed significantly and repeatedly during that process. It is therefore very difficult to determine what the design intent behind the current revision is. That is what I mean by 'previous patch has been hammered into this one'.

The algorithm I had in mind was along the lines of:

for each LDS variable:
  if should-transform
    create 16 bit integer in LDS
    initialize that global with (constexpr) address of variable
    replace all uses of variable with a (constexpr) access through new pointer

where
should-transform:
 if (sizeof) < 8ish return false
 if used by instruction in indirectly called function return false
 if only used by kernels return false
 probably other exclusions
 return true
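
The `should-transform` condition above can be written out as a small C++ predicate. This is only a sketch of the pseudocode in this comment: the `LdsInfo` fields and the exact 8-byte threshold ("8ish" in the comment) are assumptions, and the "probably other exclusions" are not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of what is known about one LDS variable.
struct LdsInfo {
  std::uint64_t sizeBytes;             // allocation size of the variable
  bool usedInIndirectlyCalledFunction; // used by an instruction in such a function
  bool usedInNonKernelFunction;        // false means "only used by kernels"
};

// Assumed threshold; the comment only says "8ish".
constexpr std::uint64_t kMinSizeBytes = 8;

// Mirrors the pseudocode: return true only if no exclusion applies.
bool shouldTransform(const LdsInfo &v) {
  if (v.sizeBytes < kMinSizeBytes)
    return false; // too small for a pointer indirection to pay off
  if (v.usedInIndirectlyCalledFunction)
    return false; // callers of indirectly called functions are unknown
  if (!v.usedInNonKernelFunction)
    return false; // only used by kernels, nothing to gain
  return true;
}
```

Keeping the decision in one predicate would let the rest of the pass work from a single set of variables to replace, which is the shape suggested in the inline comment on line 224.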

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
384	I was thinking the introduced 16 bit pointers will be initialised with constexpr from the corresponding variable. This patch presently initialises them with undef, which I think thwarts using constexpr everywhere, and means we insert stores in the kernel entry basic block here. If we fix the back end to handle LDS variables with initializers (at least the simple case of only used from kernel and initialized with address of some other variable), then quite a lot of the complexity of this patch drops out.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
196	I think this test reads better as proposed here - 'enable-lower-module-lds=true' is better than 'disable-lower-module-lds=false'. Separable from the rest of this patch, we could land a patch that just inverts that commandline flag and updates the tests to match. That removes some noise from this review.
llvm/test/CodeGen/AMDGPU/lower-module-lds.ll
22	This test should only check the behaviour of lower-module-lds. Separate tests check the behaviour of amdgpu-replace-lds-use-with-pointer. Equally, running amdgpu-lower-module-lds by itself should not automatically run amdgpu-replace-lds-use-with-pointer and vice versa.
llvm/test/CodeGen/AMDGPU/replace_lds_report_error_no_func_def.ll
3	This is an error in the implementation, not something that should have a test checking the implementation is broken. Instead of assuming the definition of both are in the same module and crashing if they aren't, the pass should ignore a variable which doesn't meet that requirement.
llvm/test/CodeGen/AMDGPU/replace_lds_test_direct_call_misc.ll
15	These tests would be more robust if the new pointer was named based on the global it is intended to reference, as then the regex can check that we created load from the correct pointer (as opposed to just one of the new pointers).

hsmhsm mentioned this in D100594: [AMDGPU] Bit of code clean-up around collection of LDS globals which requires lowering.Apr 16 2021, 6:38 AM

In D91516#2682974, @JonChesterfield wrote:

The algorithm I had in mind was along the lines of:

for each LDS variable:
  if should-transform
    create 16 bit integer in LDS
    initialize that global with (constexpr) address of variable
    replace all uses of variable with a (constexpr) access through new pointer

where
should-transform:
 if (sizeof) < 8ish return false
 if used by instruction in indirectly called function return false
 if only used by kernels return false
 probably other exclusions
 return true

I think all of us who are involved in discussing the functionality related to this patch are not on the same page. First, we need to discuss it internally and make sure that we are all on the same page before I make any further changes to this patch.

By the way, this patch has gone through too many revisions, and it is becoming too complex to go back to any previous state of the patch when required. So I think it is better to abandon this patch and start on a clean slate with a fresh new patch. If I do not get any objection, I will abandon it in a day or two.

As mentioned earlier, I am abandoning this patch. Let's start with a clean slate, and decide on the implementation.

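Revision Contents

	Path	Size
	llvm/lib/Target/AMDGPU/AMDGPU.h	9 lines
	llvm/lib/Target/AMDGPU/AMDGPUAlwaysInlinePass.cpp	23 lines
	llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp	106 lines
	llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp	758 lines
	llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h	1 line
	llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	16 lines
	llvm/lib/Target/AMDGPU/CMakeLists.txt	1 line
	llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.h	43 lines
	llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.cpp	168 lines
	llvm/lib/Target/AMDGPU/Utils/CMakeLists.txt	1 line
	llvm/test/CodeGen/AMDGPU/GlobalISel/lds-global-non-entry-func.ll	4 lines
	llvm/test/CodeGen/AMDGPU/addrspacecast-initializer-unsupported.ll	2 lines
	llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address-codegen.ll	6 lines
	llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address.ll	8 lines
	llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll	4 lines
	llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll	6 lines
	llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll	2 lines
	llvm/test/CodeGen/AMDGPU/lower-module-lds-indirect.ll	50 lines
	llvm/test/CodeGen/AMDGPU/lower-module-lds-inline-asm-call.ll	31 lines
	llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll	8 lines
	llvm/test/CodeGen/AMDGPU/lower-module-lds.ll	50 lines
	llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll	2 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_report_error_no_func_def.ll	20 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_direct_call_diamond_shape.ll	63 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_direct_call_misc.ll	89 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_ignored_lds.ll	80 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_diamond_shape.ll	81 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_misc.ll	113 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_misc2.ll	128 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_no_addr_taken.ll	77 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_no_init.ll	69 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_llvm_insts.ll	32 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_types_misc.ll	39 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_types_pointers.ll	55 lines
	llvm/test/CodeGen/AMDGPU/replace_lds_test_types_pointers_misc.ll	65 lines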

Diff 332541

llvm/lib/Target/AMDGPU/AMDGPU.h

	Show First 20 Lines • Show All 335 Lines • ▼ Show 20 Lines
	extern char &AMDGPUOpenCLEnqueuedBlockLoweringID;			extern char &AMDGPUOpenCLEnqueuedBlockLoweringID;

	void initializeGCNRegBankReassignPass(PassRegistry &);			void initializeGCNRegBankReassignPass(PassRegistry &);
	extern char &GCNRegBankReassignID;			extern char &GCNRegBankReassignID;

	void initializeGCNNSAReassignPass(PassRegistry &);			void initializeGCNNSAReassignPass(PassRegistry &);
	extern char &GCNNSAReassignID;			extern char &GCNNSAReassignID;

				ModulePass *createAMDGPUReplaceLDSUseWithPointerPass();
				void initializeAMDGPUReplaceLDSUseWithPointerPass(PassRegistry &);
				extern char &AMDGPUReplaceLDSUseWithPointerID;
				struct AMDGPUReplaceLDSUseWithPointerPass
				: PassInfoMixin<AMDGPUReplaceLDSUseWithPointerPass> {
				AMDGPUReplaceLDSUseWithPointerPass() {}
				PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
				};

	namespace AMDGPU {			namespace AMDGPU {
	enum TargetIndex {			enum TargetIndex {
	TI_CONSTDATA_START,			TI_CONSTDATA_START,
	TI_SCRATCH_RSRC_DWORD0,			TI_SCRATCH_RSRC_DWORD0,
	TI_SCRATCH_RSRC_DWORD1,			TI_SCRATCH_RSRC_DWORD1,
	TI_SCRATCH_RSRC_DWORD2,			TI_SCRATCH_RSRC_DWORD2,
	TI_SCRATCH_RSRC_DWORD3			TI_SCRATCH_RSRC_DWORD3
	};			};
	▲ Show 20 Lines • Show All 72 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUAlwaysInlinePass.cpp

Show First 20 Lines • Show All 110 Lines • ▼ Show 20 Lines	static bool alwaysInlineImpl(Module &M, bool GlobalOpt) {
// is something of a workaround because we don't have a way of supporting LDS		// is something of a workaround because we don't have a way of supporting LDS
// objects defined in functions. LDS is always allocated by a kernel, and it		// objects defined in functions. LDS is always allocated by a kernel, and it
// is difficult to manage LDS usage if a function may be used by multiple		// is difficult to manage LDS usage if a function may be used by multiple
// kernels.		// kernels.
//		//
// OpenCL doesn't allow declaring LDS in non-kernels, so in practice this		// OpenCL doesn't allow declaring LDS in non-kernels, so in practice this
// should only appear when IPO passes manages to move LDs defined in a kernel		// should only appear when IPO passes manages to move LDs defined in a kernel
// into a single user function.		// into a single user function.
		//
		// Since LDS uses within non-kernel functions are now handled in the
		// `LowerModuleLDS` pass, we no longer need to forcefully inline non-kernel
		// functions just because they use LDS. Do forceful inlining only when the
		// `LowerModuleLDS` pass is not enabled. It is enabled by default.

		if (!AMDGPUTargetMachine::EnableLowerModuleLDS) {
for (GlobalVariable &GV : M.globals()) {		for (GlobalVariable &GV : M.globals()) {
// TODO: Region address		// TODO: Region address
unsigned AS = GV.getAddressSpace();		unsigned AS = GV.getAddressSpace();
if (AS != AMDGPUAS::LOCAL_ADDRESS && AS != AMDGPUAS::REGION_ADDRESS)		if (AS != AMDGPUAS::LOCAL_ADDRESS && AS != AMDGPUAS::REGION_ADDRESS)
continue;		continue;

recursivelyVisitUsers(GV, FuncsToAlwaysInline);		recursivelyVisitUsers(GV, FuncsToAlwaysInline);
}		}
		}

if (!AMDGPUTargetMachine::EnableFunctionCalls || StressCalls) {		if (!AMDGPUTargetMachine::EnableFunctionCalls || StressCalls) {
auto IncompatAttr		auto IncompatAttr
= StressCalls ? Attribute::AlwaysInline : Attribute::NoInline;		= StressCalls ? Attribute::AlwaysInline : Attribute::NoInline;

for (Function &F : M) {		for (Function &F : M) {
if (!F.isDeclaration() && !F.use_empty() &&		if (!F.isDeclaration() && !F.use_empty() &&
!F.hasFnAttribute(IncompatAttr)) {		!F.hasFnAttribute(IncompatAttr)) {
Show All 31 Lines

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp

Show All 22 Lines
//		//
// A possible future refinement is to specialise the structure per-kernel, so		// A possible future refinement is to specialise the structure per-kernel, so
// that fields can be elided based on more expensive analysis.		// that fields can be elided based on more expensive analysis.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "AMDGPU.h"		#include "AMDGPU.h"
#include "Utils/AMDGPUBaseInfo.h"		#include "Utils/AMDGPUBaseInfo.h"
		#include "Utils/AMDGPUGeneralUtils.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Constants.h"		#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"		#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InlineAsm.h"		#include "llvm/IR/InlineAsm.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
Show All 9 Lines
namespace {		namespace {

class AMDGPULowerModuleLDS : public ModulePass {		class AMDGPULowerModuleLDS : public ModulePass {

static bool isKernelCC(Function *Func) {		static bool isKernelCC(Function *Func) {
return AMDGPU::isModuleEntryFunctionCC(Func->getCallingConv());		return AMDGPU::isModuleEntryFunctionCC(Func->getCallingConv());
}		}

static Align getAlign(DataLayout const &DL, const GlobalVariable *GV) {
return DL.getValueOrABITypeAlignment(GV->getPointerAlignment(DL),
GV->getValueType());
}

static bool
userRequiresLowering(const SmallPtrSetImpl<GlobalValue *> &UsedList,
User *InitialUser) {
// Any LDS variable can be lowered by moving into the created struct
// Each variable so lowered is allocated in every kernel, so variables
// whose users are all known to be safe to lower without the transform
// are left unchanged.
SmallPtrSet<User *, 8> Visited;
SmallVector<User *, 16> Stack;
Stack.push_back(InitialUser);

while (!Stack.empty()) {
User *V = Stack.pop_back_val();
Visited.insert(V);

if (auto *G = dyn_cast<GlobalValue>(V->stripPointerCasts())) {
if (UsedList.contains(G)) {
continue;
}
}

if (auto *I = dyn_cast<Instruction>(V)) {
if (isKernelCC(I->getFunction())) {
continue;
}
}

if (auto *E = dyn_cast<ConstantExpr>(V)) {
for (Value::user_iterator EU = E->user_begin(); EU != E->user_end();
++EU) {
if (Visited.insert(*EU).second) {
Stack.push_back(*EU);
}
}
continue;
}

// Unknown user, conservatively lower the variable
return true;
}

return false;
}

static std::vector<GlobalVariable *>		static std::vector<GlobalVariable *>
findVariablesToLower(Module &M,		findVariablesToLower(Module &M,
const SmallPtrSetImpl<GlobalValue *> &UsedList) {		const SmallPtrSetImpl<GlobalValue *> &UsedList) {
std::vector<llvm::GlobalVariable *> LocalVars;		std::vector<llvm::GlobalVariable *> LocalVars;
for (auto &GV : M.globals()) {		for (auto &GV : M.globals()) {
if (GV.getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS) {		if (isLDSLowereringRequired(&GV, UsedList, /*IsLDSLoweringPass=*/true))
continue;
}
if (!GV.hasInitializer()) {
// addrspace(3) without initializer implies cuda/hip extern __shared__
// the semantics for such a variable appears to be that all extern
// __shared__ variables alias one another, in which case this transform
// is not required
continue;
}
if (!isa<UndefValue>(GV.getInitializer())) {
// Initializers are unimplemented for local address space.
// Leave such variables in place for consistent error reporting.
continue;
}
if (GV.isConstant()) {
// A constant undef variable can't be written to, and any load is
// undef, so it should be eliminated by the optimizer. It could be
// dropped by the back end if not. This pass skips over it.
continue;
}
if (std::none_of(GV.user_begin(), GV.user_end(), [&](User *U) {
return userRequiresLowering(UsedList, U);
})) {
continue;
}
LocalVars.push_back(&GV);		LocalVars.push_back(&GV);
}		}
return LocalVars;		return LocalVars;
}		}

static void removeFromUsedList(Module &M, StringRef Name,		static void removeFromUsedList(Module &M, StringRef Name,
SmallPtrSetImpl<Constant *> &ToRemove) {		SmallPtrSetImpl<Constant *> &ToRemove) {
GlobalVariable *GV = M.getGlobalVariable(Name);		GlobalVariable *GV = M.getGlobalVariable(Name);
if (!GV || ToRemove.empty()) {		if (!GV || ToRemove.empty()) {
▲ Show 20 Lines • Show All 67 Lines • ▼ Show 20 Lines	static void markUsedByKernel(IRBuilder<> &Builder, Function *Func,
Value *UseInstance[1] = {Builder.CreateInBoundsGEP(		Value *UseInstance[1] = {Builder.CreateInBoundsGEP(
SGV->getValueType(), SGV, ConstantInt::get(Type::getInt32Ty(Ctx), 0))};		SGV->getValueType(), SGV, ConstantInt::get(Type::getInt32Ty(Ctx), 0))};

Builder.CreateCall(FTy, Decl, {},		Builder.CreateCall(FTy, Decl, {},
{OperandBundleDefT<Value *>("ExplicitUse", UseInstance)},		{OperandBundleDefT<Value *>("ExplicitUse", UseInstance)},
"");		"");
}		}

static SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M) {
SmallPtrSet<GlobalValue *, 32> UsedList;

SmallVector<GlobalValue *, 32> TmpVec;
collectUsedGlobalVariables(M, TmpVec, true);
UsedList.insert(TmpVec.begin(), TmpVec.end());

TmpVec.clear();
collectUsedGlobalVariables(M, TmpVec, false);
UsedList.insert(TmpVec.begin(), TmpVec.end());

return UsedList;
}

public:		public:
static char ID;		static char ID;

AMDGPULowerModuleLDS() : ModulePass(ID) {		AMDGPULowerModuleLDS() : ModulePass(ID) {
initializeAMDGPULowerModuleLDSPass(*PassRegistry::getPassRegistry());		initializeAMDGPULowerModuleLDSPass(*PassRegistry::getPassRegistry());
}		}

bool runOnModule(Module &M) override {		bool runOnModule(Module &M) override {
▲ Show 20 Lines • Show All 118 Lines • ▼ Show 20 Lines	public:
}		}
};		};

} // namespace		} // namespace
char AMDGPULowerModuleLDS::ID = 0;		char AMDGPULowerModuleLDS::ID = 0;

char &llvm::AMDGPULowerModuleLDSID = AMDGPULowerModuleLDS::ID;		char &llvm::AMDGPULowerModuleLDSID = AMDGPULowerModuleLDS::ID;

INITIALIZE_PASS(AMDGPULowerModuleLDS, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(AMDGPULowerModuleLDS, DEBUG_TYPE,
"Lower uses of LDS variables from non-kernel functions", false,		"Lower uses of LDS variables from non-kernel functions",
false)		false, false)
		// Before running current LDS lower pass, replace LDS uses within non-kernel
		// functions by pointers so that the current pass minimizes the unnecessary per
		// kernel allocation of LDS memory.
		INITIALIZE_PASS_DEPENDENCY(AMDGPUReplaceLDSUseWithPointer)
		INITIALIZE_PASS_END(AMDGPULowerModuleLDS, DEBUG_TYPE,
		"Lower uses of LDS variables from non-kernel functions",
		false, false)

ModulePass *llvm::createAMDGPULowerModuleLDSPass() {		ModulePass *llvm::createAMDGPULowerModuleLDSPass() {
return new AMDGPULowerModuleLDS();		return new AMDGPULowerModuleLDS();
}		}

PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M,		PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M,
ModuleAnalysisManager &) {		ModuleAnalysisManager &) {
return AMDGPULowerModuleLDS().runOnModule(M) ? PreservedAnalyses::none()		return AMDGPULowerModuleLDS().runOnModule(M) ? PreservedAnalyses::none()
: PreservedAnalyses::all();		: PreservedAnalyses::all();
}		}

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp

This file was added.

				//===-- AMDGPUReplaceLDSUseWithPointer.cpp --------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// One of the memory types being supported within AMD GPU memory hierarchy is
				// `shared memory`, also called `Local Data Share` or LDS for short. LDS memory
				// is the `second` fastest memory in the AMD GPU memory hierarchy (with register
				// file being fastest available memory in the hierarchy). Being faster also
				// means LDS memory is comparatively costlier and hence is a `limited` available
				// memory resource.
				//
				// Being global scoped, an LDS variable is accessible within kernel functions
				// and non-kernel functions, but two different kernel execution paths, say
				// called from two kernels K1 and K2, cannot access the same instance of an LDS
				// variable, say L. K1 and K2 must each own their own instance of L. This poses
				// some challenges, especially when lowering the LDS variables used within non-kernel
				// functions.
				//
				// So, the pass - "Lower Module LDS" lowers the LDS globals by packing them
				// within a struct type, and by creating an instance of that struct type
				// within every kernel at address zero. Though the pass - "Lower Module LDS"
				// makes some effort to minimize unnecessary LDS allocation, it is limited by
				// means of the fundamental basis and assumption upon which the pass is
				// implemented.
				//
				// The current pass acts as a helping aid to the pass - "Lower Module LDS" with
				// the intention of minimizing unnecessary LDS allocation as much as possible.
				//
				// The main idea behind the current pass is:
				//
				// (1) To identify the LDS globals used within non-kernel function scope and
				// global scope,
				// (2) To push the use of all the above identified LDS globals to kernel
				// function scope by initializing their addresses to newly created LDS
				// global pointer variables (within kernel functions),
				// (3) To replace the uses of original LDS globals within non-kernel functions
				// by their pointer counter-parts.
				// (4) This way, the transformation makes sure that the pass "Lower Module LDS"
				// packs only pointer variables within struct type, and hence significantly
				// minimizes unnecessary LDS allocation, especially when the original LDS
				// globals are big arrays (as this is the common LDS use case).
				//
				// NOTE: The pass - "Lower Module LDS" now has a tight dependency on the current
				// pass, and the current pass should always be run before running the pass
				// "Lower Module LDS". Running the pass "Lower Module LDS" alone may lead
				// to surprising results.
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "Utils/AMDGPUBaseInfo.h"
				#include "Utils/AMDGPUGeneralUtils.h"
				#include "llvm/ADT/SCCIterator.h"
				#include "llvm/ADT/SetVector.h"
				#include "llvm/ADT/SmallPtrSet.h"
				#include "llvm/ADT/SmallSet.h"
				#include "llvm/ADT/SmallVector.h"
				#include "llvm/Analysis/CallGraph.h"
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/MDBuilder.h"
				#include "llvm/IR/Module.h"
				#include "llvm/IR/ValueMap.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Transforms/Utils/Cloning.h"
				#include <map>
				#include <queue>
				#include <set>

				#define DEBUG_TYPE "amdgpu-replace-lds-use-with-pointer"

				using namespace llvm;

				namespace {

				// Error kinds for handling the errors within the context of current pass.
				enum ReplaceLDSErrorKind : uint32_t {
				LLEK_EndOfList = 0u,
				LLEK_InternalError = 2u,
				LLEK_NoCalleeDefinitionError = 3u
				};

				} // namespace

				// Report error within the context of current pass based on the error kind.
				static void reportReplaceLDSError(ReplaceLDSErrorKind EK, Value *V = nullptr) {
				std::string ErrStr("The pass \"Replace LDS Use With Pointer\" ");

				switch (EK) {
				default:
				case LLEK_InternalError: {
				ErrStr = ErrStr + std::string("has encountered an internal error.");
				break;
				}
				case LLEK_NoCalleeDefinitionError: {
				ErrStr =
				ErrStr +
				std::string("assumes that the definitions of both caller and callee "
				"appear within same module. But, definition for the "
				"callee \"") +
				V->getName().str() + std::string("\" not available.");
				break;
				}
				}

				report_fatal_error(ErrStr);
				}

				// Helper function around `ValueMap` to detect if an element exists within it.
				template <typename R, typename E>
				static bool contains(R &&VMap, const E &Element) {
				return VMap.find(Element) != VMap.end();
				}

				// Within User `U` replace the use(s) of `OldValue` by `NewValue`.
				static void updateUserOperand(User *U, Value *OldValue, Value *NewValue) {
				unsigned Ind = 0;
				for (Use &UU : U->operands()) {
				if (UU.get() == OldValue)
				U->setOperand(Ind, NewValue);
				++Ind;
				}
				}

				// Convert `ConstantExpr CE` to a corresponding set of instructions, and update
				// users of `CE` to use corresponding instructions.
				static Instruction *
				replaceConstExprByInst(ConstantExpr *CE,
				SmallPtrSetImpl<Instruction *> &Insts) {
				Instruction *NI = nullptr;

				SmallVector<User *, 8> CEUsers;
				append_range(CEUsers, CE->users());

				for (auto *U : CEUsers) {
				auto *I = dyn_cast<Instruction>(U);
				if (!I) {
				auto *CE2 = dyn_cast<ConstantExpr>(U);
				assert(CE2 && "Constant expression expected.");
				I = replaceConstExprByInst(CE2, Insts);
				}

				NI = CE->getAsInstruction();
				NI->insertBefore(I);
				updateUserOperand(I, CE, NI);
				CE->removeDeadConstantUsers();
				Insts.insert(NI);
				}

				assert(NI && "Instruction expected.");

				return NI;
				}

				// `U` should be either `Instruction` OR `ConstantExpr`. If it is `Instruction`
				// return it, if it is `ConstantExpr` break it into a set of instructions and
				// return it.
				static void getInstructions(User *U, SmallPtrSetImpl<Instruction *> &Insts) {
				if (auto *I = dyn_cast<Instruction>(U)) {
				// Return instruction `I`.
				Insts.insert(I);
				} else if (auto *CE = dyn_cast<ConstantExpr>(U)) {
				// Break const expression `CE` into a set of instructions.
				replaceConstExprByInst(CE, Insts);
				} else {
				// Unexpected control flow - what else is missing?
				reportReplaceLDSError(LLEK_InternalError);
				}
				}

				// Return true if the user `U` is a global variable.
				static bool isUserGlobalVariable(User *U) {
				SmallVector<User *, 8> UserStack;
				JonChesterfield (Not Done): Why not isa<GlobalVariable> / function needs a different name
				hsmhsm (Author, Done): Because LDS would be nested within the const expr within global scope use.
				SmallPtrSet<User *, 8> VisitedUsers;

				UserStack.push_back(U);

				while (!UserStack.empty()) {
				auto *U = UserStack.pop_back_val();

				if (!VisitedUsers.insert(U).second)
				continue;

				if (isa<GlobalVariable>(U))
				return true;

				if (isa<Constant>(U)) {
				append_range(UserStack, U->users());
				continue;
				}

				if (isa<Instruction>(U))
				return false;
				}

				return false;
				}

				// Collect functions whose address is taken within the module.
				static void collectAddressTakenFunctions(
				JonChesterfield (Not Done): Functions define hasAddressTaken, but also I don't think this pass needs to distinguish between direct and indirect calls
				hsmhsm (Author, Done): I am not getting this comment, probably we can discuss it offline.
				CallGraph &CG, SmallPtrSetImpl<CallGraphNode *> &AddressTakenSet) {
				auto *ExternalCallingNode = CG.getExternalCallingNode();
				for (auto GI = ExternalCallingNode->begin(), GE = ExternalCallingNode->end();
				GI != GE; ++GI) {
				auto *CGN = GI->second;
				auto *F = CGN->getFunction();
				// Note that we intentionally collect "declared only" address taken functions
				// too here, but later, error will be thrown when we check for the
				// definition of callees since this pass assumes that both caller and callee
				// appear within the same module.
				// FIXME: Anything else need to be excluded?
				if (!F || AMDGPU::isModuleEntryFunctionCC(F->getCallingConv()))
				continue;
				AddressTakenSet.insert(CGN);
				}
				}

				namespace {

				class ReplaceLDSUseImpl {
				JonChesterfield (Not Done): This I haven't read yet, but it looks like far too much state. Expected a set of LDS globals called 'toReplaceWithPointer' or similar instead of all the maps
				hsmhsm (Author, Done): These maps are required for the logic where we really need to restrict the LDS set for kernel based on kernel execution paths.
				Module &M;
				LLVMContext &Ctx;
				const DataLayout &DL;

				// Holds all kernels defined within the module `M`.
				SmallPtrSet<Function *, 8> Kernels;

				// Holds all LDS globals defined within the module `M`.
				SmallPtrSet<GlobalVariable *, 8> LDSGlobals;

				// Holds all those LDS globals which are used as initializers within some
				// other global variable definitions.
				SmallPtrSet<GlobalVariable *, 8> LDSGlobalsAsInitializers;

				// Associates LDS global to a list of functions which references that LDS.
				ValueMap<GlobalVariable *, SmallPtrSet<Function *, 8>> LDSGlobalToAccessors;

				// Associates function to a list of LDS globals which are referenced within
				// that function.
				ValueMap<Function *, SmallPtrSet<GlobalVariable *, 8>> AccessorToLDSGlobals;

				// Associates kernel to a list of non-kernel functions which are reachable
				// from that kernel.
				ValueMap<Function *, SmallPtrSet<Function *, 8>> KernelToCallees;

				// Associates kernel to a list of LDS globals which are referenced along the
				// run time kernel execution paths (within non-kernel functions) associated
				// with that kernel.
				ValueMap<Function *, SmallPtrSet<GlobalVariable *, 8>> KernelToLDSGlobals;

				// Associates LDS global to a unique pointer which points to that LDS global.
				ValueMap<GlobalVariable *, GlobalVariable *> LDSToPointer;

				// Associates non-kernel function to an LDS global to a list of int-to-ptr
				// instructions.
				std::map<Function *, std::map<GlobalVariable *, Value *>> FunctionToLDSToInst;

				public:
				explicit ReplaceLDSUseImpl(Module &M)
				: M(M), Ctx(M.getContext()), DL(M.getDataLayout()) {}

				// Entry-point function.
				bool replace();

				private:
				//===--------------------------------------------------------------------===//
				// Methods which aid in creating new global LDS pointers which point to
				// original LDS globals which are referenced within non-kernel functions.
				//===--------------------------------------------------------------------===//

				// Construct an `IntToPtr` instruction which replaces `LDS` within F.
				Value *getIntToPtrInst(Function *F, GlobalVariable *LDS,
				GlobalVariable *LDSPointer);

				// Replace all uses of original LDS globals within all non-kernel functions by
				// their respective LDS pointers.
				void replaceUsesOfLDSGlobalsByPointers();

				// Insert global LDS pointers (which point to original LDS globals which are
				// referenced within non-kernel functions) and initialize them within kernels
				// to point to respective LDS globals.
				void insertAndInitializeLDSPointers();

				//===--------------------------------------------------------------------===//
				// Methods which aid in creating the various `map` data structures.
				//===--------------------------------------------------------------------===//

				// Associate each kernel K with LDS globals which are being accessed by K
				// and/or by the callees of K.
				void createKernelToLDSGlobalsMap();

				// Collect all call graph nodes which are reachable from the node `CGN`.
				void
				collectReachableCallGraphNodes(CallGraphNode *CGN,
				SetVector<CallGraphNode *> &ReachableCGNodes);

				// Resolve all indirect call sites within the call graph node `CGN`.
				void
				resolveIndirectCallSites(CallGraphNode *CGN, CallGraph &CG,
				SmallPtrSetImpl<CallGraphNode *> &AddressTakenSet,
				SetVector<CallGraphNode *> &ReachableCGNodes);

				// Traverse `CallGraph` starting from the `CallGraphNode` associated with each
				// kernel `K` and collect all callees which are reachable from K (including
				// indirectly called callees).
				void createKernelToCalleesMap();

				// Associate each kernel/function with the LDS globals which are being
				// accessed within them.
				void createAccessorToLDSGlobalsMap();

				// For each `LDS`, recursively visit its user list and find all those
				// kernels/functions within which the `LDS` is being accessed.
				void createLDSGlobalToAccessorsMap();

				// For each kernel `K`, collect LDS globals which are being accessed during
				// the execution of `K`.
				bool collectPerKernelAccessibleLDSGlobals();

				//===--------------------------------------------------------------------===//
				// Methods which aid in creating the various `set` data structures.
				//===--------------------------------------------------------------------===//

				// Collect all the LDS globals defined within the current module which require
				// pointer replacement.
				bool collectLDSGlobals();

				// Collect all the amdgpu kernels defined within the current module.
				bool collectKernels();
				};

				// Construct an `IntToPtr` instruction which replaces `LDS` within F.
				Value *ReplaceLDSUseImpl::getIntToPtrInst(Function *F, GlobalVariable *LDS,
				GlobalVariable *LDSPointer) {
				// Create an entry for `F` within `FunctionToLDSToInst`.
				if (!contains(FunctionToLDSToInst, F))
				FunctionToLDSToInst[F] = std::map<GlobalVariable *, Value *>();

				// `IntToPtr` instruction to be constructed.
				Value *IToP = nullptr;

				auto &LDSToInst = FunctionToLDSToInst[F];
				if (!contains(LDSToInst, LDS)) {
				// Get the instruction insertion point within the beginning of the entry
				// block of current non-kernel function.
				auto *EI = &(*(F->getEntryBlock().getFirstInsertionPt()));
				IRBuilder<> Builder(EI);

				// Insert Load and IntToPtr instructions.
				IToP = Builder.CreateIntToPtr(
				Builder.CreateLoad(LDSPointer->getValueType(), LDSPointer),
				LDS->getType());
				LDSToInst[LDS] = IToP;
				} else
				IToP = LDSToInst[LDS];

				return IToP;
				}

				// Replace all uses of original LDS globals within all non-kernel functions by
				// their respective LDS pointers.
				void ReplaceLDSUseImpl::replaceUsesOfLDSGlobalsByPointers() {
				for (auto LI = LDSToPointer.begin(), LE = LDSToPointer.end(); LI != LE;
				++LI) {
				auto *LDS = LI->first;
				auto *LDSPointer = LI->second;

				SmallVector<User *, 16> LDSUsers(LDS->users());
				for (auto *U : LDSUsers) {
				// `U` is a global variable (from different address space) which got
				// initialized with `LDS`. No need to handle it.
				if (isUserGlobalVariable(U))
				continue;

				// `U` is from within some function. Since the replacers of `LDS` within
				// `U` are instructions, and if `U` is a const expression, then we cannot
				// embed instructions within const expressions. Hence, get appropriate
				// instructions if `U` is a const expression.
				SmallPtrSet<Instruction *, 8> Insts;
				getInstructions(U, Insts);
				JonChesterfield (Not Done): Why do we want to replace constexpr with instructions? This comment contradicts the implementation
				hsmhsm (Author, Done): Again not clear about what you intended here - let's take it offline.
				JonChesterfield (Not Done): I was thinking the introduced 16 bit pointers will be initialised with constexpr from the corresponding variable. This patch presently initialises them with undef, which I think thwarts using constexpr everywhere, and means we insert stores in the kernel entry basic block here. If we fix the back end to handle LDS variables with initializers (at least the simple case of only used from kernel and initialized with address of some other variable), then quite a lot of the complexity of this patch drops out.

				for (auto *I : Insts) {
				// Get function to which `I` belongs to.
				auto *F = I->getParent()->getParent();

				// Ignore uses within kernels.
				if (AMDGPU::isModuleEntryFunctionCC(F->getCallingConv()))
				continue;

				// Construct an `IntToPtr` instruction which replaces `LDS` within F.
				auto *IToP = getIntToPtrInst(F, LDS, LDSPointer);

				// Replace the uses of `LDS` within `I` by `IToP`.
				updateUserOperand(I, LDS, IToP);
				}
				}
				}
				}

				// Insert global LDS pointers (which point to original LDS globals which are
				// referenced within non-kernel functions) and initialize them within kernels to
				// point to respective LDS globals.
				void ReplaceLDSUseImpl::insertAndInitializeLDSPointers() {
				unsigned LID = 0;

				for (auto KI = KernelToLDSGlobals.begin(), KE = KernelToLDSGlobals.end();
				KI != KE; ++KI) {
				// Get the instruction insertion point within the beginning of entry block
				// of current kernel.
				auto *EI = &(*(KI->first->getEntryBlock().getFirstInsertionPt()));
				IRBuilder<> Builder(EI);

				// Insert and initialize LDS pointers for all LDS globals which are
				// associated with current kernel.
				for (auto *LDS : KI->second) {
				GlobalVariable *LDSPointer = nullptr;

				if (!contains(LDSToPointer, LDS)) {
				// `LDS` is encountered for the first time, create an LDS pointer which is
				// supposed to point to `LDS`.
				++LID;
				auto *I16Ty = Type::getInt16Ty(Ctx);
				LDSPointer = new GlobalVariable(
				M, I16Ty, false, GlobalValue::InternalLinkage,
				UndefValue::get(I16Ty),
				Twine("llvm.amdgcn.lds.pointer.") + Twine(LID), nullptr,
				GlobalVariable::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS);
				LDSPointer->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
				LDSPointer->setAlignment(getAlign(M.getDataLayout(), LDSPointer));
				LDSToPointer[LDS] = LDSPointer;
				} else {
				// An LDS pointer which points to `LDS` is already created, get it.
				LDSPointer = LDSToPointer[LDS];
				}

				// Insert instructions at `EI` in order to initialize `LDSPointer` to
				// point to `LDS`.
				Builder.CreateStore(Builder.CreatePtrToInt(LDS, Type::getInt16Ty(Ctx)),
				LDSPointer);
				}
				}
				}

				// Associate each kernel K with LDS globals which are being accessed by K and/or
				// by the callees of K.
				void ReplaceLDSUseImpl::createKernelToLDSGlobalsMap() {
				for (auto *K : Kernels) {
				SmallPtrSet<GlobalVariable *, 8> LDSSet(LDSGlobalsAsInitializers.begin(),
				LDSGlobalsAsInitializers.end());

				// Collect all those LDS globals which are being accessed by the callees of
				// kernel K.
				if (contains(KernelToCallees, K)) {
				for (auto *Callee : KernelToCallees[K]) {
				if (contains(AccessorToLDSGlobals, Callee))
				LDSSet.insert(AccessorToLDSGlobals[Callee].begin(),
				AccessorToLDSGlobals[Callee].end());
				}
				}

				if (!LDSSet.empty())
				KernelToLDSGlobals[K] = LDSSet;
				}
				}

				// Collect all call graph nodes which are reachable from the node `CGN`.
				void ReplaceLDSUseImpl::collectReachableCallGraphNodes(
				CallGraphNode *CGN, SetVector<CallGraphNode *> &ReachableCGNodes) {
				for (scc_iterator<CallGraphNode *> I = scc_begin(CGN); !I.isAtEnd(); ++I) {
				const std::vector<CallGraphNode *> &SCC = *I;
				assert(!SCC.empty() && "SCC with no functions?");
				for (auto *CGNode : SCC)
				ReachableCGNodes.insert(CGNode);
				}
				}

				// Resolve all indirect call sites within the call graph node `CGN`.
				void ReplaceLDSUseImpl::resolveIndirectCallSites(
				CallGraphNode *CGN, CallGraph &CG,
				SmallPtrSetImpl<CallGraphNode *> &AddressTakenSet,
				SetVector<CallGraphNode *> &ReachableCGNodes) {
				for (auto GI = CGN->begin(), GE = CGN->end(); GI != GE; ++GI) {
				auto *CB = cast<CallBase>(GI->first.getValue());

				// If the call site `CB` is not an indirect call site, ignore it, and go to
				// next one, otherwise, resolve the indirect call site `CB` to a set of
				// potential callees.
				if (!CB->isIndirectCall())
				continue;

				// "Inline asm call sites" cannot be handled. Ignore it.
				if (CB->isInlineAsm())
				continue;

				// `CB` is an indirect call, handle it.
				//
				if (auto *MD = CB->getMetadata(LLVMContext::MD_callees)) {
				// The metadata "!callees" is available at the indirect call site `CB`,
				// which means all the potential target callees for the call site `CB`
				// were successfully resolved at compile time. Collect them.
				for (const auto &Op : MD->operands()) {
				auto *GCN = CG[mdconst::extract_or_null<Function>(Op)];
				collectReachableCallGraphNodes(GCN, ReachableCGNodes);
				}
				} else {
				// The metadata "!callees" is NOT available at the indirect call site
				// `CB`, which means `CB` carries NO information about its potential
				// target callees. The simplest SAFE assumption we can make here is to
				// treat every "address taken" function whose signature matches that of
				// the call site `CB` as a potential callee. So, collect all such
				// signature-matching "address taken" functions.
				auto *CBFTy = CB->getFunctionType();
				for (auto *CGN : AddressTakenSet) {
				if (CGN->getFunction()->getFunctionType() == CBFTy)
				collectReachableCallGraphNodes(CGN, ReachableCGNodes);
				}
				}
				}
				}

				// Traverse `CallGraph` starting from the `CallGraphNode` associated with each
				// kernel `K` and collect all the callees which are reachable from K (including
				// indirectly called callees).
				void ReplaceLDSUseImpl::createKernelToCalleesMap() {
				// Create the call graph `CG` of the module `M`, collect all the address taken
				// functions, and explore `CG` to collect all the reachable callees (including
				// indirectly called callees) from all kernels.
				CallGraph CG = CallGraph(M);

				// Holds call graph nodes associated with the functions whose addresses are
				// taken within the module.
				SmallPtrSet<CallGraphNode *, 8> AddressTakenSet;

				// Collect all address taken functions within the module `M`.
				collectAddressTakenFunctions(CG, AddressTakenSet);

				for (auto *K : Kernels) {
				// Get `CallGraphNode` representing kernel `K`.
				auto *KernCGNode = CG[K];

				// Collect all call graph nodes which are reachable from `KernCGNode`.
				SetVector<CallGraphNode *> ReachableCGNodes;
				collectReachableCallGraphNodes(KernCGNode, ReachableCGNodes);

				// Remove `CallGraphNode` representing kernel `K` from reachable node set.
				ReachableCGNodes.remove(KernCGNode);

				// Collect all callees (including potential indirect callees) which are
				// reachable from kernel `K`. First, resolve all indirect call sites within
				// kernel `K`, and then `recursively` within all reachable callees from
				// kernel `K`.
				SmallPtrSet<Function *, 8> ReachableCallees;
				SmallPtrSet<CallGraphNode *, 8> VisitedCGNodes;

				resolveIndirectCallSites(KernCGNode, CG, AddressTakenSet, ReachableCGNodes);

				while (!ReachableCGNodes.empty()) {
				auto *CGN = ReachableCGNodes.pop_back_val();

				// If `CGN` is already handled OR if there is no function associated with
				// `CGN`, then ignore it.
				if (!VisitedCGNodes.insert(CGN).second || !CGN->getFunction())
				continue;

				auto *F = CGN->getFunction();

				// This pass expects both caller and callee to appear in the same module.
				// Report an error if `F` is a non-kernel function and is not a definition.
				if (!AMDGPU::isModuleEntryFunctionCC(F->getCallingConv()) &&
				F->isDeclaration())
				reportReplaceLDSError(LLEK_NoCalleeDefinitionError, F);

				// Callee associated with `CGN` is reachable from kernel `K`.
				ReachableCallees.insert(F);

				// Resolve all indirect call sites within the callee `F`.
				resolveIndirectCallSites(CGN, CG, AddressTakenSet, ReachableCGNodes);
				}

				KernelToCallees[K] = ReachableCallees;
				}
				}

				// Associate each kernel/function with the LDS globals which are being accessed
				// within them.
				void ReplaceLDSUseImpl::createAccessorToLDSGlobalsMap() {
				for (auto LI = LDSGlobalToAccessors.begin(), LE = LDSGlobalToAccessors.end();
				LI != LE; ++LI) {
				for (auto *A : LI->second) {
				if (!contains(AccessorToLDSGlobals, A)) {
				SmallPtrSet<GlobalVariable *, 8> LDSSet;
				LDSSet.insert(LI->first);
				AccessorToLDSGlobals[A] = LDSSet;
				} else
				AccessorToLDSGlobals[A].insert(LI->first);
				}
				}
				}

				// For each `LDS`, recursively visit its user list and find all those
				// kernels/functions within which the `LDS` is being accessed.
				void ReplaceLDSUseImpl::createLDSGlobalToAccessorsMap() {
				for (auto *LDS : LDSGlobals) {
				SmallPtrSet<Function *, 8> LDSAccessors;
				SmallVector<User *, 8> UserStack(LDS->users());
				SmallPtrSet<User *, 8> VisitedUsers;

				while (!UserStack.empty()) {
				auto *U = UserStack.pop_back_val();

				// `U` is already visited? continue to next one.
				if (!VisitedUsers.insert(U).second)
				continue;

				// `U` is a global variable (from different address space) which is
				// initialized with `LDS`. Ignore `U`.
				if (isa<GlobalVariable>(U)) {
				LDSGlobalsAsInitializers.insert(LDS);
				continue;
				}

				// `U` is `Constant`. Push-back users of `U`, and continue further
				// exploring the stack until an `Instruction` is found.
				if (isa<Constant>(U)) {
				append_range(UserStack, U->users());
				continue;
				}

				// `U` should be an instruction. Otherwise something is wrong.
				auto *I = dyn_cast<Instruction>(U);
				if (!I)
				reportReplaceLDSError(LLEK_InternalError);

				// We have successfully found a kernel/function within which the `LDS` is
				// being accessed, insert it into `LDSAccessors` set.
				LDSAccessors.insert(I->getParent()->getParent());
				}

				LDSGlobalToAccessors[LDS] = LDSAccessors;
				}
				}

				// For each kernel `K`, collect LDS globals which are being accessed during the
				// execution of `K`.
				bool ReplaceLDSUseImpl::collectPerKernelAccessibleLDSGlobals() {
				// Associate each LDS with the kernels/functions within which the LDS is being
				// accessed.
				createLDSGlobalToAccessorsMap();

				// Associate each kernel/function with the LDS globals which are being
				// accessed within them.
				createAccessorToLDSGlobalsMap();

				// Associate each kernel K with callees which are reachable from K (including
				// indirectly called callees).
				createKernelToCalleesMap();

				// Associate each kernel K with LDS globals which are being accessed by K
				// and/or by the callees of K.
				createKernelToLDSGlobalsMap();

				// If none of the kernels is associated with any LDS global which needs
				// pointer replacement, then there is nothing to do.
				return !KernelToLDSGlobals.empty();
				}

				// Collect all the (static) LDS globals defined within the current module which
				// require pointer replacement.
				bool ReplaceLDSUseImpl::collectLDSGlobals() {
				SmallPtrSet<GlobalValue *, 32> UsedList = getUsedList(M);
				for (auto &GV : M.globals()) {
				if (isLDSLowereringRequired(&GV, UsedList))
				LDSGlobals.insert(&GV);
				}

				return !LDSGlobals.empty();
				}

				// Collect all the amdgpu kernels defined within the current module.
				bool ReplaceLDSUseImpl::collectKernels() {
				for (auto &F : M.functions()) {
				// Collect `F` if it is a definition of an entry point function.
				if (!F.isDeclaration() &&
				AMDGPU::isModuleEntryFunctionCC(F.getCallingConv()))
				Kernels.insert(&F);
				}

				return !Kernels.empty();
				}

				// Entry-point function.
				bool ReplaceLDSUseImpl::replace() {
				// If there are no kernels defined within the module, or if there are no
				// LDS globals defined within the module, then nothing to do.
				if (!collectKernels() || !collectLDSGlobals())
				return false;

				// There are kernels and LDS globals defined within the module, but if none
				// of the LDS globals are accessed within non-kernel functions along the
				// run-time execution paths of the kernels, then there is nothing to do.
				if (!collectPerKernelAccessibleLDSGlobals())
				return false;

				// Insert global LDS pointers (which point to original LDS globals which are
				// referenced within non-kernel functions) and initialize them within kernels
				// to point to respective LDS globals.
				insertAndInitializeLDSPointers();

				// Replace all uses of original LDS globals within all non-kernel functions by
				// their respective LDS pointers.
				replaceUsesOfLDSGlobalsByPointers();

				return true;
				}

				class AMDGPUReplaceLDSUseWithPointer : public ModulePass {
				public:
				static char ID;

				AMDGPUReplaceLDSUseWithPointer() : ModulePass(ID) {
				initializeAMDGPUReplaceLDSUseWithPointerPass(
				*PassRegistry::getPassRegistry());
				}

				bool runOnModule(Module &M) override;
				};

				} // namespace

				char AMDGPUReplaceLDSUseWithPointer::ID = 0;
				char &llvm::AMDGPUReplaceLDSUseWithPointerID =
				AMDGPUReplaceLDSUseWithPointer::ID;

				INITIALIZE_PASS(AMDGPUReplaceLDSUseWithPointer, DEBUG_TYPE,
				"Replace non-kernel use of LDS with pointer",
				false /*only look at the cfg*/, false /*analysis pass*/)

				bool AMDGPUReplaceLDSUseWithPointer::runOnModule(Module &M) {
				ReplaceLDSUseImpl LDSReplacer{M};
				return LDSReplacer.replace();
				}

				ModulePass *llvm::createAMDGPUReplaceLDSUseWithPointerPass() {
				return new AMDGPUReplaceLDSUseWithPointer();
				}

				PreservedAnalyses
				AMDGPUReplaceLDSUseWithPointerPass::run(Module &M, ModuleAnalysisManager &AM) {
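				ReplaceLDSUseImpl LDSReplacer{M};
				// Only claim that all analyses are preserved when the module is unchanged.
				return LDSReplacer.replace() ? PreservedAnalyses::none()
				: PreservedAnalyses::all();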
				}
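
The kernel-to-callees collection above can be pictured without any LLVM types. The following is a minimal sketch, not the patch's implementation: `Fn` and `reachableCallees` are hypothetical names, function types are modelled as plain strings, and every indirect call site is assumed to lack `!callees` metadata, so it is conservatively resolved to all address-taken functions with a matching signature.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a function in a call graph; not an LLVM type.
struct Fn {
  std::string Signature;             // models the function type
  bool AddressTaken = false;         // true if the address escapes somewhere
  std::vector<int> DirectCallees;    // indices of directly called functions
  std::vector<std::string> IndirectCallSignatures; // one entry per indirect call site
};

// Collect every function reachable from `Kernel`, resolving each indirect
// call site to all address-taken functions with the same signature.
std::set<int> reachableCallees(const std::vector<Fn> &Module, int Kernel) {
  std::set<int> Visited;
  std::vector<int> Worklist{Kernel};

  while (!Worklist.empty()) {
    int F = Worklist.back();
    Worklist.pop_back();

    // Already handled? Continue with the next one.
    if (!Visited.insert(F).second)
      continue;

    for (int Callee : Module[F].DirectCallees)
      Worklist.push_back(Callee);

    for (const std::string &Sig : Module[F].IndirectCallSignatures)
      for (int I = 0, E = static_cast<int>(Module.size()); I != E; ++I)
        if (Module[I].AddressTaken && Module[I].Signature == Sig)
          Worklist.push_back(I);
  }

  // The kernel itself is not one of its own callees.
  Visited.erase(Kernel);
  return Visited;
}
```

The visited set plays the role of `VisitedCGNodes`: it keeps the traversal finite even when the call graph contains cycles.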

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h

Show All 29 Lines	protected:

StringRef getGPUName(const Function &F) const;		StringRef getGPUName(const Function &F) const;
StringRef getFeatureString(const Function &F) const;		StringRef getFeatureString(const Function &F) const;

public:		public:
static bool EnableLateStructurizeCFG;		static bool EnableLateStructurizeCFG;
static bool EnableFunctionCalls;		static bool EnableFunctionCalls;
static bool EnableFixedFunctionABI;		static bool EnableFixedFunctionABI;
		static bool EnableLowerModuleLDS;
		arsenm: I don't think this needs to be exposed here; there's no reason other places would need to inspect this.
		hsmhsm: This is needed in the file `AMDGPUAlwaysInlinePass.cpp`. Please look at the changes being made to this file.
		arsenm: This should be a required pass without an option to disable it that the pass would need to be aware of. Is this just for debug/bringup?
		hsmhsm: The current idea and common consensus (as I understand it) is to guard this feature by a flag, and disable it by default. HIP application programmers need to pass the option to enable this feature, at least in the beginning. Maybe in the future we can think of getting rid of the option.
		arsenm: This isn't a user-exposed flag, and there shouldn't be a need for users to set one.

AMDGPUTargetMachine(const Target &T, const Triple &TT, StringRef CPU,		AMDGPUTargetMachine(const Target &T, const Triple &TT, StringRef CPU,
StringRef FS, TargetOptions Options,		StringRef FS, TargetOptions Options,
Optional<Reloc::Model> RM, Optional<CodeModel::Model> CM,		Optional<Reloc::Model> RM, Optional<CodeModel::Model> CM,
CodeGenOpt::Level OL);		CodeGenOpt::Level OL);
~AMDGPUTargetMachine() override;		~AMDGPUTargetMachine() override;

const TargetSubtargetInfo *getSubtargetImpl() const;		const TargetSubtargetInfo *getSubtargetImpl() const;

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Show First 20 Lines • Show All 187 Lines • ▼ Show 20 Lines	static cl::opt<bool> EnableScalarIRPasses(
cl::init(true),		cl::init(true),
cl::Hidden);		cl::Hidden);

static cl::opt<bool> EnableStructurizerWorkarounds(		static cl::opt<bool> EnableStructurizerWorkarounds(
"amdgpu-enable-structurizer-workarounds",		"amdgpu-enable-structurizer-workarounds",
cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),		cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),
cl::Hidden);		cl::Hidden);

static cl::opt<bool>		static cl::opt<bool, true> EnableLowerModuleLDS(
		JonChesterfield: I think this test reads better as proposed here - 'enable-lower-module-lds=true' is better than 'disable-lower-module-lds=false'. Separable from the rest of this patch, we could land a patch that just inverts that commandline flag and updates the tests to match. That removes some noise from this review.
DisableLowerModuleLDS("amdgpu-disable-lower-module-lds", cl::Hidden,		"amdgpu-enable-lower-module-lds", cl::desc("Enable lower module lds pass"),
cl::desc("Disable lower module lds pass"),		cl::location(AMDGPUTargetMachine::EnableLowerModuleLDS), cl::init(true),
cl::init(false));		cl::Hidden);

extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {		extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
// Register the target		// Register the target
RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());		RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());
RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());		RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());

PassRegistry *PR = PassRegistry::getPassRegistry();		PassRegistry *PR = PassRegistry::getPassRegistry();
initializeR600ClauseMergePassPass(*PR);		initializeR600ClauseMergePassPass(*PR);
▲ Show 20 Lines • Show All 52 Lines • ▼ Show 20 Lines	extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeAMDGPUAAWrapperPassPass(*PR);		initializeAMDGPUAAWrapperPassPass(*PR);
initializeAMDGPUExternalAAWrapperPass(*PR);		initializeAMDGPUExternalAAWrapperPass(*PR);
initializeAMDGPUUseNativeCallsPass(*PR);		initializeAMDGPUUseNativeCallsPass(*PR);
initializeAMDGPUSimplifyLibCallsPass(*PR);		initializeAMDGPUSimplifyLibCallsPass(*PR);
initializeAMDGPUPrintfRuntimeBindingPass(*PR);		initializeAMDGPUPrintfRuntimeBindingPass(*PR);
initializeGCNRegBankReassignPass(*PR);		initializeGCNRegBankReassignPass(*PR);
initializeGCNNSAReassignPass(*PR);		initializeGCNNSAReassignPass(*PR);
initializeSIAddIMGInitPass(*PR);		initializeSIAddIMGInitPass(*PR);
		initializeAMDGPUReplaceLDSUseWithPointerPass(*PR);
}		}

static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {		static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {
return std::make_unique<AMDGPUTargetObjectFile>();		return std::make_unique<AMDGPUTargetObjectFile>();
}		}

static ScheduleDAGInstrs *createR600MachineScheduler(MachineSchedContext *C) {		static ScheduleDAGInstrs *createR600MachineScheduler(MachineSchedContext *C) {
return new ScheduleDAGMILive(C, std::make_unique<R600SchedStrategy>());		return new ScheduleDAGMILive(C, std::make_unique<R600SchedStrategy>());
▲ Show 20 Lines • Show All 117 Lines • ▼ Show 20 Lines	AMDGPUTargetMachine::AMDGPUTargetMachine(const Target &T, const Triple &TT,
if (TT.getOS() == Triple::AMDHSA &&		if (TT.getOS() == Triple::AMDHSA &&
EnableAMDGPUFixedFunctionABIOpt.getNumOccurrences() == 0)		EnableAMDGPUFixedFunctionABIOpt.getNumOccurrences() == 0)
EnableFixedFunctionABI = true;		EnableFixedFunctionABI = true;
}		}

bool AMDGPUTargetMachine::EnableLateStructurizeCFG = false;		bool AMDGPUTargetMachine::EnableLateStructurizeCFG = false;
bool AMDGPUTargetMachine::EnableFunctionCalls = false;		bool AMDGPUTargetMachine::EnableFunctionCalls = false;
bool AMDGPUTargetMachine::EnableFixedFunctionABI = false;		bool AMDGPUTargetMachine::EnableFixedFunctionABI = false;
		bool AMDGPUTargetMachine::EnableLowerModuleLDS = false;

AMDGPUTargetMachine::~AMDGPUTargetMachine() = default;		AMDGPUTargetMachine::~AMDGPUTargetMachine() = default;

StringRef AMDGPUTargetMachine::getGPUName(const Function &F) const {		StringRef AMDGPUTargetMachine::getGPUName(const Function &F) const {
Attribute GPUAttr = F.getFnAttribute("target-cpu");		Attribute GPUAttr = F.getFnAttribute("target-cpu");
return GPUAttr.isValid() ? GPUAttr.getValueAsString() : getTargetCPU();		return GPUAttr.isValid() ? GPUAttr.getValueAsString() : getTargetCPU();
}		}

▲ Show 20 Lines • Show All 98 Lines • ▼ Show 20 Lines	PB.registerPipelineParsingCallback(
PM.addPass(AMDGPUPrintfRuntimeBindingPass());		PM.addPass(AMDGPUPrintfRuntimeBindingPass());
return true;		return true;
}		}
if (PassName == "amdgpu-always-inline") {		if (PassName == "amdgpu-always-inline") {
PM.addPass(AMDGPUAlwaysInlinePass());		PM.addPass(AMDGPUAlwaysInlinePass());
return true;		return true;
}		}
if (PassName == "amdgpu-lower-module-lds") {		if (PassName == "amdgpu-lower-module-lds") {
		PM.addPass(AMDGPUReplaceLDSUseWithPointerPass());
PM.addPass(AMDGPULowerModuleLDSPass());		PM.addPass(AMDGPULowerModuleLDSPass());
return true;		return true;
}		}
return false;		return false;
});		});
PB.registerPipelineParsingCallback(		PB.registerPipelineParsingCallback(
[this](StringRef PassName, FunctionPassManager &PM,		[this](StringRef PassName, FunctionPassManager &PM,
ArrayRef<PassBuilder::PipelineElement>) {		ArrayRef<PassBuilder::PipelineElement>) {
▲ Show 20 Lines • Show All 344 Lines • ▼ Show 20 Lines	void AMDGPUPassConfig::addIRPasses() {
// bitcast calls.		// bitcast calls.
addPass(createAMDGPUFixFunctionBitcastsPass());		addPass(createAMDGPUFixFunctionBitcastsPass());

// A call to propagate attributes pass in the backend in case opt was not run.		// A call to propagate attributes pass in the backend in case opt was not run.
addPass(createAMDGPUPropagateAttributesEarlyPass(&TM));		addPass(createAMDGPUPropagateAttributesEarlyPass(&TM));

addPass(createAtomicExpandPass());		addPass(createAtomicExpandPass());


addPass(createAMDGPULowerIntrinsicsPass());		addPass(createAMDGPULowerIntrinsicsPass());

// Function calls are not supported, so make sure we inline everything.		// Function calls are not supported, so make sure we inline everything.
addPass(createAMDGPUAlwaysInlinePass());		addPass(createAMDGPUAlwaysInlinePass());
addPass(createAlwaysInlinerLegacyPass());		addPass(createAlwaysInlinerLegacyPass());
// We need to add the barrier noop pass, otherwise adding the function		// We need to add the barrier noop pass, otherwise adding the function
// inlining pass will cause all of the PassConfigs passes to be run		// inlining pass will cause all of the PassConfigs passes to be run
// one function at a time, which means if we have a nodule with two		// one function at a time, which means if we have a nodule with two
// functions, then we will generate code for the first function		// functions, then we will generate code for the first function
// without ever running any passes on the second.		// without ever running any passes on the second.
addPass(createBarrierNoopPass());		addPass(createBarrierNoopPass());

// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.		// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.
if (TM.getTargetTriple().getArch() == Triple::r600)		if (TM.getTargetTriple().getArch() == Triple::r600)
addPass(createR600OpenCLImageTypeLoweringPass());		addPass(createR600OpenCLImageTypeLoweringPass());

// Replace OpenCL enqueued block function pointers with global variables.		// Replace OpenCL enqueued block function pointers with global variables.
addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());		addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());

// Can increase LDS used by kernel so runs before PromoteAlloca		// Can increase LDS used by kernel so runs before PromoteAlloca
if (!DisableLowerModuleLDS)		if (EnableLowerModuleLDS) {
		addPass(createAMDGPUReplaceLDSUseWithPointerPass());
addPass(createAMDGPULowerModuleLDSPass());		addPass(createAMDGPULowerModuleLDSPass());
		}

if (TM.getOptLevel() > CodeGenOpt::None) {		if (TM.getOptLevel() > CodeGenOpt::None) {
addPass(createInferAddressSpacesPass());		addPass(createInferAddressSpacesPass());
addPass(createAMDGPUPromoteAlloca());		addPass(createAMDGPUPromoteAlloca());

if (EnableSROA)		if (EnableSROA)
addPass(createSROAPass());		addPass(createSROAPass());


llvm/lib/Target/AMDGPU/CMakeLists.txt

Show First 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	add_llvm_target(AMDGPUCodeGen
AMDGPUMIRFormatter.cpp		AMDGPUMIRFormatter.cpp
AMDGPUOpenCLEnqueuedBlockLowering.cpp		AMDGPUOpenCLEnqueuedBlockLowering.cpp
AMDGPUPostLegalizerCombiner.cpp		AMDGPUPostLegalizerCombiner.cpp
AMDGPUPreLegalizerCombiner.cpp		AMDGPUPreLegalizerCombiner.cpp
AMDGPUPromoteAlloca.cpp		AMDGPUPromoteAlloca.cpp
AMDGPUPropagateAttributes.cpp		AMDGPUPropagateAttributes.cpp
AMDGPURegBankCombiner.cpp		AMDGPURegBankCombiner.cpp
AMDGPURegisterBankInfo.cpp		AMDGPURegisterBankInfo.cpp
		AMDGPUReplaceLDSUseWithPointer.cpp
AMDGPURewriteOutArguments.cpp		AMDGPURewriteOutArguments.cpp
AMDGPUSubtarget.cpp		AMDGPUSubtarget.cpp
AMDGPUTargetMachine.cpp		AMDGPUTargetMachine.cpp
AMDGPUTargetObjectFile.cpp		AMDGPUTargetObjectFile.cpp
AMDGPUTargetTransformInfo.cpp		AMDGPUTargetTransformInfo.cpp
AMDGPUUnifyDivergentExitNodes.cpp		AMDGPUUnifyDivergentExitNodes.cpp
AMDGPUUnifyMetadata.cpp		AMDGPUUnifyMetadata.cpp
AMDGPUPerfHintAnalysis.cpp		AMDGPUPerfHintAnalysis.cpp

llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.h

This file was added.

				//===- AMDGPUGeneralUtils.h - general helper functions -*- C++ -*---------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// AMDGPU target related general helper utility functions.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPUGENERALUTILS_H
				#define LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPUGENERALUTILS_H

				#include "AMDGPU.h"
				JonChesterfield: Include list should be limited to those that are used by the header, with the ones used by the source included there.
				hsmhsm: Agree.
				#include "Utils/AMDGPUBaseInfo.h"
				#include "Utils/AMDGPUGeneralUtils.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/IR/Constants.h"
				#include "llvm/IR/DerivedTypes.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/InlineAsm.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Transforms/Utils/ModuleUtils.h"

				namespace llvm {

				// Check if `GV` is an LDS global, and lowering is required for it.
				bool isLDSLowereringRequired(GlobalVariable *GV,
				const SmallPtrSetImpl<GlobalValue *> &UsedList,
				bool IsLDSLoweringPass = false);

				// Get the required alignment for global variable `GV`.
				Align getAlign(const DataLayout &DL, const GlobalVariable *GV);

				// Get a list of all used global values in the module `M`.
				SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M);

				} // end namespace llvm

				#endif // LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPUGENERALUTILS_H

llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.cpp

This file was added.

				//===- AMDGPUGeneralUtils.cpp ---------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// AMDGPU target related general helper utility functions.
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPUGeneralUtils.h"

				using namespace llvm;

				namespace llvm {

				// Check if we can skip the lowering for current LDS global `GV`.
				static bool skipLowering(GlobalVariable *GV,
				JonChesterfield: Perhaps name the new files after LDS to make it clearer that they're used for LDS lowering and optimisation, not necessarily general purpose. Also move the functions out in a separate commit, without changes to their implementation, as that improves the signal/noise of the functional change.
				hsmhsm: Will think about it.
				const SmallPtrSetImpl<GlobalValue *> &UsedList,
				bool IsLDSLoweringPass) {
				bool UsedAsInitializer = false;
				bool UsedAsNonLLVMUsedInitializer = false;
				bool UsedWithinKernelFunction = false;
				bool UsedWithinNonKernelFunction = false;
				SmallPtrSet<User *, 8> VisitedUsers;

				// There are no users for `GV`, skip lowering for `GV`.
				SmallVector<User *, 8> UserStack(GV->users());
				if (UserStack.empty())
				return true;

				while (!UserStack.empty()) {
				auto *U = UserStack.pop_back_val();

				// `U` is already visited? continue to next one.
				if (!VisitedUsers.insert(U).second)
				continue;

				if (isa<GlobalVariable>(U)) {
				// `U` is a global variable, and `GV` is used as its initializer.
				UsedAsInitializer = true;
				if (!UsedList.contains(GV)) {
				// Used as initializer of normal globals apart from "llvm.used" or
				// "llvm.compiler.used".
				UsedAsNonLLVMUsedInitializer = true;
				}
				continue;
				}

				if (isa<Constant>(U)) {
				// `U` is `Constant`. Push-back users of `U`, and continue further
				// exploring the stack.
				append_range(UserStack, U->users());
				continue;
				}

				// `U` should be an instruction belonging to some function.
				JonChesterfield: e.g. I recognise this as newly introduced by the comment, but in phab it's hard to distinguish from things that haven't changed.
				hsmhsm: Not sure what you mean here. Let's discuss offline.
				auto *I = dyn_cast<Instruction>(U);
				assert(I && "Instruction expected.");
				auto *F = I->getFunction();
				if (AMDGPU::isModuleEntryFunctionCC(F->getCallingConv())) {
				UsedWithinKernelFunction = true;
				} else {
				UsedWithinNonKernelFunction = true;
				}
				}

				if (UsedWithinNonKernelFunction) {
				// `GV` is used within non-kernel function, it requires lowering.
				return false;
				}

				if (UsedAsInitializer) {
				// `GV` is used as an initializer of some global variable.
				if (UsedAsNonLLVMUsedInitializer) {
				// `GV` is used as global variable initializer of normal globals.
				if (!IsLDSLoweringPass) {
				// This is "LDS replace with pointer" pass, and let this pass make sure
				// that pointer variable is created for `GV`, and that pointer variable
				// is initialized with `GV` within all kernels.
				return false;
				} else {
				// "LDS replace with pointer" pass makes sure that a pointer variable is
				// created for `GV`, and that it is initialized with `GV` within all
				// kernels, which means that a per-kernel `GV` will be created, and
				// hence the "LDS lowering pass" does not need to touch it.
				return true;
				}
				} else {
				// `GV` is used as global variable initializer of "llvm.used" or
				// "llvm.compiler.used". Ignore lowering.
				return true;
				}
				}

				if (UsedWithinKernelFunction) {
				// `GV` is only used within kernel, it does not require lowering.
				return true;
				}

				// Ideally control should not reach here. If it does, we need to
				// re-examine the above logic.
				assert(false && "Internal error.");
				return true;
				}

				// Check if `GV` is an LDS global, and lowering is required for it.
				bool isLDSLowereringRequired(GlobalVariable *GV,
				const SmallPtrSetImpl<GlobalValue *> &UsedList,
				bool IsLDSLoweringPass) {
				if (GV->getType()->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS) {
				// Ignore addrspace other than 3.
				return false;
				}

				if (!GV->hasInitializer()) {
				// addrspace(3) without initializer implies CUDA/HIP extern __shared__. The
				// semantics for such a variable appear to be that all extern __shared__
				// variables alias one another, in which case this transform is not required.
				return false;
				}

				if (!isa<UndefValue>(GV->getInitializer())) {
				// Initializers are unimplemented for local address space. Leave such
				// variables in place for consistent error reporting.
				return false;
				}

				if (GV->isConstant()) {
				// A constant undef variable can't be written to, and any load is undef, so
				// it should be eliminated by the optimizer. It could be dropped by the back
				// end if not. This pass skips over it.
				return false;
				}

				if (skipLowering(GV, UsedList, IsLDSLoweringPass)) {
				// We can safely ignore the users of GV, hence lowering of GV is not
				// required.
				return false;
				}

				return true;
				}

				// Get the required alignment for global variable `GV`.
				Align getAlign(const DataLayout &DL, const GlobalVariable *GV) {
				return DL.getValueOrABITypeAlignment(GV->getPointerAlignment(DL),
				GV->getValueType());
				}

				// Get a list of all used global values in the module `M`.
				SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M) {
				SmallPtrSet<GlobalValue *, 32> UsedList;

				SmallVector<GlobalValue *, 32> TmpVec;
				collectUsedGlobalVariables(M, TmpVec, true);
				UsedList.insert(TmpVec.begin(), TmpVec.end());

				TmpVec.clear();
				collectUsedGlobalVariables(M, TmpVec, false);
				UsedList.insert(TmpVec.begin(), TmpVec.end());

				return UsedList;
				}

				} // end namespace llvm
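
The chain of early returns in `isLDSLowereringRequired` above can be read as a simple filter. The following is a minimal sketch of that filter on plain data, not on LLVM types: `GlobalInfo`, `needsLowering` and `LocalAddressSpace` are hypothetical names introduced only for illustration, and `UsersCanBeSkipped` stands in for whatever `skipLowering` returns.

```cpp
#include <cassert>

// Plain-data stand-in for the properties the checks read from a global
// variable. All names here are hypothetical; this is not the LLVM API.
struct GlobalInfo {
  unsigned AddressSpace;   // 3 is the local (LDS) address space in the patch
  bool HasInitializer;     // false models an `extern __shared__` style global
  bool InitializerIsUndef; // only undef initializers are handled
  bool IsConstant;         // constant undef globals are left to the optimizer
  bool UsersCanBeSkipped;  // stands in for the result of skipLowering()
};

constexpr unsigned LocalAddressSpace = 3;

// Mirrors the order of the early returns in the function above: every
// condition that disqualifies the global is checked before answering "yes".
bool needsLowering(const GlobalInfo &GV) {
  if (GV.AddressSpace != LocalAddressSpace)
    return false;
  if (!GV.HasInitializer)
    return false;
  if (!GV.InitializerIsUndef)
    return false;
  if (GV.IsConstant)
    return false;
  if (GV.UsersCanBeSkipped)
    return false;
  return true;
}
```

Under these assumptions, only a non-constant addrspace(3) global with an undef initializer whose users cannot be skipped is reported as needing lowering.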

llvm/lib/Target/AMDGPU/Utils/CMakeLists.txt

	add_llvm_component_library(LLVMAMDGPUUtils			add_llvm_component_library(LLVMAMDGPUUtils
	AMDGPUBaseInfo.cpp			AMDGPUBaseInfo.cpp
	AMDKernelCodeTUtils.cpp			AMDKernelCodeTUtils.cpp
	AMDGPUAsmUtils.cpp			AMDGPUAsmUtils.cpp
				AMDGPUGeneralUtils.cpp
	AMDGPUPALMetadata.cpp			AMDGPUPALMetadata.cpp

	LINK_COMPONENTS			LINK_COMPONENTS
	Core			Core
	MC			MC
	BinaryFormat			BinaryFormat
	Support			Support

	ADD_TO_COMPONENT			ADD_TO_COMPONENT
	AMDGPU			AMDGPU
	)			)

llvm/test/CodeGen/AMDGPU/GlobalISel/lds-global-non-entry-func.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -amdgpu-disable-lower-module-lds=true -o - %s 2> %t | FileCheck --check-prefix=GFX8 %s			; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -amdgpu-enable-lower-module-lds=false -o - %s 2> %t | FileCheck --check-prefix=GFX8 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-disable-lower-module-lds=true -o - %s 2> %t | FileCheck --check-prefix=GFX9 %s			; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-enable-lower-module-lds=false -o - %s 2> %t | FileCheck --check-prefix=GFX9 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	@lds = internal addrspace(3) global float undef, align 4			@lds = internal addrspace(3) global float undef, align 4

	; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function			; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function
	define void @func_use_lds_global() {			define void @func_use_lds_global() {
	; GFX8-LABEL: func_use_lds_global:			; GFX8-LABEL: func_use_lds_global:
	; GFX8: ; %bb.0:			; GFX8: ; %bb.0:
	▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/addrspacecast-initializer-unsupported.ll

	; RUN: not --crash llc -march=amdgcn -verify-machineinstrs -amdgpu-disable-lower-module-lds=true < %s 2>&1 | FileCheck -check-prefix=ERROR %s			; RUN: not --crash llc -march=amdgcn -verify-machineinstrs -amdgpu-enable-lower-module-lds=false < %s 2>&1 | FileCheck -check-prefix=ERROR %s

	; ERROR: LLVM ERROR: Unsupported expression in static initializer: addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*)			; ERROR: LLVM ERROR: Unsupported expression in static initializer: addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*)

	@lds.arr = unnamed_addr addrspace(3) global [256 x i32] undef, align 4			@lds.arr = unnamed_addr addrspace(3) global [256 x i32] undef, align 4

	@gv_flatptr_from_lds = unnamed_addr addrspace(2) global i32 addrspace(4)* getelementptr ([256 x i32], [256 x i32] addrspace(4)* addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*), i64 0, i64 8), align 4			@gv_flatptr_from_lds = unnamed_addr addrspace(2) global i32 addrspace(4)* getelementptr ([256 x i32], [256 x i32] addrspace(4)* addrspacecast ([256 x i32] addrspace(3)* @lds.arr to [256 x i32] addrspace(4)*), i64 0, i64 8), align 4

llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address-codegen.ll

	; RUN: llc -mtriple=amdgcn-amd-amdhsa -amdgpu-function-calls -amdgpu-stress-function-calls < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -amdgpu-function-calls -amdgpu-stress-function-calls -amdgpu-enable-lower-module-lds=false < %s | FileCheck -check-prefix=GCN %s
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -amdgpu-stress-function-calls < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -amdgpu-stress-function-calls -amdgpu-enable-lower-module-lds=false < %s | FileCheck -check-prefix=GCN %s
	; RUN: llc -mtriple=amdgcn-amd-amdhsa < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -amdgpu-enable-lower-module-lds=false < %s | FileCheck -check-prefix=GCN %s

	@lds0 = addrspace(3) global i32 undef, align 4			@lds0 = addrspace(3) global i32 undef, align 4

	; GCN-NOT: load_lds_simple			; GCN-NOT: load_lds_simple

	define internal i32 @load_lds_simple() {			define internal i32 @load_lds_simple() {
	%load = load i32, i32 addrspace(3)* @lds0, align 4			%load = load i32, i32 addrspace(3)* @lds0, align 4
	ret i32 %load			ret i32 %load
	Show All 10 Lines

llvm/test/CodeGen/AMDGPU/force-alwaysinline-lds-global-address.ll

	; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-always-inline %s | FileCheck --check-prefix=ALL %s			; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-always-inline -amdgpu-enable-lower-module-lds=false %s | FileCheck --check-prefix=ALL %s
	; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-always-inline %s | FileCheck --check-prefix=ALL %s			; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-always-inline -amdgpu-enable-lower-module-lds=false %s | FileCheck --check-prefix=ALL %s
	; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-stress-function-calls -amdgpu-always-inline %s | FileCheck --check-prefix=ALL %s			; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-stress-function-calls -amdgpu-always-inline -amdgpu-enable-lower-module-lds=false %s | FileCheck --check-prefix=ALL %s
	; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-stress-function-calls -passes=amdgpu-always-inline %s | FileCheck --check-prefix=ALL %s			; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-stress-function-calls -passes=amdgpu-always-inline -amdgpu-enable-lower-module-lds=false %s | FileCheck --check-prefix=ALL %s

	target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5"			target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5"

	@lds0 = addrspace(3) global i32 undef, align 4			@lds0 = addrspace(3) global i32 undef, align 4
	@lds1 = addrspace(3) global [512 x i32] undef, align 4			@lds1 = addrspace(3) global [512 x i32] undef, align 4
	@nested.lds.address = addrspace(1) global i32 addrspace(3)* @lds0, align 4			@nested.lds.address = addrspace(1) global i32 addrspace(3)* @lds0, align 4
	@gds0 = addrspace(2) global i32 undef, align 4			@gds0 = addrspace(2) global i32 undef, align 4

	▲ Show 20 Lines • Show All 84 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -o - -amdgpu-disable-lower-module-lds=true %s 2> %t | FileCheck -check-prefixes=GCN,GFX8 %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -o - -amdgpu-enable-lower-module-lds=false %s 2> %t | FileCheck -check-prefixes=GCN,GFX8 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - -amdgpu-disable-lower-module-lds=true %s 2> %t | FileCheck -check-prefixes=GCN,GFX9 %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - -amdgpu-enable-lower-module-lds=false %s 2> %t | FileCheck -check-prefixes=GCN,GFX9 %s
	; RUN: FileCheck -check-prefix=ERR %s < %t			; RUN: FileCheck -check-prefix=ERR %s < %t

	@lds = internal addrspace(3) global float undef, align 4			@lds = internal addrspace(3) global float undef, align 4

	; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function			; ERR: warning: <unknown>:0:0: in function func_use_lds_global void (): local memory global used by non-kernel function
	define void @func_use_lds_global() {			define void @func_use_lds_global() {
	; GFX8-LABEL: func_use_lds_global:			; GFX8-LABEL: func_use_lds_global:
	; GFX8: ; %bb.0:			; GFX8: ; %bb.0:
	Show All 33 Lines

llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s			; The pass - `amdgpu-lower-module-lds` should be run with its prerequisite pass `amdgpu-replace-lds-use-with-pointer`
	; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

				; The LDS globals @func and @both are used within non-kernel functions, but those functions are not called from any
				; kernel. Hence the pass `amdgpu-replace-lds-use-with-pointer` cannot help here, and the pass
				; `amdgpu-lower-module-lds` creates `float` members for them. This only happens in this test case; in practice
				; such non-called functions, and the LDS globals referenced within them, would have been eliminated as unused.
	; CHECK: %llvm.amdgcn.module.lds.t = type { float, float }			; CHECK: %llvm.amdgcn.module.lds.t = type { float, float }

	@func = addrspace(3) global float undef, align 4			@func = addrspace(3) global float undef, align 4

	; @kern is only used from a kernel so it is left unchanged			; @kern is only used from a kernel so it is left unchanged
	; CHECK: @kern = addrspace(3) global float undef, align 4			; CHECK: @kern = addrspace(3) global float undef, align 4
	@kern = addrspace(3) global float undef, align 4			@kern = addrspace(3) global float undef, align 4

	Show All 36 Lines

llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s			; The pass - `amdgpu-lower-module-lds` should be run with its prerequisite pass `amdgpu-replace-lds-use-with-pointer`
	; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

	; Variables that are not lowered by this pass are left unchanged			; Variables that are not lowered by this pass are left unchanged
	; CHECK-NOT: asm			; CHECK-NOT: asm
	; CHECK-NOT: llvm.amdgcn.module.lds			; CHECK-NOT: llvm.amdgcn.module.lds
	; CHECK-NOT: llvm.amdgcn.module.lds.t			; CHECK-NOT: llvm.amdgcn.module.lds.t

	; var1, var2 would be transformed were they used from a non-kernel function			; var1, var2 would be transformed were they used from a non-kernel function
	▲ Show 20 Lines • Show All 59 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/lower-module-lds-indirect.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s			; The pass - `amdgpu-lower-module-lds` should be run with its prerequisite pass `amdgpu-replace-lds-use-with-pointer`
	; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

	; CHECK: %llvm.amdgcn.module.lds.t = type { double, float }			; The original LDS globals - `@function_target` and `@kernel_target` are used as initializers of globals -
				; `@function_indirect` and `@kernel_indirect`, and they are not referenced directly anywhere else. The pass -
	; CHECK: @function_indirect = addrspace(1) global float* addrspacecast (float addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 1) to float*), align 8			; `amdgpu-replace-lds-use-with-pointer` makes sure that they are referenced within all kernels by assigning
				; their addresses to respective pointers within all kernels, and hence global initialization of `@function_indirect`
	; CHECK: @kernel_indirect = addrspace(1) global double* addrspacecast (double addrspace(3)* null to double*), align 8			; and `@kernel_indirect` is taken care of. Hence, the pass - `amdgpu-lower-module-lds` does not do any further
				; lowering here.
	; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 8			; CHECK-NOT: %llvm.amdgcn.module.lds.t

				; Original globals left unchanged.
				; CHECK: @function_target = addrspace(3) global float undef, align 4
				; CHECK: @function_indirect = addrspace(1) global float* addrspacecast (float addrspace(3)* @function_target to float*), align 8
				; CHECK: @kernel_target = addrspace(3) global double undef, align 8
				; CHECK: @kernel_indirect = addrspace(1) global double* addrspacecast (double addrspace(3)* @kernel_target to double*), align 8
	@function_target = addrspace(3) global float undef, align 4			@function_target = addrspace(3) global float undef, align 4
	@function_indirect = addrspace(1) global float* addrspacecast (float addrspace(3)* @function_target to float*), align 8			@function_indirect = addrspace(1) global float* addrspacecast (float addrspace(3)* @function_target to float*), align 8

	@kernel_target = addrspace(3) global double undef, align 8			@kernel_target = addrspace(3) global double undef, align 8
	@kernel_indirect = addrspace(1) global double* addrspacecast (double addrspace(3)* @kernel_target to double*), align 8			@kernel_indirect = addrspace(1) global double* addrspacecast (double addrspace(3)* @kernel_target to double*), align 8

	; CHECK-LABEL: @function(float %x)			; New pointers introduced by the pass - `amdgpu-replace-lds-use-with-pointer`
				; CHECK: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; No changes to function - @function
				; CHECK-LABEL: entry:
	; CHECK: %0 = load float*, float* addrspace(1)* @function_indirect, align 8			; CHECK: %0 = load float*, float* addrspace(1)* @function_indirect, align 8
				; CHECK: store float %x, float* %0, align 4
				; CHECK: ret void
	define void @function(float %x) local_unnamed_addr #5 {			define void @function(float %x) local_unnamed_addr #5 {
	entry:			entry:
	%0 = load float*, float* addrspace(1)* @function_indirect, align 8			%0 = load float*, float* addrspace(1)* @function_indirect, align 8
	store float %x, float* %0, align 4			store float %x, float* %0, align 4
	ret void			ret void
	}			}

	; CHECK-LABEL: @kernel(double %x)			; The LDS globals @function_target and @kernel_target are referenced within the kernel by storing their
	; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]			; addresses to the respective pointers.
	; CHECK: %0 = load double*, double* addrspace(1)* @kernel_indirect, align 8			; CHECK-LABEL: entry:
				; CHECK: %{{[0-9]+}} = ptrtoint {{[a-z]+}} addrspace(3)* @{{[a-z]+}}_target to i16
				; CHECK: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[12]}}, align 2
				; CHECK: %{{[0-9]+}} = ptrtoint {{[a-z]+}} addrspace(3)* @{{[a-z]+}}_target to i16
				; CHECK: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[12]}}, align 2
				; CHECK: %{{[0-9]+}} = load double*, double* addrspace(1)* @kernel_indirect, align 8
				; CHECK: store double %x, double* %{{[0-9]+}}, align 8
				; CHECK: ret void
	define amdgpu_kernel void @kernel(double %x) local_unnamed_addr #5 {			define amdgpu_kernel void @kernel(double %x) local_unnamed_addr #5 {
	entry:			entry:
	%0 = load double*, double* addrspace(1)* @kernel_indirect, align 8			%0 = load double*, double* addrspace(1)* @kernel_indirect, align 8
	store double %x, double* %0, align 8			store double %x, double* %0, align 8
	ret void			ret void
	}			}

llvm/test/CodeGen/AMDGPU/lower-module-lds-inline-asm-call.ll

This file was added.

				; The pass - `amdgpu-lower-module-lds` should be run with its prerequisite pass `amdgpu-replace-lds-use-with-pointer`
				; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

				; The pass - `amdgpu-replace-lds-use-with-pointer` cannot handle inline asm calls, and hence the
				; pass - `amdgpu-lower-module-lds` needs to handle the LDS global @func.
				; CHECK: %llvm.amdgcn.module.lds.t = type { i32 }

				; @func is only used from a non-kernel function so is rewritten
				; CHECK-NOT: @func
				@func = addrspace(3) global i32 undef, align 4

				; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 4
				; CHECK: @llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata"

				; CHECK-LABEL: @function()
				; CHECK: %0 = load i32, i32 addrspace(3)* null, align 4
				; CHECK: ret i32 %0
				define i32 @function() local_unnamed_addr {
				entry:
				%0 = load i32, i32 addrspace(3)* @func, align 4
				ret i32 %0
				}

				; CHECK-LABEL: @kernel()
				; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				; CHECK: call void asm sideeffect "", "~{v23}"()
				; CHECK: ret void
				define amdgpu_kernel void @kernel() {
				call void asm sideeffect "", "~{v23}"()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lower-module-lds-used-list.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s			; The pass - `amdgpu-lower-module-lds` should be run with its prerequisite pass `amdgpu-replace-lds-use-with-pointer`
	; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

	; Check new struct is added to compiler.used and that the replaced variable is removed			; The LDS global @tolower is used within a non-kernel function, but that function is not called from any kernel.
				; Hence the pass `amdgpu-replace-lds-use-with-pointer` cannot help here, and the pass `amdgpu-lower-module-lds`
				; creates a `float` member for it. This only happens in this test case; in practice such a non-called
				; function, and the LDS global referenced within it, would have been eliminated as unused.
	; CHECK: %llvm.amdgcn.module.lds.t = type { float }			; CHECK: %llvm.amdgcn.module.lds.t = type { float }
	; CHECK: @ignored = addrspace(1) global i64 0			; CHECK: @ignored = addrspace(1) global i64 0
	; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 8			; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 8

	; CHECK-NOT: @tolower			; CHECK-NOT: @tolower

	@tolower = addrspace(3) global float undef, align 8			@tolower = addrspace(3) global float undef, align 8

	Show All 24 Lines

llvm/test/CodeGen/AMDGPU/lower-module-lds.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s | FileCheck %s			; The pass - `amdgpu-lower-module-lds` should be run with its prerequisite pass `amdgpu-replace-lds-use-with-pointer`
	; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s | FileCheck %s

	; Padding to meet alignment, so references to @var1 replaced with gep ptr, 0, 2			; The LDS globals @var0 and @var1 are replaced by pointers, hence the two i16 members.
	; No i64 as addrspace(3) types with initializers are ignored. Likewise no addrspace(4).			; CHECK: %llvm.amdgcn.module.lds.t = type { i16, i16 }
	; CHECK: %llvm.amdgcn.module.lds.t = type { float, [4 x i8], i32 }
				; Original LDS globals
	; Variables removed by pass			; CHECK: @var0 = addrspace(3) global float undef, align 8
	; CHECK-NOT: @var0			; CHECK: @var1 = addrspace(3) global i32 undef, align 8
	; CHECK-NOT: @var1

	@var0 = addrspace(3) global float undef, align 8			@var0 = addrspace(3) global float undef, align 8
	@var1 = addrspace(3) global i32 undef, align 8			@var1 = addrspace(3) global i32 undef, align 8

				; Initializer @var1 is left untouched
				; CHECK: @ptr = addrspace(1) global i32 addrspace(3)* @var1, align 4
	@ptr = addrspace(1) global i32 addrspace(3)* @var1, align 4			@ptr = addrspace(1) global i32 addrspace(3)* @var1, align 4

	; A variable that is unchanged by pass			; A variable that is left untouched by the pass because initializers are unsupported for LDS globals
	; CHECK: @with_init = addrspace(3) global i64 0			; CHECK: @with_init = addrspace(3) global i64 0
	@with_init = addrspace(3) global i64 0			@with_init = addrspace(3) global i64 0

				; The two i16 pointers which are introduced by the pass `amdgpu-replace-lds-use-with-pointer` should be removed by the pass `amdgpu-lower-module-lds`.
				JonChesterfield: This test should only check the behaviour of lower-module-lds. Separate tests check the behaviour of amdgpu-replace-lds-use-with-pointer. Equally, running amdgpu-lower-module-lds by itself should not automatically run amdgpu-replace-lds-use-with-pointer and vice versa.
				; CHECK-NOT: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK-NOT: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2

	; Instance of new type, aligned to max of element alignment			; Instance of new type, aligned to max of element alignment
	; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 8			; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 2

	; Use in func rewritten to access struct at address zero, which prints as null			; Use in func rewritten to access struct at address zero, which prints as null
	; CHECK-LABEL: @func()			; CHECK-LABEL: @func()
	; CHECK: %dec = atomicrmw fsub float addrspace(3)* null, float 1.0			; CHECK: %1 = load i16, i16 addrspace(3)*
	; CHECK: %val0 = load i32, i32 addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 2), align 4			; CHECK: %2 = inttoptr i16 %1 to {{[a-z0-9]+}} addrspace(3)*
				; CHECK: %3 = load i16, i16 addrspace(3)*
				; CHECK: %4 = inttoptr i16 %3 to {{[a-z0-9]+}} addrspace(3)*
				; CHECK: %dec = atomicrmw fsub float addrspace(3)* %{{[0-9]+}}, float 1.000000e+00 monotonic, align 4
				; CHECK: %val0 = load i32, i32 addrspace(3)* %{{[0-9]+}}, align 4
	; CHECK: %val1 = add i32 %val0, 4			; CHECK: %val1 = add i32 %val0, 4
	; CHECK: store i32 %val1, i32 addrspace(3)* getelementptr (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* null, i32 0, i32 2), align 4			; CHECK: store i32 %val1, i32 addrspace(3)* %{{[0-9]+}}, align 4
	; CHECK: %unused0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 monotonic			; CHECK: %unused0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 monotonic, align 8
				; CHECK: ret void
	define void @func() {			define void @func() {
	%dec = atomicrmw fsub float addrspace(3)* @var0, float 1.0 monotonic			%dec = atomicrmw fsub float addrspace(3)* @var0, float 1.0 monotonic
	%val0 = load i32, i32 addrspace(3)* @var1, align 4			%val0 = load i32, i32 addrspace(3)* @var1, align 4
	%val1 = add i32 %val0, 4			%val1 = add i32 %val0, 4
	store i32 %val1, i32 addrspace(3)* @var1, align 4			store i32 %val1, i32 addrspace(3)* @var1, align 4
	%unused0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 monotonic			%unused0 = atomicrmw add i64 addrspace(3)* @with_init, i64 1 monotonic
	ret void			ret void
	}			}

	; This kernel calls a function that uses LDS so needs the block			; This kernel calls a function that uses LDS so needs the block
	; CHECK-LABEL: @kern_call()			; CHECK-LABEL: @kern_call()
	; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]			; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				; CHECK: %1 = ptrtoint {{[a-z0-9]+}} addrspace(3)* @var{{[01]}} to i16
				; CHECK: store i16 %1, i16 addrspace(3)*
				; CHECK: %2 = ptrtoint {{[a-z0-9]+}} addrspace(3)* @var{{[01]}} to i16
				; CHECK: store i16 %2, i16 addrspace(3)*
	; CHECK: call void @func()			; CHECK: call void @func()
	; CHECK: %dec = atomicrmw fsub float addrspace(3)* null, float 2.0			; CHECK: %dec = atomicrmw fsub float addrspace(3)* @var0, float 2.000000e+00 monotonic, align 4
				; CHECK: ret void
	define amdgpu_kernel void @kern_call() {			define amdgpu_kernel void @kern_call() {
	call void @func()			call void @func()
	%dec = atomicrmw fsub float addrspace(3)* @var0, float 2.0 monotonic			%dec = atomicrmw fsub float addrspace(3)* @var0, float 2.0 monotonic
	ret void			ret void
	}			}

	; This kernel does not need to alloc the LDS block as it makes no calls			; Though the kernel makes no calls, @var1 is used as an initializer, so the kernel still needs to alloc the LDS block.
	; CHECK-LABEL: @kern_empty()			; CHECK-LABEL: @kern_empty()
	; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]			; CHECK: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				; CHECK: %1 = ptrtoint i32 addrspace(3)* @var1 to i16
				; CHECK: store i16 %1, i16 addrspace(3)*
				; CHECK: ret void
	define spir_kernel void @kern_empty() {			define spir_kernel void @kern_empty() {
	ret void			ret void
	}			}
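
The updated CHECK lines in this test show the two halves of the replacement: a kernel converts the address of an LDS global to `i16` and stores it into a pointer slot, and the non-kernel function loads that `i16` back and converts it to a pointer before the access. The following is a minimal sketch of that round trip on a host-side model, assuming LDS can be treated as a flat byte buffer addressed by 16-bit offsets (which the `i16` pointer type in the tests suggests). `LDSModel`, `publishAddress` and `loadFloatThroughPointer` are hypothetical names and are not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model: LDS is a flat byte buffer and an address within it is
// a 16-bit offset. None of these names come from the patch.
class LDSModel {
public:
  explicit LDSModel(std::size_t Size) : Bytes(Size, 0) {}

  template <typename T> void store(std::uint16_t Offset, T Value) {
    assert(Offset + sizeof(T) <= Bytes.size() && "store out of bounds");
    std::memcpy(Bytes.data() + Offset, &Value, sizeof(T));
  }

  template <typename T> T load(std::uint16_t Offset) const {
    assert(Offset + sizeof(T) <= Bytes.size() && "load out of bounds");
    T Value;
    std::memcpy(&Value, Bytes.data() + Offset, sizeof(T));
    return Value;
  }

private:
  std::vector<std::uint8_t> Bytes;
};

// Kernel side: mirrors `ptrtoint ... to i16` followed by `store i16`.
void publishAddress(LDSModel &LDS, std::uint16_t PointerSlot,
                    std::uint16_t VarOffset) {
  LDS.store<std::uint16_t>(PointerSlot, VarOffset);
}

// Non-kernel side: mirrors `load i16`, `inttoptr`, then the actual access.
float loadFloatThroughPointer(const LDSModel &LDS, std::uint16_t PointerSlot) {
  std::uint16_t Address = LDS.load<std::uint16_t>(PointerSlot);
  return LDS.load<float>(Address);
}
```

In this model the function never names the variable itself; it only needs the 2-byte slot, which is why the struct created by the lowering pass in this test holds two `i16` members instead of the original `float` and `i32`.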

llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll

	; RUN: opt -S -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s			; RUN: opt -S -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s
	; RUN: llc -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-disable-lower-module-lds=true < %s | FileCheck -check-prefix=ASM %s			; RUN: llc -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-enable-lower-module-lds=false < %s | FileCheck -check-prefix=ASM %s

	target datalayout = "A5"			target datalayout = "A5"

	@all_lds = internal unnamed_addr addrspace(3) global [16384 x i32] undef, align 4			@all_lds = internal unnamed_addr addrspace(3) global [16384 x i32] undef, align 4
	@some_lds = internal unnamed_addr addrspace(3) global [32 x i32] undef, align 4			@some_lds = internal unnamed_addr addrspace(3) global [32 x i32] undef, align 4

	@initializer_user_some = addrspace(1) global i32 ptrtoint ([32 x i32] addrspace(3)* @some_lds to i32), align 4			@initializer_user_some = addrspace(1) global i32 ptrtoint ([32 x i32] addrspace(3)* @some_lds to i32), align 4
	@initializer_user_all = addrspace(1) global i32 ptrtoint ([16384 x i32] addrspace(3)* @all_lds to i32), align 4			@initializer_user_all = addrspace(1) global i32 ptrtoint ([16384 x i32] addrspace(3)* @all_lds to i32), align 4
	▲ Show 20 Lines • Show All 155 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace_lds_report_error_no_func_def.ll

This file was added.

				; RUN: not --crash opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s 2>&1 | FileCheck -check-prefix=ERROR %s

				; ERROR: LLVM ERROR: The pass "Replace LDS Use With Pointer" assumes that the definitions of both caller and callee appear within same module. But, definition for the callee "callee_1" not available.
				JonChesterfield: This is an error in the implementation, not something that should have a test checking the implementation is broken. Instead of assuming the definition of both are in the same module and crashing if they aren't, the pass should ignore a variable which doesn't meet that requirement.

				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 16

				declare hidden void @callee_1() local_unnamed_addr

				define internal void @callee_2() {
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				entry:
				call void @callee_1()
				call void @callee_2()
				ret void
				}
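
The reviewer's suggestion for this test is to skip rather than abort when a callee has no definition in the module. A minimal sketch of that policy on plain data follows. `Callee` and `canAnalyseCallees` are hypothetical names used only for illustration, and the real pass would have to make this decision on LLVM functions rather than on this struct.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record for a function called from a kernel.
struct Callee {
  std::string Name;
  bool HasDefinition; // false models a bare `declare`, as for @callee_1
};

// Returns true only when every callee can be inspected. A false result is
// meant to be read as "leave the LDS global untouched", not as a fatal error.
bool canAnalyseCallees(const std::vector<Callee> &Callees) {
  return std::all_of(Callees.begin(), Callees.end(),
                     [](const Callee &C) { return C.HasDefinition; });
}
```

For the module in this test, the callees of `@kernel_1` would be `callee_1` (declared only) and `callee_2` (defined), so the check reports false and the variable would be ignored instead of triggering the `LLVM ERROR`.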

llvm/test/CodeGen/AMDGPU/replace_lds_test_direct_call_diamond_shape.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void @function_4() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: call void @function_4()
				; GCN-NEXT: ret void
				entry:
				call void @function_4()
				ret void
				}

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: call void @function_4()
				; GCN-NEXT: ret void
				entry:
				call void @function_4()
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: call void @function_2()
				; GCN: call void @function_3()
				; GCN-NEXT: ret void
				entry:
				call void @function_2()
				call void @function_3()
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_1 to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: call void @function_1()
				; GCN-NEXT: ret void
				entry:
				call void @function_1()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: ret void
				entry:
				ret void
				}
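
In the diamond-shaped test above, only `@kernel_1` receives the `ptrtoint`/`store` pair, because it is the only kernel from which `@function_4` is reachable, and `@function_4` is reached along two paths. A minimal sketch of one way to compute that set of kernels follows, using a worklist traversal over a call graph keyed by function name. `CallGraph` and `kernelsNeedingPointerInit` are hypothetical names, and this is not claimed to be how the patch itself walks the call graph.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Caller name -> names of the functions it calls directly (hypothetical model).
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Returns the kernels from which at least one function in `LDSUsers` is
// reachable through direct calls.
std::set<std::string>
kernelsNeedingPointerInit(const CallGraph &Calls,
                          const std::vector<std::string> &Kernels,
                          const std::set<std::string> &LDSUsers) {
  std::set<std::string> Result;
  for (const std::string &Kernel : Kernels) {
    std::set<std::string> Visited;
    std::vector<std::string> Worklist{Kernel};
    while (!Worklist.empty()) {
      std::string Fn = Worklist.back();
      Worklist.pop_back();
      if (!Visited.insert(Fn).second)
        continue; // already seen via the other side of the diamond
      if (LDSUsers.count(Fn)) {
        Result.insert(Kernel);
        break;
      }
      auto It = Calls.find(Fn);
      if (It != Calls.end())
        Worklist.insert(Worklist.end(), It->second.begin(), It->second.end());
    }
  }
  return Result;
}
```

The `Visited` set is what keeps the shared callee from being processed twice, and it also keeps the traversal finite if the call graph contains recursion.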

llvm/test/CodeGen/AMDGPU/replace_lds_test_direct_call_misc.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.3 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				JonChesterfield: These tests would be more robust if the new pointer were named based on the global it is intended to reference, as then the regex can check that we created a load from the correct pointer (as opposed to just one of the new pointers).
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				ret void
				}

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_3() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[13]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[13]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: call void @function_3()
				; GCN-NEXT: call void @function_1()
				; GCN-NEXT: ret void
				entry:
				call void @function_3()
				call void @function_1()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[23]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[23]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: call void @function_2()
				; GCN-NEXT: call void @function_3()
				; GCN-NEXT: ret void
				entry:
				call void @function_2()
				call void @function_3()
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[12]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[12]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: call void @function_1()
				; GCN-NEXT: call void @function_2()
				; GCN-NEXT: ret void
				entry:
				call void @function_1()
				call void @function_2()
				ret void
				}
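
As a hedged illustration of the inline comment on `function_3` in this file: if each pointer were named after the LDS global it stands for, the checks could pin the exact pointer instead of accepting any numeric suffix. The name `@llvm.amdgcn.lds.pointer.lds_global_3` below is hypothetical. This revision emits numbered pointers (`.1`, `.2`, `.3`), so the fragment only sketches how the check lines might tighten under that assumed naming scheme.

```llvm
; Assumption: the pass names the pointer after the global it replaces.
; This naming is NOT what the revision under review produces.
; POINTER: @llvm.amdgcn.lds.pointer.lds_global_3 = internal unnamed_addr addrspace(3) global i16 undef, align 2

define internal void @function_3() {
; GCN-LABEL: entry:
; The load is now checked against one specific pointer, not a {{[0-9]+}} wildcard.
; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.lds_global_3, align 2
; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
; GCN-NEXT: ret void
entry:
  %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
  ret void
}
```

With the numbered scheme, a check such as `@llvm.amdgcn.lds.pointer.{{[0-9]+}}` would still pass if `function_3` loaded from the pointer created for `@lds_global_1`. A name derived from the global would make that mismatch visible.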

llvm/test/CodeGen/AMDGPU/replace_lds_test_ignored_lds.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; Ignore LDS lds_global_1 because it is dynamic lds.
				; Ignore LDS lds_global_2 because it is used only within kernel.
				; Ignore LDS lds_global_3 because it is used only within a non-kernel function that is never called.
				; Ignore LDS lds_global_4 because it is used within non-kernel function but is not reachable due to inline asm call.

				; LDS: @lds_global_1 = external addrspace(3) global [0 x i32], align 4
				; LDS: @lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_4 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER-NOT: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER-NOT: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER-NOT: @llvm.amdgcn.lds.pointer.3 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER-NOT: @llvm.amdgcn.lds.pointer.4 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@lds_global_1 = external addrspace(3) global [0 x i32], align 4
				@lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_4 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %gep = getelementptr inbounds [0 x i32], [0 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [0 x i32], [0 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: call void @function_1()
				; GCN-NEXT: ret void
				entry:
				call void @function_1()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				ret void
				}

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_3() {
				; GCN-LABEL: entry:
				; GCN: ret void
				entry:
				ret void
				}

				define internal void @function_4() {
				; GCN-LABEL: entry:
				; GCN: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_4, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_4, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_4() {
				; GCN-LABEL: entry:
				; GCN: call void asm sideeffect "", "~{v23}"()
				; GCN-NEXT: ret void
				entry:
				call void asm sideeffect "", "~{v23}"()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_diamond_shape.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=FPTR,LDS,POINTER,GCN %s

				; FPTR: @gv.fptr4 = internal local_unnamed_addr externally_initialized global void ()* @function_4, align 8
				; FPTR: @gv.fptr3 = internal local_unnamed_addr externally_initialized global void ()* @function_3, align 8
				; FPTR: @gv.fptr2 = internal local_unnamed_addr externally_initialized global void ()* @function_2, align 8
				; FPTR: @gv.fptr1 = internal local_unnamed_addr externally_initialized global void ()* @function_1, align 8
				; LDS: @lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@gv.fptr4 = internal local_unnamed_addr externally_initialized global void ()* @function_4, align 8
				@gv.fptr3 = internal local_unnamed_addr externally_initialized global void ()* @function_3, align 8
				@gv.fptr2 = internal local_unnamed_addr externally_initialized global void ()* @function_2, align 8
				@gv.fptr1 = internal local_unnamed_addr externally_initialized global void ()* @function_1, align 8
				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void @function_4() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: %fptr4 = load void ()*, void ()** @gv.fptr4, align 8
				; GCN-NEXT: call void %fptr4()
				; GCN-NEXT: ret void
				entry:
				%fptr4 = load void ()*, void ()** @gv.fptr4, align 8
				call void %fptr4()
				ret void
				}

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %fptr4 = load void ()*, void ()** @gv.fptr4, align 8
				; GCN-NEXT: call void %fptr4()
				; GCN-NEXT: ret void
				entry:
				%fptr4 = load void ()*, void ()** @gv.fptr4, align 8
				call void %fptr4()
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %fptr2 = load void ()*, void ()** @gv.fptr2, align 8
				; GCN-NEXT: %fptr3 = load void ()*, void ()** @gv.fptr3, align 8
				; GCN-NEXT: call void %fptr2()
				; GCN-NEXT: call void %fptr3()
				; GCN-NEXT: ret void
				entry:
				%fptr2 = load void ()*, void ()** @gv.fptr2, align 8
				%fptr3 = load void ()*, void ()** @gv.fptr3, align 8
				call void %fptr2()
				call void %fptr3()
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_1 to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: %fptr1 = load void ()*, void ()** @gv.fptr1, align 8
				; GCN-NEXT: call void %fptr1()
				; GCN-NEXT: ret void
				entry:
				%fptr1 = load void ()*, void ()** @gv.fptr1, align 8
				call void %fptr1()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: ret void
				entry:
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_misc.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; FPTR: @gv.fptr3 = internal local_unnamed_addr externally_initialized global void ()* @function_3, align 8
				; FPTR: @gv.fptr2 = internal local_unnamed_addr externally_initialized global void ()* @function_2, align 8
				; FPTR: @gv.fptr1 = internal local_unnamed_addr externally_initialized global void ()* @function_1, align 8
				; LDS: @lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.3 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@gv.fptr3 = internal local_unnamed_addr externally_initialized global void ()* @function_3, align 8
				@gv.fptr2 = internal local_unnamed_addr externally_initialized global void ()* @function_2, align 8
				@gv.fptr1 = internal local_unnamed_addr externally_initialized global void ()* @function_1, align 8
				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				ret void
				}

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_3() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %2 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %2, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %fptr3 = load void ()*, void ()** @gv.fptr3, align 8
				; GCN-NEXT: %fptr1 = load void ()*, void ()** @gv.fptr1, align 8
				; GCN-NEXT: call void %fptr3()
				; GCN-NEXT: call void %fptr1()
				; GCN-NEXT: ret void
				entry:
				%fptr3 = load void ()*, void ()** @gv.fptr3, align 8
				%fptr1 = load void ()*, void ()** @gv.fptr1, align 8
				call void %fptr3()
				call void %fptr1()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %2 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %2, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %fptr2 = load void ()*, void ()** @gv.fptr2, align 8
				; GCN-NEXT: %fptr3 = load void ()*, void ()** @gv.fptr3, align 8
				; GCN-NEXT: call void %fptr2()
				; GCN-NEXT: call void %fptr3()
				; GCN-NEXT: ret void
				entry:
				%fptr2 = load void ()*, void ()** @gv.fptr2, align 8
				%fptr3 = load void ()*, void ()** @gv.fptr3, align 8
				call void %fptr2()
				call void %fptr3()
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %2 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %2, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %fptr1 = load void ()*, void ()** @gv.fptr1, align 8
				; GCN-NEXT: %fptr2 = load void ()*, void ()** @gv.fptr2, align 8
				; GCN-NEXT: call void %fptr1()
				; GCN-NEXT: call void %fptr2()
				; GCN-NEXT: ret void
				entry:
				%fptr1 = load void ()*, void ()** @gv.fptr1, align 8
				%fptr2 = load void ()*, void ()** @gv.fptr2, align 8
				call void %fptr1()
				call void %fptr2()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_misc2.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.3 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void ()* @return_function_3() {
				; GCN-LABEL: entry:
				; GCN: ret void ()* @function_3
				entry:
				ret void ()* @function_3
				}

				define internal void ()* @return_function_2() {
				; GCN-LABEL: entry:
				; GCN: ret void ()* @function_2
				entry:
				ret void ()* @function_2
				}

				define internal void ()* @return_function_1() {
				; GCN-LABEL: entry:
				; GCN: ret void ()* @function_1
				entry:
				ret void ()* @function_1
				}

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				ret void
				}

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_3() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %2 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %2, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %fptr3 = call void ()* @return_function_3()
				; GCN-NEXT: %fptr1 = call void ()* @return_function_1()
				; GCN-NEXT: call void %fptr3()
				; GCN-NEXT: call void %fptr1()
				; GCN-NEXT: ret void
				entry:
				%fptr3 = call void ()* @return_function_3()
				%fptr1 = call void ()* @return_function_1()
				call void %fptr3()
				call void %fptr1()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %2 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %2, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %fptr2 = call void ()* @return_function_2()
				; GCN-NEXT: %fptr3 = call void ()* @return_function_3()
				; GCN-NEXT: call void %fptr2()
				; GCN-NEXT: call void %fptr3()
				; GCN-NEXT: ret void
				entry:
				%fptr2 = call void ()* @return_function_2()
				%fptr3 = call void ()* @return_function_3()
				call void %fptr2()
				call void %fptr3()
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %2 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %2, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %fptr1 = call void ()* @return_function_1()
				; GCN-NEXT: %fptr2 = call void ()* @return_function_2()
				; GCN-NEXT: call void %fptr1()
				; GCN-NEXT: call void %fptr2()
				; GCN-NEXT: ret void
				entry:
				%fptr1 = call void ()* @return_function_1()
				%fptr2 = call void ()* @return_function_2()
				call void %fptr1()
				call void %fptr2()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_no_addr_taken.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER-NOT: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER-NOT: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER-NOT: @llvm.amdgcn.lds.pointer.3 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				ret void
				}

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_3() {
				; GCN-LABEL: entry:
				; GCN: %alloca = alloca void ()*
				; GCN-NEXT: %fptr = load void ()*, void ()** %alloca
				; GCN-NEXT: call void %fptr()
				; GCN-NEXT: ret void
				entry:
				%alloca = alloca void ()*
				%fptr = load void ()*, void ()** %alloca
				call void %fptr()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: %alloca = alloca void ()*
				; GCN-NEXT: %fptr = load void ()*, void ()** %alloca
				; GCN-NEXT: call void %fptr()
				; GCN-NEXT: ret void
				entry:
				%alloca = alloca void ()*
				%fptr = load void ()*, void ()** %alloca
				call void %fptr()
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %alloca = alloca void ()*
				; GCN-NEXT: %fptr = load void ()*, void ()** %alloca
				; GCN-NEXT: call void %fptr()
				; GCN-NEXT: ret void
				entry:
				%alloca = alloca void ()*
				%fptr = load void ()*, void ()** %alloca
				call void %fptr()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_indirect_call_no_init.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; FPTR: @gv.fptr3 = internal local_unnamed_addr externally_initialized global void ()* @function_3, align 8
				; FPTR: @gv.fptr2 = internal local_unnamed_addr externally_initialized global void ()* @function_2, align 8
				; FPTR: @gv.fptr1 = internal local_unnamed_addr externally_initialized global void ()* @function_1, align 8
				; LDS: @lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				; LDS: @lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.3 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@gv.fptr3 = internal local_unnamed_addr externally_initialized global void ()* @function_3, align 8
				@gv.fptr2 = internal local_unnamed_addr externally_initialized global void ()* @function_2, align 8
				@gv.fptr1 = internal local_unnamed_addr externally_initialized global void ()* @function_1, align 8
				@lds_global_1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_2 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_global_3 = internal addrspace(3) global [1 x i32] undef, align 4

				define internal void @function_3() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_3, i32 0, i32 0
				ret void
				}

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_2, i32 0, i32 0
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [1 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %1, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %2 = ptrtoint [1 x i32] addrspace(3)* @lds_global_{{[1-3]}} to i16
				; GCN-NEXT: store i16 %2, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %alloca = alloca void ()*
				; GCN-NEXT: %fptr = load void ()*, void ()** %alloca
				; GCN-NEXT: call void %fptr()
				; GCN-NEXT: ret void
				entry:
				%alloca = alloca void ()*
				%fptr = load void ()*, void ()** %alloca
				call void %fptr()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_llvm_insts.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @smem_ptr = hidden addrspace(3) global i32* undef, align 8
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@smem_ptr = hidden addrspace(3) global i32* undef, align 8

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to i32* addrspace(3)*
				; GCN-NEXT: %2 = addrspacecast i32* addrspace(3)* %1 to i32**
				; GCN-NEXT: %ptr = load i32*, i32** %2, align 8
				; GCN-NEXT: %res1 = atomicrmw add i32* %ptr, i32 8 acquire, align 4
				; GCN-NEXT: %res2 = cmpxchg i32* %ptr, i32 8, i32 16 acq_rel monotonic, align 4
				; GCN-NEXT: ret void
				entry:
				%ptr = load i32*, i32** addrspacecast (i32* addrspace(3)* @smem_ptr to i32**), align 8
				%res1 = atomicrmw add i32* %ptr, i32 8 acquire
				%res2 = cmpxchg i32* %ptr, i32 8, i32 16 acq_rel monotonic
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint i32* addrspace(3)* @smem_ptr to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: call void @function_1()
				; GCN-NEXT: ret void
				entry:
				call void @function_1()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_types_misc.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @lds_global_1 = internal addrspace(3) global [65 x i32] undef, align 16
				; LDS: @lds_global_2 = internal addrspace(3) global [65 x i16] undef, align 16
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@lds_global_1 = internal addrspace(3) global [65 x i32] undef, align 16
				@lds_global_2 = internal addrspace(3) global [65 x i16] undef, align 16

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [65 x i16] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [65 x i16], [65 x i16] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [65 x i16], [65 x i16] addrspace(3)* @lds_global_2, i32 0, i32 0
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to [65 x i32] addrspace(3)*
				; GCN-NEXT: %gep = getelementptr inbounds [65 x i32], [65 x i32] addrspace(3)* %1, i32 0, i32 0
				; GCN-NEXT: ret void
				entry:
				%gep = getelementptr inbounds [65 x i32], [65 x i32] addrspace(3)* @lds_global_1, i32 0, i32 0
				ret void
				}

				; GCN: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				define protected amdgpu_kernel void @kernel_1() {
				entry:
				call void @function_1()
				call void @function_2()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_types_pointers.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @smem_ptr = hidden addrspace(3) global i32* undef, align 8
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@smem_ptr = hidden addrspace(3) global i32* undef, align 8

				define internal void @function_2() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to i32* addrspace(3)*
				; GCN-NEXT: %2 = addrspacecast i32* addrspace(3)* %1 to i32**
				; GCN-NEXT: %3 = load i32*, i32** %2, align 8
				; GCN-NEXT: %4 = addrspacecast i32* addrspace(3)* %1 to i32**
				; GCN-NEXT: store i32* %3, i32** %4, align 8
				; GCN-NEXT: ret void
				entry:
				%0 = load i32*, i32** addrspacecast (i32* addrspace(3)* @smem_ptr to i32**), align 8
				store i32* %0, i32** addrspacecast (i32* addrspace(3)* @smem_ptr to i32**), align 8
				ret void
				}

				define internal void @function_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: %1 = inttoptr i16 %0 to i32* addrspace(3)*
				; GCN-NEXT: %2 = addrspacecast i32* addrspace(3)* %1 to i32**
				; GCN-NEXT: %3 = load i32*, i32** %2, align 8
				; GCN-NEXT: %4 = addrspacecast i32* addrspace(3)* %1 to i32**
				; GCN-NEXT: store i32* %3, i32** %4, align 8
				; GCN-NEXT: ret void
				entry:
				%0 = load i32*, i32** addrspacecast (i32* addrspace(3)* @smem_ptr to i32**), align 8
				store i32* %0, i32** addrspacecast (i32* addrspace(3)* @smem_ptr to i32**), align 8
				ret void
				}

				define protected amdgpu_kernel void @kernel_1() {
				; GCN-LABEL: entry:
				; GCN: %0 = ptrtoint i32* addrspace(3)* @smem_ptr to i16
				; GCN-NEXT: store i16 %0, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.1, align 2
				; GCN-NEXT: call void @function_1()
				; GCN-NEXT: call void @function_2()
				; GCN-NEXT: ret void
				entry:
				call void @function_1()
				call void @function_2()
				ret void
				}

				define protected amdgpu_kernel void @kernel_2() {
				; GCN-LABEL: entry:
				; GCN: ret void
				entry:
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace_lds_test_types_pointers_misc.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck -check-prefixes=LDS,POINTER,GCN %s

				; LDS: @smem = hidden addrspace(3) global i32 undef, align 4
				; LDS: @smem_ptr = hidden addrspace(3) global i32* undef, align 8
				; LDS: @smem_ptr_ptr = hidden local_unnamed_addr addrspace(3) global i32** undef, align 8
				; LDS: @smem_arr = hidden addrspace(3) global [1 x i32] undef, align 4
				; LDS: @smem_ptr2 = hidden local_unnamed_addr addrspace(3) global i32* undef, align 8
				; POINTER: @llvm.amdgcn.lds.pointer.1 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.2 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.3 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.4 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER: @llvm.amdgcn.lds.pointer.5 = internal unnamed_addr addrspace(3) global i16 undef, align 2
				@smem = hidden addrspace(3) global i32 undef, align 4
				@smem_ptr = hidden addrspace(3) global i32* undef, align 8
				@smem_ptr_ptr = hidden local_unnamed_addr addrspace(3) global i32** undef, align 8
				@smem_arr = hidden addrspace(3) global [1 x i32] undef, align 4
				@smem_ptr2 = hidden local_unnamed_addr addrspace(3) global i32* undef, align 8

				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				define internal void @function_2() {
				entry:
				store i32* addrspacecast (i32 addrspace(3)* @smem to i32*), i32* addrspace(3)* @smem_ptr, align 8
				store i32** addrspacecast (i32* addrspace(3)* @smem_ptr to i32**), i32** addrspace(3)* @smem_ptr_ptr, align 8
				%0 = load i32, i32 addrspace(3)* @smem, align 4
				ret void
				}

				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				define internal void @function_1() {
				entry:
				store i32* addrspacecast (i32 addrspace(3)* @smem to i32*), i32* addrspace(3)* @smem_ptr, align 8
				store i32** addrspacecast (i32* addrspace(3)* @smem_ptr to i32**), i32** addrspace(3)* @smem_ptr_ptr, align 8
				%0 = load i32, i32 addrspace(3)* @smem, align 4
				ret void
				}

				; GCN: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				define protected amdgpu_kernel void @kernel_1() {
				entry:
				call void @function_1()
				call void @function_2()
				ret void
				}

				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: %{{[0-9]+}} = load i16, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				define internal void @function_3() {
				entry:
				store i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([1 x i32], [1 x i32] addrspace(3)* @smem_arr, i32 0, i32 0) to i32*), i32* addrspace(3)* @smem_ptr2, align 8
				ret void
				}

				; GCN: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				; GCN: store i16 %{{[0-9]+}}, i16 addrspace(3)* @llvm.amdgcn.lds.pointer.{{[0-9]+}}, align 2
				define protected amdgpu_kernel void @kernel_2() {
				entry:
				call void @function_3()
				ret void
				}

Unit Tests: Failed

Diff 332541