This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Replace uses of LDS globals within non-kernel functions by pointers.
AbandonedPublic

Authored by hsmhsm on Nov 15 2020, 11:09 PM.

Details

Summary

One of the memory types being supported within AMD GPU memory hierarchy is
shared memory, also called Local Data Share or LDS for short. LDS memory
is the second fastest memory in the AMD GPU memory hierarchy (with register
file being fastest available memory in the hierarchy). Being faster also
means LDS memory is comparatively costlier and hence is a limited available
memory resource.

Being global scoped, an LDS variable is accessible within kernel functions
and non-kernel functions, but two different kernel execution paths, say
called from two kernels K1 and K2, cannot access the same instance of an LDS
variable, say L. Both K1 and K2 has to own its own instance of L. This puts
some challenges, especially to lower the LDS variables used within non-kernel
functions.

So, the pass - "Lower Module LDS" lowers the LDS globals by packing them
within in a struct type, and by creating an instance of that struct type
within every kerenl at address zero. Though, the pass - "Lower Module LDS"
makes some effort to minimize unnecessary LDS allocation, it is limited by
means of the fundamental basis and assumption upon which the pass is
implemented.

The current pass acts as an helping aid to the pass - "Lower Module LDS" with
the intention of minimizing unnecessary LDS allocation as much as possible.

The main idea behind the current pass is:

(1) To identify the LDS globals used within non-kernel function scope and
global scope.
(2) To push the use of all the above identified LDS globals to kernel
function scope by initializing their addresses to newly created LDS
global pointer variables (within kernel functions).
(3) To replace the uses of original LDS globals within non-kernel functions
by their pointer counter-parts.
(4) This way, the transformation makes sure that the pass "Lower Module LDS"
packs only pointer variables within struct type, and hence significantly
minimizes unnecessary LDS allocation, espacically when the original LDS
globals are big arrays (as this is the common LDS use case).

NOTE: The pass - "Lower Module LDS" now has a tight dependency on the current pass, and the current pass should always be run before running the pass "Lower Module LDS". Running the pass "Lower Module LDS" alone may lead to surprizing results.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
hsmhsm updated this revision to Diff 317500.Jan 19 2021, 2:25 AM

[0] Started re-implementing from scratch again.
[1] Added a new pass, namely, amdgpu-lower-function-local-lds.
[2] Implemented required initial plumbing work for both old and new pass

managers.

[3] An option, namely, amdgpu-enable-function-local-lds-lowering is

added, when passed, it enables the pass.
hsmhsm retitled this revision from [AMDGPU] Support for device scope shared variables to [AMDGPU][WIP] Lower Function Local LDS Variables..Jan 19 2021, 2:27 AM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm edited the summary of this revision. (Show Details)

Started to implement the feature from scratch again. The previous experience tells me that - "a single very big patch is very problamatic and confusing for a meanigful review process". Hence this time, I am planning to submit small patches (time to time) which can be reasonably reviewed. This first patch implements the following.

[1] Add new pass, namely, amdgpu-lower-function-local-lds.
[2] Implement required initial plumbing work for both old and new pass managers.
[3] Add an option, namely, amdgpu-enable-function-local-lds-lowering, when passed, it enables the pass.

You can just use the done checkbox, you don't need to comment on each point

llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
1510 ↗(On Diff #313659)

You should only try to preserve things that are important, otherwise you are adding cost and complexity for no benefit

1512 ↗(On Diff #313659)

The IR is language independent and none of the constructs here are tied to a language

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
38

This isn't a user exposed flag, and there shouldn't be a need for users to set one.

hsmhsm marked 3 inline comments as done.Jan 19 2021, 7:31 PM
hsmhsm updated this revision to Diff 317763.EditedJan 19 2021, 8:53 PM

Based on the Matt's comment for previous patch, some changes are done w.r.t the handling of the guarding flag - amdgpu-enable-function-local-lds-lowering.

Though Matt is against the usage of any guarding flag for this pass, I personally feel the need of it for below two reasons.

(1) Presence of this pass means, we should disable the forcefull inlining as it is done within the pass - amdgpu-always-inline. Otherwise, this pass does not make any sense at all. It is a better idea to disable this forcefull inlining via a flag.

(2) In case of any emergency issue within this pass, customer should have an handy approach to disable the pass and temporarily move on until the fix available.

So the amdgpu-enable-function-local-lds-lowering is a hidden flag, and it is enabled by default. It works as below:

(1) Default behavoir is to run the pass as shown below.

  • Old pass manager:
mahesha@brego:[tmp]$ hipcc main.cpp
Running the pass - LowerFunctionLocalLDS
mahesha@brego:[tmp]$
  • New pass manager:
mahesha@brego:[tmp]$ hipcc -fexperimental-new-pass-manager main.cpp
Running the pass - LowerFunctionLocalLDS
mahesha@brego:[tmp]$

(2) The pass will not run when it is explicitly turned off as shown below.

  • Old pass manager:
mahesha@brego:[tmp]$ hipcc -mllvm --amdgpu-enable-function-local-lds-lowering=false main.cpp
mahesha@brego:[tmp]$
  • New pass manager:
mahesha@brego:[tmp]$ hipcc -fexperimental-new-pass-manager  -mllvm --amdgpu-enable-function-local-lds-lowering=false main.cpp
mahesha@brego:[tmp]$
hsmhsm edited the summary of this revision. (Show Details)Jan 19 2021, 9:44 PM
hsmhsm updated this revision to Diff 318723.EditedJan 22 2021, 8:01 PM

Build all the required data structures which will be later used to lower function local LDS. Below are the data structures being built.

[1] Kernel Set - Holds all the kernels in the module
[2] Function Local LDS Set - Holds all the function local LDS from all functions
[3] Function Address Taken Set - Holds all the functions whose address is taken within the module

[4] LDS to Function Map - Maps each function local LDS to a function within which the LDS is defined
[5] Function to LDS Map - Reverse of above map, which maps each functon F to a SET of LDS which are defined within F

[6] Kernel to Callee Map - Maps each kernel K to a SET of functions which define LDS and there exists call graph path from K to these functions.
[7] Kernel to LDS - Maps each kernel K to a set of function local LDS which are supposed to be lowered w.r.t K.

Data structures [1], [2], and [3] are built by iterating over the globals and functions defined within the module.
Data structures [4] and [5] are built using BOTTOM-UP based on the use list of function local LDS.
Data structure [6] is built using TOP-DOWN via call graph traversal.
Data structure [7] is built using the result of above BOTTOM-UP and TOP-DOWN constructed data structures.

hsmhsm updated this revision to Diff 318724.Jan 22 2021, 8:08 PM

Add missing "static" keyword to a function isKernel().

hsmhsm updated this revision to Diff 318726.Jan 22 2021, 8:44 PM

Added a FIXME comment.

Harbormaster completed remote builds in B86389: Diff 318723.
hsmhsm updated this revision to Diff 318730.Jan 22 2021, 9:44 PM

Add missing explicit keyword for constructor.

hsmhsm edited the summary of this revision. (Show Details)Jan 22 2021, 9:46 PM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm updated this revision to Diff 318731.Jan 22 2021, 9:59 PM

Fix few spell mistakes in comments.

hsmhsm updated this revision to Diff 318732.Jan 22 2021, 10:05 PM

Fix few spell mistakes in comments.

hsmhsm updated this revision to Diff 318733.Jan 22 2021, 10:17 PM

Fix comments.

hsmhsm updated this revision to Diff 318736.Jan 22 2021, 11:18 PM

Re-arrange code for more readability.

hsmhsm updated this revision to Diff 318738.Jan 22 2021, 11:40 PM

Fixed clang-tidy warnings.

hsmhsm updated this revision to Diff 318898.Jan 24 2021, 8:39 PM

Code re-organization.

hsmhsm updated this revision to Diff 318899.Jan 24 2021, 9:01 PM

Corrected few comments.

hsmhsm updated this revision to Diff 318904.Jan 24 2021, 10:47 PM

Fixed one of the FIXME comments which is associated with indirect calls.

hsmhsm updated this revision to Diff 318985.Jan 25 2021, 6:19 AM

Make use of llvm append_range() api.

hsmhsm updated this revision to Diff 319026.Jan 25 2021, 8:39 AM

Improvements to code at few places.

hsmhsm edited the summary of this revision. (Show Details)Jan 25 2021, 9:01 AM
hsmhsm edited the summary of this revision. (Show Details)

All tests are now missing

llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
32 ↗(On Diff #319026)

This function is pointless, just directly use isModuleEntryFunctionCC

41–44 ↗(On Diff #319026)

This function is useless. Assert strings also don't need to end in \n

56 ↗(On Diff #319026)

cast<>, don't dyn_cast and assert

66 ↗(On Diff #319026)

Pointless comment

68 ↗(On Diff #319026)

Pointless comment

74–75 ↗(On Diff #319026)

Pointless comment

82 ↗(On Diff #319026)

Extra private

92 ↗(On Diff #319026)

Extra private

93 ↗(On Diff #319026)

Typo unhanlded

111 ↗(On Diff #319026)

.contains

142 ↗(On Diff #319026)

Extra private

233 ↗(On Diff #319026)

Copy of set unnecessary

320 ↗(On Diff #319026)

Don't need all these newlines in assert strings

376–379 ↗(On Diff #319026)

I think you're overcomplicating the CallGraph usage by ignoring most of what it gives you. You should be able to just iterate directly through the CallGraph to get functions reachable from the parent

410–411 ↗(On Diff #319026)

This concept doesn't quite work for the IR. The same global can appear in multiple functions

449 ↗(On Diff #319026)

isa<>, no \n

479 ↗(On Diff #319026)

Should not be checking the function name. Should just skip all declarations

492–495 ↗(On Diff #319026)

Return !Kernels.empty()

502 ↗(On Diff #319026)

Probably should skip declarations. Also not sure about the linkage check

506–509 ↗(On Diff #319026)

Return !empty()

555–558 ↗(On Diff #319026)

Return !empty()

561–571 ↗(On Diff #319026)

Don't understand the point of this stub function

hsmhsm updated this revision to Diff 319316.Jan 26 2021, 8:55 AM

Fixed review comments (by Matt).

hsmhsm marked 22 inline comments as done.Jan 26 2021, 9:12 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
111 ↗(On Diff #319026)

The data structures, ValueMap<>, SmallPtrSet<> do not have member function - .contains(). W.r.t std::set<>, this member function is supported in C++20.

376–379 ↗(On Diff #319026)

As far as I understand it, llvm CallGraph infrastructure does not provide any facility as such. Implementer needs to explicitly iterate the callees of the caller.

410–411 ↗(On Diff #319026)

My understanding is that - scope of the shared variable is function/statement block scope. It is not available to access outside this scope. It is just that we implement it as global, just like how the local static variables are implemented in C/C++?

Can you give an example of the use-case that you are claiming?

502 ↗(On Diff #319026)

The linkage test is required to ignore the dynamic shared variables like the one defined as extern __shared__ int dy_sm[]; where size of dy_sm is not known at compile time, but is passed as one of the kernel execution configuration parameters at run time.

561–571 ↗(On Diff #319026)

This is a driver function, it looks like a stub now, since implementation is not complete yet. Once this patch is accepted, next step is to (1) define kernel specific LDS layouts (2) create 2D offset table and (3) add new implicit argument.

arsenm added inline comments.Jan 26 2021, 7:34 PM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
125–128 ↗(On Diff #319316)

Should just inline this function

182 ↗(On Diff #319316)

Should just inline this function

111 ↗(On Diff #319026)

There is an llvm::is_contained. Also why use std::set? You randomly switch set types around here

arsenm added inline comments.Jan 26 2021, 7:34 PM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
168 ↗(On Diff #319316)

Should just inline this function

205 ↗(On Diff #319316)

Copy here, just directly use this in the for loop

410–411 ↗(On Diff #319026)

The IR has absolutely no concept of these scopes. The global variables have global scope and no restriction on where their uses can appear. Whether or not this directly corresponds to a direct language feature is unimportant. Some IPO transforms can push global variable references into other functions.

The example is just two functions that refer to the same variable:

@lds = ...

define void @func0() {
  store i32 0, i32* @lds
  ret void
}

define void @func1() {
  store i32 0, i32* @lds
  ret void
}
502 ↗(On Diff #319026)

That's a function of it having 0 size, not the linkage

561–571 ↗(On Diff #319026)

If it's going to be split, I'd rather see the full stack for the review

arsenm added inline comments.Jan 26 2021, 7:45 PM
llvm/lib/Target/AMDGPU/AMDGPULowerFunctionLocalLDS.cpp
376–379 ↗(On Diff #319026)

The CallGraph as a whole gives you the functions reachable from each other. I don't think you need to do a stack walk to find the callees. You don't need to care about which functions specifically call which, just that they are all connected

hsmhsm updated this revision to Diff 320037.Jan 28 2021, 8:36 PM
hsmhsm marked 5 inline comments as done.

Pushing renamed code (not for review).

hsmhsm updated this revision to Diff 320090.Jan 29 2021, 3:10 AM

Fixed Matt's comments.

hsmhsm retitled this revision from [AMDGPU][WIP] Lower Function Local LDS Variables. to [AMDGPU][WIP] Lower LDS Global Variables..Jan 29 2021, 3:11 AM
hsmhsm edited the summary of this revision. (Show Details)Feb 2 2021, 7:59 PM
hsmhsm updated this revision to Diff 322270.Feb 8 2021, 7:32 PM

Save current work.

hsmhsm updated this revision to Diff 323247.Feb 12 2021, 1:31 AM

Save current work.

Could we get some tests *and* a commit message that explains what this is supposed to do.

arsenm added inline comments.Feb 12 2021, 8:36 AM
llvm/lib/Target/AMDGPU/AMDGPULowerLDSGlobal.cpp
312

You should only need to do the use replacement, you aren't changing the types of the instructions so cloning/hacking on them shouldn't be needed

447–449

Don't need this, the IR would have failed the verifier to get here

468–469

Should not const_cast

518–519

Too much auto for me

554

Should use lowercase, period separator naming convention with an llvm.amdgcn prefix

566

Should use lowercase, period separator naming convention with an llvm.amdgcn prefix

609

This still needs to add in alignment padding

669

Braces, Also can use range loop

743

I don't see why you need to build your own stack. The call graph already found the reachable functions for you

777

I think trying to handle callees is left for a later patch. Additionally, I think this should be the CallGraph analysis's responsibility to deal with

855–858

The callgraph should already give this to you. Iterating the call graph should give you all of the functions you care about. You don't actually need to worry about which functions call which, since you need to touch every function in the SCC

987–988

I would swap the order of these checks

Could we get some tests *and* a commit message that explains what this is supposed to do.

This is WIP, will add test and commit messages at the end before final review.

hsmhsm updated this revision to Diff 323514.Feb 12 2021, 8:34 PM

Address Matt's comments.

hsmhsm marked 7 inline comments as done.Feb 12 2021, 8:38 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULowerLDSGlobal.cpp
312

Added FIXME comment, will see how to fix it.

609

Added FIXME comment, will see how to fix it.

743

Added FIXME comment, will see how to fix it.

777

Added FIXME comment, will see how to fix it.

855–858

Added FIXME comment, will see how to fix it.

hsmhsm updated this revision to Diff 323518.Feb 12 2021, 9:30 PM

Assert that both caller and callee appear in same translation unit.

hsmhsm updated this revision to Diff 323530.Feb 13 2021, 12:17 AM

Address clang-tidy warnings.

hsmhsm updated this revision to Diff 323532.Feb 13 2021, 12:31 AM

Remove over created auto variables.

jdoerfert requested changes to this revision.Feb 13 2021, 5:10 PM

Could we get some tests *and* a commit message that explains what this is supposed to do.

This is WIP, will add test and commit messages at the end before final review.

I generally would recommend against that but I guess you can use phabricator this way.
However, once people can figure out what this is actually supposed to do, they might effectively restart the entire review process if the design is questioned.
I say this because I have a hunch, or maybe a hope, about the intent of this patch. If it would be that, I'd very much like this to be a generic, non-AMDGPU pass. I might be wrong about what this does, and that is what I'd like to figure out rather sooner than later.

This revision now requires changes to proceed.Feb 13 2021, 5:10 PM

Could we get some tests *and* a commit message that explains what this is supposed to do.

This is WIP, will add test and commit messages at the end before final review.

I generally would recommend against that but I guess you can use phabricator this way.
However, once people can figure out what this is actually supposed to do, they might effectively restart the entire review process if the design is questioned.
I say this because I have a hunch, or maybe a hope, about the intent of this patch. If it would be that, I'd very much like this to be a generic, non-AMDGPU pass. I might be wrong about what this does, and that is what I'd like to figure out rather sooner than later.

OK let's wait for the complete patch then. I will only going to push complete patch next time. But, it may take some time since there are some major hurdles to overcome.

hsmhsm updated this revision to Diff 332541.Mar 22 2021, 11:32 PM

Implemented a new approach based on initializing LDS globals to pointers.

hsmhsm retitled this revision from [AMDGPU][WIP] Lower LDS Global Variables. to [AMDGPU] Replace uses of LDS globals within non-kernel functions by pointers..Mar 22 2021, 11:33 PM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm edited the summary of this revision. (Show Details)Mar 22 2021, 11:35 PM
JonChesterfield added a comment.EditedMar 23 2021, 12:46 AM

This is much more complicated than I expected. Is the large amount of comments largely from a previous patch doing different things that has been hammered into this one?

@jdoerfert the transform I think this is intended to do is:

  • find a large shared variable used from a function
  • add a new void*, also in shared, pointing to it
  • initialize that void* only in kernels that can call functions that use the large variable
  • replace all uses with

That means, on amdgcn, the large variable only costs LDS space in kernels that definitely use it. I don't know how cuda lowers shared accesses from functions, it could plausibly benefit from the same transform.

I can't work out which LDS variables you intend to replace with pointers from the code. Could you spell out what the condition under which you intend to replace one is?

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
177 ↗(On Diff #332541)

Why not isa<GlobalVariable> / function needs a different name

204 ↗(On Diff #332541)

Functions define hasAddressTaken, but also I don't think this pass needs to distinguish between direct and indirect calls

224 ↗(On Diff #332541)

This I haven't read yet, but it looks like far too much state. Expected a set of LDS globals called 'toReplaceWithPointer' or similar instead of all the maps

384 ↗(On Diff #332541)

Why do we want to replace constexpr with instructions? This comment contradicts the implementation

llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.cpp
20 ↗(On Diff #332541)

Perhaps name the new files after LDS to make it clearer that they're used for LDS lowering an optimisation, not necessarily general purpose.

Also move the functions out in a separate commit, without changes to their implementation, as that improves the signal/noise of the functional change.

59 ↗(On Diff #332541)

e.g. I recognise this as newly introduced by the comment, but in phab it's hard to distinguish from things that haven't changed

llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.h
16 ↗(On Diff #332541)

Include list should be limited to those that are used by the header, with the ones used by the source included there

This is much more complicated than I expected.

We need to *really* discuss, what is complicated here and what is violated here from the internal email discussions.

Is the large amount of comments largely from a previous patch doing different things that has been hammered into this one?

No, nothing is hammered from the previous patch. The current patch is implementing what is planned via internal emial discussion.

@jdoerfert the transform I think this is intended to do is:

  • find a large shared variable used from a function
  • add a new void*, also in shared, pointing to it
  • initialize that void* only in kernels that can call functions that use the large variable
  • replace all uses with

No, the intended implementation plan which is implemented here is as follows.

(1) Identify the LDS globals (whether large or small) which are used within non-kernel function scope and in global scope.
(2) Create new LDS glboals of i16 type corresponding to every LDS global identified above. The i16 typed LDS globals act as pointers to corresponding original LDS globals.
(3) push the *use* of above identified LDS globals to kernels by adding instructions within the kernels which initialize the address of original LDS globals to their respective pointers. This is will make sure that per kernel LDS allocation for these LDS globals correctly happen.
(4) Within non-kernel functions, replace the *use* of original LDS globals by thier respective pointers.
(5) Keep the global scope use of original LDS globals unchanged since now they should work automatically as the use of these original LDS globals (pointer initialiation) also there within all kernels and hence it will semantically work correct as expected because of per kernel LDS allocation for these LDS globals.

That means, on amdgcn, the large variable only costs LDS space in kernels that definitely use it. I don't know how cuda lowers shared accesses from functions, it could plausibly benefit from the same transform.

Let's not bother about how CUDA handles it since there is lot of differences here. And focus on only AMDGCN.

I can't work out which LDS variables you intend to replace with pointers from the code. Could you spell out what the condition under which you intend to replace one is?

All those LDS globals which are used within non-kernel functions and within global scope requires pointer initialization within kernels.

hsmhsm added inline comments.Mar 29 2021, 1:13 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
177 ↗(On Diff #332541)

Because LDS would be nested within the const expr within global scope use.

204 ↗(On Diff #332541)

I am not getting this comment, probably we can discuss it offline.

224 ↗(On Diff #332541)

These maps are required for the logic where we really need to restrict the LDS set for kernel based on kernel excecution paths.

384 ↗(On Diff #332541)

Again not clear about what you intended here - let's take it offline.

llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.cpp
20 ↗(On Diff #332541)

will think about it.

59 ↗(On Diff #332541)

not sure what you mean here. Let's discuss offline.

llvm/lib/Target/AMDGPU/Utils/AMDGPUGeneralUtils.h
16 ↗(On Diff #332541)

agree.

hsmhsm added a comment.Apr 6 2021, 8:25 PM

Hi Jon,

I have replied to some of your review comments, and few other comments require internal discussion for better and quick unambiguous conclusion. I am expecting a response from you.

This is much more complicated than I expected.
Is the large amount of comments largely from a previous patch doing different things that has been hammered into this one?

No, nothing is hammered from the previous patch. The current patch is implementing what is planned via internal emial discussion.

This review dates from November 16 last year and contains hundreds of review comments against code that may or may not still be in the latest revision, this being diff #34 at time of commenting. If I'm following along successfully, the design has changed significantly and repeatedly during that process. It is therefore very difficult to determine what the design intent behind the current revision is. That is what I mean by 'previous patch has been hammered into this one'.

The algorithm I had in mind was along the lines of:

for each LDS variable:
  if should-transform
    create 16 bit integer in LDS
    initialize that global with (constexpr) address of variable
    replace all uses of variable with a (constexpr) access through new pointer

where
should-transform:
 if (sizeof) < 8ish return false
 if used by instruction in indirectly called function return false
 if only used by kernels return false
 probably other exclusions
 return true
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
384 ↗(On Diff #332541)

I was thinking the introduced 16 bit pointers will be initialised with constexpr from the corresponding variable.

This patch presently initialises them with undef, which I think thwarts using constexpr everywhere, and means we insert stores in the kernel entry basic block here.

If we fix the back end to handle LDS variables with initializers (at least the simple case of only used from kernel and initialized with address of some other variable), then quite a lot of the complexity of this patch drops out.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
195–202

I think this test reads better as proposed here - 'enable-lower-module-lds=true' is better than 'disable-lower-module-lds=false'. Separable from the rest of this patch, we could land a patch that just inverts that commandline flag and updates the tests to match. That removes some noise from this review.

llvm/test/CodeGen/AMDGPU/lower-module-lds.ll
22 ↗(On Diff #332541)

This test should only check the behaviour of lower-module-lds. Separate tests check the behaviour of amdgpu-replace-lds-use-with-pointer.

Equally, running amdgpu-lower-module-lds by itself should not automatically run amdgpu-replace-lds-use-with-pointer and vice versa.

llvm/test/CodeGen/AMDGPU/replace_lds_report_error_no_func_def.ll
3 ↗(On Diff #332541)

This is an error in the implementation, not something that should have a test checking the implementation is broken. Instead of assuming the definition of both are in the same module and crashing if they aren't, the pass should ignore a variable which doesn't meet that requirement.

llvm/test/CodeGen/AMDGPU/replace_lds_test_direct_call_misc.ll
15 ↗(On Diff #332541)

These tests would be more robust if the new pointer was named based on the global it is intended to reference, as then the regex can check that we created load from the correct pointer (as opposed to just one of the new pointers).

The algorithm I had in mind was along the lines of:

for each LDS variable:
  if should-transform
    create 16 bit integer in LDS
    initialize that global with (constexpr) address of variable
    replace all uses of variable with a (constexpr) access through new pointer

where
should-transform:
 if (sizeof) < 8ish return false
 if used by instruction in indirectly called function return false
 if only used by kernels return false
 probably other exclusions
 return true

I think, we all of us, who are involved in discussing about the functionalities related to this patch are not on the same page. First, we need to internally discuss it and make sure that we are all on the same page, before I start making any further changes to this patch.

By the way, this patch has gone through too many revisions, and is becoming too complex to go back to any previous history of the patch when required. So, I think, it is better to abandon this patch, start on a clean slate with fresh new patch. If I do not get any objection to abandon this patch, then I will be abonding it in a day or two.

hsmhsm abandoned this revision.Apr 19 2021, 9:05 AM

As mentioned eariler, I am abandoning this patch. Let's start with a clean slate, and decide on the implementation.