This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allocate LDS globals in sorted order of their alignment and size.
Abandoned · Public

Authored by hsmhsm on May 13 2021, 7:11 AM.

Details

Summary

At the start of ISel lowering of kernel function K:

[1] Collect all the LDS globals used within the kernel K; call this set LDSSet.
[2] Fix the natural alignment of all the LDS globals in LDSSet.
[3] Sort the LDS globals in LDSSet first by alignment and then by size (see the comparator sketch below).
[4] Allocate memory for the LDS globals in LDSSet in that sorted order.
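
For illustration, a minimal sketch of this ordering as a standalone comparator (the struct and function names are hypothetical, not the patch's actual code, which works on GlobalVariable lists):

#include <algorithm>
#include <vector>

struct LDSGlobal {
  unsigned Alignment; // natural alignment in bytes
  unsigned Size;      // size in bytes
};

// Order by alignment, descending; break ties by size, descending
// (one plausible tie-break under the summary above).
static void sortLDSGlobals(std::vector<LDSGlobal> &Globals) {
  std::stable_sort(Globals.begin(), Globals.end(),
                   [](const LDSGlobal &A, const LDSGlobal &B) {
                     if (A.Alignment != B.Alignment)
                       return A.Alignment > B.Alignment;
                     return A.Size > B.Size;
                   });
}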

Diff Detail

Event Timeline

hsmhsm created this revision.May 13 2021, 7:11 AM
hsmhsm requested review of this revision.May 13 2021, 7:11 AM
Herald added a project: Restricted Project. May 13 2021, 7:11 AM
foad added inline comments.May 13 2021, 7:59 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
20

Isn't this llvm::is_contained?

31

Isn't there a suitable Alignment::max?

92–93

Why is sorting by size important?

I can see that sorting by alignment, descending, avoids alignment gaps (at least in the common case where an object's size is a multiple of its alignment). But sorting by size first means that most globals won't be sorted by alignment (unless they happen to have the same size and different alignments), so you won't get that benefit.
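
To make the padding argument concrete, a small self-contained illustration (the sizes and alignments are made up, not taken from the patch or its tests):

#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // Lay out (size, alignment) pairs in the given order, rounding each
  // offset up to the object's alignment before placing it; return the
  // total footprint.
  auto layout = [](const std::vector<std::pair<unsigned, unsigned>> &Objs) {
    unsigned Offset = 0;
    for (auto [Size, Align] : Objs) {
      Offset = (Offset + Align - 1) / Align * Align; // align up
      Offset += Size;
    }
    return Offset;
  };
  // Size-ascending order pads 2 bytes before the 8-aligned object.
  printf("%u\n", layout({{2, 2}, {4, 4}, {8, 8}})); // prints 16
  // Alignment-descending order packs with no padding at all.
  printf("%u\n", layout({{8, 8}, {4, 4}, {2, 2}})); // prints 14
}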

hsmhsm added inline comments.May 13 2021, 8:44 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
31

I could not find any such function. I could see the ones below, but one of the arguments has to be a MaybeAlign.

inline Align max(MaybeAlign Lhs, Align Rhs) {
  return Lhs && *Lhs > Rhs ? *Lhs : Rhs;
}

inline Align max(Align Lhs, MaybeAlign Rhs) {
  return Rhs && *Rhs > Lhs ? *Rhs : Lhs;
}
92–93

I think we need to sort first on alignment, and then, on a tie, on size.

rampitec added inline comments.May 13 2021, 8:46 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

Probably yes. The sort-by-size strategy was used along with the super-alignment, so it did the same thing.

hsmhsm updated this revision to Diff 345189.May 13 2021, 9:46 AM

Fixed review comments by Jay.

hsmhsm retitled this revision from [AMDGPU] Allocate LDS globals in sorted order of their size and alignment. to [AMDGPU] Allocate LDS globals in sorted order of their alignment and size..May 13 2021, 9:47 AM
hsmhsm edited the summary of this revision. (Show Details)

We probably need a test which only runs this lowering and showing a resulting transformed IR. Test the alignment, sorting and situation with multiple kernels using intersecting LDS variable sets.

rampitec added inline comments.May 13 2021, 10:33 AM
llvm/include/llvm/Support/Alignment.h
348 ↗(On Diff #345189)

It does not follow the surrounding style; put the body on separate lines.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
41

Is there a reason to use std::set and not one of llvm's collections?

46

Get DL outside of the loop.

52

Can a GV hold a pointer to this GV?

59–60

This comment is no longer relevant?

61

It can probably be Operator.

70

Why std::vector?

82

Looks like you are traversing all users just to check if a GV is used from this kernel. This is wasteful; you can just create a function that checks whether a GV is reachable from this kernel and returns early as soon as it is.
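
A rough sketch of the suggested early-return shape (the helper name is hypothetical, and constant-expression users are ignored here for brevity, although a real walk would have to look through them):

#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

static bool isUsedFromKernel(const GlobalVariable &GV, const Function &K) {
  for (const User *U : GV.users())
    if (const auto *I = dyn_cast<Instruction>(U))
      if (I->getFunction() == &K)
        return true; // one use inside K is enough; stop walking
  return false;
}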

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
269

This is mostly a copy-paste of findVariablesToLower() above. I would suggest factoring it out somehow.

foad added inline comments.May 14 2021, 3:06 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

I still don't understand if there's a reason to sort by size or name here. It's a stable sort, so why not just sort by alignment?

rampitec added inline comments.May 14 2021, 3:09 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

Size may leave smaller gaps. Name is overkill, merely a backdoor to hack it and force non-trivial "optimizations" by renaming variables. But there are more conceptual issues so far.

hsmhsm updated this revision to Diff 345398.May 14 2021, 4:50 AM

Fixed review comments by Stas.

hsmhsm marked 14 inline comments as done.May 14 2021, 4:53 AM

We probably need a test which only runs this lowering and showing a resulting transformed IR. Test the alignment, sorting and situation with multiple kernels using intersecting LDS variable sets.

Added a test - lds-allocation.ll. However, note that we do not actually update the alignment of LDS globals within the module; during allocation we only compute the required natural alignment for proper allocation.

llvm/include/llvm/Support/Alignment.h
348 ↗(On Diff #345189)

This was forced by the clang-format tool.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

I did not understand this comment.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

This has been copy/pasted from lowermodulelds, where the sort by name was mostly done for more predictable test cases.

hsmhsm added inline comments.May 14 2021, 5:09 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

I thought it would be helpful in the following scenario. Here we probably know that the allocation order is [A, B, C], since both size and alignment are the same for all three globals.

@A = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4
@B = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4
@C = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4

Added a test - lds-allocation.ll. However, note that we do not actually update the alignment of LDS globals within the module; during allocation we only compute the required natural alignment for proper allocation.

I mean an IR test where we can see what lowering has done, not just total allocation amount.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

llvm::Operator.

83

SmallVectorImpl<>

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Maybe just pass a filter lambda there?

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
52

SmallVectorImpl<>.

arsenm added inline comments.May 14 2021, 3:24 PM
llvm/include/llvm/Support/Alignment.h
348 ↗(On Diff #345398)

regular std::max works for this
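
For reference, llvm::Align provides the usual comparison operators, so the plain std::max call suggested here compiles directly; a one-line sketch:

#include "llvm/Support/Alignment.h"
#include <algorithm>

llvm::Align maxAlign(llvm::Align A, llvm::Align B) { return std::max(A, B); }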

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
21

You're ignoring the explicit alignment of the global, but also implicitly adding the ABI alignment of the type

73–74

cast<>

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
37

I think GV directly has a getAddressSpace

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
52

That doesn't work for a return value

Can this go into an analysis pass or something? It seems unfortunate that every single function is going to end up doing this same walk through the module

hsmhsm updated this revision to Diff 345678.May 16 2021, 1:16 AM

Partially fixed review comments by Matt and Stas.

hsmhsm marked 9 inline comments as done.May 16 2021, 1:31 AM

Can this go into an analysis pass or something? It seems unfortunate that every single function is going to end up doing this same walk through the module

The only repeated walk through the module is for collecting static LDS globals. If we want to avoid it, then I am not sure how an analysis result (via an opt analysis pass) can be used within the ISel stage. Do you know how?

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
21

I think in that case we need a separate opt transformation pass just before the ISel (llc) pass, which actually updates the natural alignment of LDS globals based on their size when there is an underaligned case.

61

llvm::Operator is an abstraction over Instruction and ConstantExpr. An llvm::Operator should be either an Instruction or a ConstantExpr, and hence explicit handling of llvm::Operator is not required.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

I personally think that it would be over-complicated: collectStaticLDSGlobals() is called by both findVariablesToLower() and getSortedUsedLDSGlobals(). Further filtering of LDS globals is required for findVariablesToLower(), but not for getSortedUsedLDSGlobals().

hsmhsm marked 3 inline comments as done.May 16 2021, 1:37 AM

I mean an IR test where we can see what lowering has done, not just total allocation amount.

Probably then we need to dump the MIR result by passing the option "--print-after-isel", and then test for the memory addresses. But I see these addresses are still based on the alignment info available in the IR. So I think we first need to explicitly update the alignment of LDS globals in a separate opt pass just before the ISel pass?

hsmhsm updated this revision to Diff 345718.May 16 2021, 11:44 AM

Updated the llvm lit test to check the output of the "instruction selection" phase.

I have updated the lit test to check the output of the "instruction selection" pass.

hsmhsm updated this revision to Diff 345751.May 16 2021, 10:47 PM

Updated failing lit test - CodeGen/MIR/AMDGPU/machine-function-info.ll

If we already have a pass that condenses the LDS globals into a single variable to access, what is the advantage of this? Why can't we just always do that compacting and codegen will then not have to worry about optimal LDS packing since it will only see the one global

But do we really have one right now? As you might know, "LowerModuleLDSPass" does not guarantee it (at least as it exists today). We need to have something working until we have a kind of perfect solution. And I strongly believe this patch is really useful, in the sense that it will definitely increase the probability of an unaligned DS access being aligned at runtime. Moreover, it only touches kernels, not every single function in the module, as you said on one of the earlier patches. I do not see any harm in having this patch, unless I am missing something.

I'm saying you're putting a nontrivial IR inspection into a codegen pass. It would be cleaner if we could leave the allocation optimizations entirely in an IR pass.

At the IR level (in an opt pass), you can probably do the following:

(1) infer and fix the LDS alignment
(2) update the align attribute of LDS globals and LDS memory operations based on (1)

But sorting of the LDS globals used per kernel can be achieved only in a codegen pass, as done in this patch. The reason is that I do not see any neat way to pass an analysis result built in an IR pass on top of the IR through to a codegen pass.

Maybe we can have it for the time being and, as a next step, see what best we can do for it.

hsmhsm updated this revision to Diff 345918.May 17 2021, 9:58 AM

Moved a few utility functions to the AMDGPULDSUtils.cpp file, since we may end up reusing them in the near future.

If you rewrite all the accesses relative to a single LDS global, there's nothing for codegen to sort.

But do we have any immediately available neat solution(s) that will make sure that all the LDS globals used within a kernel are replaced by a single struct (or some other) instance? We have been struggling with this for a long time. If we solve that problem in the near future, then this patch becomes irrelevant and we can revert it at that point; until then, this patch is essential for the immediate need. But let's move all the utility code from AMDGPUMachineFunction to the utility file.

hsmhsm updated this revision to Diff 345921.May 17 2021, 10:13 AM

Move utility functions from AMDGPUMachineFunction to utility file

But do we have any immediately available neat solution(s) that will make sure that all the LDS globals used within a kernel are replaced by a single struct (or some other) instance?

That's essentially what AMDGPULowerModuleLDSPass does, and could be extended to have a kernel specific table

rampitec added inline comments.May 17 2021, 1:16 PM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

No, it is a separate subclass of User: https://llvm.org/doxygen/classllvm_1_1User.html

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

I do not like the idea of allocating memory, then copying.

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

The test checking MIR is not any better than a test checking instructions. In fact it is worse, because it is using the unstable MIR interface. It still does not check the actual layout in any transparent way; it checks the offsets. You can check the offsets in the produced ISA.

It does not help that the checks are autogenerated; it obscures the relevant checks. For that sort of test you need to check the offsets, nothing more. It also depends on the assumption that stores will not be rescheduled, and there is no such guarantee in this case. To work around it you can store different values (e.g. 1, 2, 3, 4, etc.), check that in the ISA, and mark the stores volatile.

One thing it does not check is that we do not allocate more LDS than used. It does not check the situation with multiple kernels using intersecting LDS variable sets.

I was looking at a test which will really check memory layout, like that is possible with AMDGPULowerModuleLDSPass. In fact I am probably seconding Matt: we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.
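
A sketch of the kind of test being described above (the global names, values, and kernel are illustrative only, not the contents of lds-allocation.ll):

@lds.a = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4
@lds.b = internal unnamed_addr addrspace(3) global [2 x i8] undef, align 2

define amdgpu_kernel void @kern() {
  ; Distinct stored values plus volatile keep the stores identifiable
  ; and in order, so the ds_write offsets can be matched in the ISA.
  store volatile i8 1, i8 addrspace(3)* getelementptr inbounds ([4 x i8], [4 x i8] addrspace(3)* @lds.a, i32 0, i32 0)
  store volatile i8 2, i8 addrspace(3)* getelementptr inbounds ([2 x i8], [2 x i8] addrspace(3)* @lds.b, i32 0, i32 0)
  ret void
}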

That's essentially what AMDGPULowerModuleLDSPass does, and could be extended to have a kernel specific table

Ok, assume that we will extend AMDGPULowerModuleLDSPass to have a kernel-specific table. What I am not getting is how to use this table to replace LDS globals with their offset counterparts within non-kernel functions. Passing a kernel-id as an argument is ruled out, since the presence of indirect calls really breaks this: function pointers can be used in so many different ways.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

But if I look into the implementation of Operator at https://llvm.org/doxygen/Operator_8h_source.html, it clearly tells me that an instance of the Operator class will never be created; it is really a placeholder for Instruction and ConstantExpr.

/// Return the opcode for this Instruction or ConstantExpr.
unsigned getOpcode() const {
  if (const Instruction *I = dyn_cast<Instruction>(this))
    return I->getOpcode();
  return cast<ConstantExpr>(this)->getOpcode();
}
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Then you mean to say, let's return a vector instead of passing it as a reference argument? Then we cannot use SmallVectorImpl<> as the return type, so we need to go back to returning SmallVector<> itself.

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

I think all of you are kind of suggesting extending AMDGPULowerModuleLDSPass. But I am not sure how easily it can be extended; moreover, it is being invoked before the PromoteAlloca pass at the moment. We really need to discuss it and come to a common conclusion before proceeding further.

For me, it looks like there is a ball, half of which is painted white and the other half black; a few are looking at the white part, others at the black part, and we are not able to come to a common conclusion.

foad added inline comments.May 18 2021, 2:01 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

To avoid allocating memory for a copy of the data, you could filter StaticLDSGlobals in-place with erase(remove_if(StaticLDSGlobals.begin(), StaticLDSGlobals.end(), predicate), StaticLDSGlobals.end()).

hsmhsm added inline comments.May 18 2021, 2:52 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Probably we land up with code like the one below. We still need to copy to LocalVars, because findVariablesToLower() expects a std::vector return. And, based on earlier review comments, I have made changes to collectStaticLDSGlobals() to work with SmallVector. Additionally, look at the complexity of the code change in terms of understanding it. So I thought it is not worth it, since we still need to copy anyway.

StaticLDSGlobals.erase(
    std::remove_if(StaticLDSGlobals.begin(), StaticLDSGlobals.end(),
                   [&](const GlobalVariable *GV) {
                     GlobalVariable *GV2 = const_cast<GlobalVariable *>(GV);
                     if (std::none_of(GV2->user_begin(), GV2->user_end(), [&](User *U) {
                           return userRequiresLowering(UsedList, U);
                         })) {
                       return true;
                     }
                     return false;
                   }),
    StaticLDSGlobals.end());

std::vector<llvm::GlobalVariable *> LocalVars(StaticLDSGlobals.begin(),
                                              StaticLDSGlobals.end());
hsmhsm added inline comments.May 18 2021, 3:01 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Additionally, there is also the added complexity of handling the "constness" of "GlobalVariable *".

foad added inline comments.May 18 2021, 3:24 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236
if (std::none_of(GV2->user_begin(), GV2->user_end(), [&](User *U) {
      return userRequiresLowering(UsedList, U);
    })) {
  return true;
}
return false;

This should just return std::none_of(...);

Additionally, there is also a added complexity of handling "constness" of "GlobalVariable *"

Then fix userRequiresLowering first? Something like: https://reviews.llvm.org/differential/diff/346095/

hsmhsm added inline comments.May 18 2021, 3:35 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

This should just return std::none_of(...);

Agree

Then fix userRequiresLowering first? Something like: https://reviews.llvm.org/differential/diff/346095/

I assume it is not as simple as that, and I guess it will force me to make changes within AMDGPULowerModuleLDSPass.cpp, which I want to avoid at this point. These kinds of changes are better implemented as a separate code-refactoring patch. But nevertheless, if you think it is a must, then I will do it.

rampitec added inline comments.May 18 2021, 11:09 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

Example:

%sunkaddr4 = getelementptr inbounds i8, i8* bitcast ({ %union, [2000 x i8] }* @global_pointer to i8*), i64 %sunkaddr3
llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

PromoteAlloca is probably not an issue; it will just add a new allocation past that struct. It should already be properly aligned, as we are creating it ourselves. The current waste of LDS by the AMDGPULowerModuleLDSPass is an issue though. Ideally it would be nice to come up with a single solution for functions and kernels, which I guess we will continue discussing...

hsmhsm added inline comments.May 19 2021, 1:00 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

If I understand it correctly, @global_pointer is used within a constant expression - a bitcast - and this constant expression is used within an instruction - a getelementptr.

And this case is neatly handled in the code. I am not sure how relevant it is to the "Operator" that you brought up. I still believe there is no need to check for "Operator".

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

I still do not understand if we can really handle it within AMDGPULowerModuleLDSPass. Expecting a single struct variable, and only a single struct variable, within every kernel is too idealistic an expectation at this point. And I am not sure why you are insisting on it.

Strictly speaking, this pass does not update the align attribute of LDS globals; it only fixes the alignment and allocates memory accordingly. So the lit tests should check the allocation boundaries and the total allocated size, and that is what I did here. If you are not happy with the MIR result, then I can check it on the final assembly.

To answer some of your earlier comments:

The test checking MIR is not any better than a test checking instructions. In fact it is worse, because it is using the unstable MIR interface. It still does not check the actual layout in any transparent way; it checks the offsets. You can check the offsets in the produced ISA.

I can change it to check instructions instead of MIR. Based on the sorting algorithm implemented, we know which LDS global should be allocated at which address, and how much LDS should be allocated in total. That is what I did here.

It does not help that the checks are autogenerated; it obscures the relevant checks. For that sort of test you need to check the offsets, nothing more. It also depends on the assumption that stores will not be rescheduled, and there is no such guarantee in this case. To work around it you can store different values (e.g. 1, 2, 3, 4, etc.), check that in the ISA, and mark the stores volatile.

Let me see, what is missing here.

One thing it does not check is that we do not allocate more LDS than used.

I am checking for it as in "SDAG-HSA: amdhsa_group_segment_fixed_size 31"

It does not check the situation with multiple kernels using intersecting LDS variable sets.

I will try to improve the test cases.

I was looking at a test which will really check memory layout like that is possible with AMDGPULowerModuleLDSPass

Checking which LDS global is allocated at which address is not memory layout? Could you please give an example of what you are expecting?

In fact I am probably seconding Matt: we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.

I doubt if we can quickly achieve it considering all the mess happening right now.

hsmhsm updated this revision to Diff 346361.May 19 2021, 1:42 AM

Rebased to latest main.

hsmhsm updated this revision to Diff 346415.May 19 2021, 5:37 AM

Update lit tests as below:

[1] Check instructions instead of MIR
[2] Add test to check LDS used with multiple kernels
[3] Add test to check that llvm.amdgcn.module.lds is allocated at address 0

hsmhsm updated this revision to Diff 346417.May 19 2021, 5:43 AM

Fixed comment within lit test.

hsmhsm updated this revision to Diff 346429.May 19 2021, 6:18 AM

Within the lit tests, make sure that different values are stored to the LDS globals so that it is a bit easier to track which memory is allocated for which LDS global.

rampitec added inline comments.May 19 2021, 10:31 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

OK, I was unable to exploit it; it is indeed cast to Constant.

llvm/test/CodeGen/AMDGPU/GlobalISel/lds-allocation.ll
35

One remaining problem with these tests is that they now depend on the register allocation and the actual register names. That will be difficult to maintain.

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

I was looking at a test which will really check memory layout like that is possible with AMDGPULowerModuleLDSPass

Checking which LDS global is allocated at which address is not memory layout? Could you please give an example of what you are expecting?

I was thinking about a test just showing the IR globals after the transformation, but I guess since this is still just lowering, the IR is not updated and such a test is not possible.

In fact I am probably second to Matt we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.

I doubt if we can quickly achieve it considering all the mess happening right now.

Given that it does not solve minimal per-kernel allocation, as we discussed, yes, we cannot just use it right away.

hsmhsm abandoned this revision.May 21 2021, 2:33 AM

Since we are planning different solutions, I am abandoning this patch.