This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allocate LDS globals in sorted order of their alignment and size.
Abandoned · Public

Authored by hsmhsm on May 13 2021, 7:11 AM.

Details

Summary

At the start of ISel lowering of kernel function K:

[1] Collect all the LDS globals used within the kernel K; call this set LDSSet.
[2] Fix the natural alignment of all the LDS globals in LDSSet.
[3] Sort the LDS globals in LDSSet first by alignment and then by size (see the comparator sketch below).
[4] Allocate memory for the LDS globals in LDSSet in that sorted order.
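
For illustration, a minimal sketch of this ordering as a standalone comparator (the struct and function names are hypothetical, not the patch's actual code, which works on GlobalVariable lists):

#include <algorithm>
#include <vector>

struct LDSGlobal {
  unsigned Alignment; // natural alignment in bytes
  unsigned Size;      // size in bytes
};

// Order by alignment, descending; break ties by size, descending
// (one plausible tie-break under the summary above).
static void sortLDSGlobals(std::vector<LDSGlobal> &Globals) {
  std::stable_sort(Globals.begin(), Globals.end(),
                   [](const LDSGlobal &A, const LDSGlobal &B) {
                     if (A.Alignment != B.Alignment)
                       return A.Alignment > B.Alignment;
                     return A.Size > B.Size;
                   });
}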

Diff Detail

Event Timeline

hsmhsm created this revision.May 13 2021, 7:11 AM
hsmhsm requested review of this revision.May 13 2021, 7:11 AM
Herald added a project: Restricted Project. May 13 2021, 7:11 AM
foad added inline comments.May 13 2021, 7:59 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
20

Isn't this llvm::is_contained?

31

Isn't there a suitable Alignment::max?

92–93

Why is sorting by size important?

I can see that sorting by alignment, descending, avoids alignment gaps (at least in the common case where an object's size is a multiple of its alignment). But sorting by size first means that most globals won't be sorted by alignment (unless they happen to have the same size and different alignments), so you won't get that benefit.
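
To make the padding argument concrete, a small self-contained illustration (the sizes and alignments are made up, not taken from the patch or its tests):

#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // Lay out (size, alignment) pairs in the given order, rounding each
  // offset up to the object's alignment before placing it; return the
  // total footprint.
  auto layout = [](const std::vector<std::pair<unsigned, unsigned>> &Objs) {
    unsigned Offset = 0;
    for (auto [Size, Align] : Objs) {
      Offset = (Offset + Align - 1) / Align * Align; // align up
      Offset += Size;
    }
    return Offset;
  };
  // Size-ascending order pads 2 bytes before the 8-aligned object.
  printf("%u\n", layout({{2, 2}, {4, 4}, {8, 8}})); // prints 16
  // Alignment-descending order packs with no padding at all.
  printf("%u\n", layout({{8, 8}, {4, 4}, {2, 2}})); // prints 14
}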

hsmhsm added inline comments.May 13 2021, 8:44 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
31

I could not find any such function. I could see the ones below, but one of the arguments has to be a MaybeAlign.

inline Align max(MaybeAlign Lhs, Align Rhs) {
  return Lhs && *Lhs > Rhs ? *Lhs : Rhs;
}

inline Align max(Align Lhs, MaybeAlign Rhs) {
  return Rhs && *Rhs > Lhs ? *Rhs : Lhs;
}
92–93

I think we need to sort first on alignment, and then, on a tie, on size.

rampitec added inline comments.May 13 2021, 8:46 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

Probably yes. The sort-by-size strategy was used along with the super-alignment, so it did the same thing.

hsmhsm updated this revision to Diff 345189.May 13 2021, 9:46 AM

Fixed review comments by Jay.

hsmhsm retitled this revision from [AMDGPU] Allocate LDS globals in sorted order of their size and alignment. to [AMDGPU] Allocate LDS globals in sorted order of their alignment and size..May 13 2021, 9:47 AM
hsmhsm edited the summary of this revision. (Show Details)

We probably need a test which only runs this lowering and showing a resulting transformed IR. Test the alignment, sorting and situation with multiple kernels using intersecting LDS variable sets.

rampitec added inline comments.May 13 2021, 10:33 AM
llvm/include/llvm/Support/Alignment.h
348 ↗(On Diff #345189)

It does not follow the surrounding style; put the body on separate lines.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
41

Is there a reason to use std::set and not one of llvm's collections?

46

Get DL outside of the loop.

52

Can a GV hold a pointer to this GV?

59–60

This comment is no longer relevant?

61

It can probably be Operator.

70

Why std::vector?

82

Looks like you are traversing all users just to check if a GV is used from this kernel. This is wasteful; you can just create a function that checks whether a GV is reachable from this kernel and returns early as soon as it is.
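
A rough sketch of the suggested early-return shape (the helper name is hypothetical, and constant-expression users are ignored here for brevity, although a real walk would have to look through them):

#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

static bool isUsedFromKernel(const GlobalVariable &GV, const Function &K) {
  for (const User *U : GV.users())
    if (const auto *I = dyn_cast<Instruction>(U))
      if (I->getFunction() == &K)
        return true; // one use inside K is enough; stop walking
  return false;
}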

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
269

This is mostly a copy-paste of findVariablesToLower() above. I would suggest factoring it out somehow.

foad added inline comments.May 14 2021, 3:06 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

I still don't understand if there's a reason to sort by size or name here. It's a stable sort, so why not just sort by alignment?

rampitec added inline comments.May 14 2021, 3:09 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

Size may leave smaller gaps. Name is overkill, merely a backdoor to hack it and force non-trivial "optimizations" by renaming variables. But there are more conceptual issues so far.

hsmhsm updated this revision to Diff 345398.May 14 2021, 4:50 AM

Fixed review comments by Stas.

hsmhsm marked 14 inline comments as done.May 14 2021, 4:53 AM

We probably need a test which only runs this lowering and showing a resulting transformed IR. Test the alignment, sorting and situation with multiple kernels using intersecting LDS variable sets.

Added a test - lds-allocation.ll. However, note that we do not actually update the alignment of LDS globals within the module; during allocation we only compute the required natural alignment for proper allocation.

llvm/include/llvm/Support/Alignment.h
348 ↗(On Diff #345189)

This was forced by the clang-format tool.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

I did not understand this comment.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

This has been copy/pasted from lowermodulelds, where the sort by name was mostly done for more predictable test cases.

hsmhsm added inline comments.May 14 2021, 5:09 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
92–93

I thought it would be helpful in the following scenario. Here we probably know that the allocation order is [A, B, C], since both size and alignment are the same for all three globals.

@A = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4
@B = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4
@C = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4

Added a test - lds-allocation.ll. However, note that we do not actually update the alignment of LDS globals within the module; during allocation we only compute the required natural alignment for proper allocation.

I mean an IR test where we can see what lowering has done, not just total allocation amount.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

llvm::Operator.

83

SmallVectorImpl<>

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Maybe just pass a filter lambda there?

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
52

SmallVectorImpl<>.

arsenm added inline comments.May 14 2021, 3:24 PM
llvm/include/llvm/Support/Alignment.h
348 ↗(On Diff #345398)

regular std::max works for this
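
For reference, llvm::Align provides the usual comparison operators, so the plain std::max call suggested here compiles directly; a one-line sketch:

#include "llvm/Support/Alignment.h"
#include <algorithm>

llvm::Align maxAlign(llvm::Align A, llvm::Align B) { return std::max(A, B); }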

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
21

You're ignoring the explicit alignment of the global, but also implicitly adding the ABI alignment of the type

73–74

cast<>

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
37

I think GV directly has a getAddressSpace

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
52

That doesn't work for a return value

Can this go into an analysis pass or something? It seems unfortunate that every single function is going to end up doing this same walk through the module

hsmhsm updated this revision to Diff 345678.May 16 2021, 1:16 AM

Partially fixed review comments by Matt and Stas.

hsmhsm marked 9 inline comments as done.May 16 2021, 1:31 AM

Can this go into an analysis pass or something? It seems unfortunate that every single function is going to end up doing this same walk through the module

The only repeated walk through the module is for collecting static LDS globals. If we want to avoid it, then I am not sure how an analysis result (via an opt analysis pass) can be used within the ISel stage. Do you know how?

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
21

I think in that case we need a separate opt transformation pass just before the ISel (llc) pass, which actually updates the natural alignment of LDS globals based on their size when there is an underaligned case.

61

llvm::Operator is an abstraction over Instruction and ConstantExpr. An llvm::Operator should be either an Instruction or a ConstantExpr, and hence explicit handling of llvm::Operator is not required.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

I personally think that it would be over-complicated: collectStaticLDSGlobals() is called by both findVariablesToLower() and getSortedUsedLDSGlobals(). Further filtering of LDS globals is required for findVariablesToLower(), but not for getSortedUsedLDSGlobals().

hsmhsm marked 3 inline comments as done.May 16 2021, 1:37 AM

I mean an IR test where we can see what lowering has done, not just total allocation amount.

Probably then we need to dump the MIR result by passing the option "--print-after-isel", and then test for the memory addresses. But I see these addresses are still based on the alignment info available in the IR. So I think we first need to explicitly update the alignment of LDS globals in a separate opt pass just before the ISel pass?

hsmhsm updated this revision to Diff 345718.May 16 2021, 11:44 AM

Updated the llvm lit test to check the output of the "instruction selection" phase.

I have updated the lit test to check the output of the "instruction selection" pass.

hsmhsm updated this revision to Diff 345751.May 16 2021, 10:47 PM

Updated failing lit test - CodeGen/MIR/AMDGPU/machine-function-info.ll

If we already have a pass that condenses the LDS globals into a single variable to access, what is the advantage of this? Why can't we just always do that compacting and codegen will then not have to worry about optimal LDS packing since it will only see the one global

But do we really have one right now? As you might know, "LowerModuleLDSPass" does not guarantee it (at least as it exists today). We need to have something working until we have a kind of perfect solution. And I strongly believe this patch is really useful, in the sense that it will definitely increase the probability of an unaligned DS access being aligned at runtime. Moreover, it only touches kernels, not every single function in the module, as you said on one of the earlier patches. I do not see any harm in having this patch, unless I am missing something.

I'm saying you're putting a nontrivial IR inspection into a codegen pass. It would be cleaner if we could leave the allocation optimizations entirely in an IR pass.

At the IR level (in an opt pass), you can probably do the following:

(1) infer and fix the LDS alignment
(2) update the align attribute of LDS globals and LDS memory operations based on (1)

But sorting of the LDS globals used per kernel can be achieved only in a codegen pass, as done in this patch. The reason is that I do not see any neat way to pass an analysis result built in an IR pass on top of the IR through to a codegen pass.

Maybe we can have it for the time being and, as a next step, see what best we can do for it.

hsmhsm updated this revision to Diff 345918.May 17 2021, 9:58 AM

Moved a few utility functions to the AMDGPULDSUtils.cpp file, since we may end up reusing them in the near future.

If you rewrite all the accesses relative to a single LDS global, there's nothing for codegen to sort.

But do we have any immediately available neat solution(s) that will make sure that all the LDS globals used within a kernel are replaced by a single struct (or some other) instance? We have been struggling with this for a long time. If we solve that problem in the near future, then this patch becomes irrelevant and we can revert it at that point; until then, this patch is essential for the immediate need. But let's move all the utility code from AMDGPUMachineFunction to the utility file.

hsmhsm updated this revision to Diff 345921.May 17 2021, 10:13 AM

Move utility functions from AMDGPUMachineFunction to utility file

But do we have any immediately available neat solution(s) that will make sure that all the LDS globals used within a kernel are replaced by a single struct (or some other) instance?

That's essentially what AMDGPULowerModuleLDSPass does, and could be extended to have a kernel specific table

rampitec added inline comments.May 17 2021, 1:16 PM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

No, it is a separate subclass of User: https://llvm.org/doxygen/classllvm_1_1User.html

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

I do not like the idea of allocating memory, then copying.

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

The test checking MIR is not any better than a test checking instructions. In fact it is worse, because it is using the unstable MIR interface. It still does not check the actual layout in any transparent way; it checks the offsets. You can check the offsets in the produced ISA.

It does not help that the checks are autogenerated; it obscures the relevant checks. For that sort of test you need to check the offsets, nothing more. It also depends on the assumption that stores will not be rescheduled, and there is no such guarantee in this case. To work around it you can store different values (e.g. 1, 2, 3, 4, etc.), check that in the ISA, and mark the stores volatile.

One thing it does not check is that we do not allocate more LDS than used. It does not check the situation with multiple kernels using intersecting LDS variable sets.

I was looking at a test which will really check memory layout, like that is possible with AMDGPULowerModuleLDSPass. In fact I am probably seconding Matt: we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.
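
A sketch of the kind of test being described above (the global names, values, and kernel are illustrative only, not the contents of lds-allocation.ll):

@lds.a = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4
@lds.b = internal unnamed_addr addrspace(3) global [2 x i8] undef, align 2

define amdgpu_kernel void @kern() {
  ; Distinct stored values plus volatile keep the stores identifiable
  ; and in order, so the ds_write offsets can be matched in the ISA.
  store volatile i8 1, i8 addrspace(3)* getelementptr inbounds ([4 x i8], [4 x i8] addrspace(3)* @lds.a, i32 0, i32 0)
  store volatile i8 2, i8 addrspace(3)* getelementptr inbounds ([2 x i8], [2 x i8] addrspace(3)* @lds.b, i32 0, i32 0)
  ret void
}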

That's essentially what AMDGPULowerModuleLDSPass does, and could be extended to have a kernel specific table

Ok, assume that we will extend AMDGPULowerModuleLDSPass to have a kernel-specific table. What I am not getting is how to use this table to replace LDS globals with their offset counterparts within non-kernel functions. Passing a kernel-id as an argument is ruled out, since the presence of indirect calls really breaks this: function pointers can be used in so many different ways.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

But if I look into the implementation of Operator at https://llvm.org/doxygen/Operator_8h_source.html, it clearly tells me that an instance of the Operator class will never be created; it is really a placeholder for Instruction and ConstantExpr.

/// Return the opcode for this Instruction or ConstantExpr.
unsigned getOpcode() const {
  if (const Instruction *I = dyn_cast<Instruction>(this))
    return I->getOpcode();
  return cast<ConstantExpr>(this)->getOpcode();
}
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Then you mean to say, let's return a vector instead of passing it as a reference argument? Then we cannot use SmallVectorImpl<> as the return type, so we need to go back to returning SmallVector<> itself.

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

I think all of you are kind of suggesting extending AMDGPULowerModuleLDSPass. But I am not sure how easily it can be extended; moreover, it is being invoked before the PromoteAlloca pass at the moment. We really need to discuss it and come to a common conclusion before proceeding further.

For me, it looks like there is a ball, half of which is painted white and the other half black; a few are looking at the white part, others at the black part, and we are not able to come to a common conclusion.

foad added inline comments.May 18 2021, 2:01 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

To avoid allocating memory for a copy of the data, you could filter StaticLDSGlobals in-place with erase(remove_if(StaticLDSGlobals.begin(), StaticLDSGlobals.end(), predicate), StaticLDSGlobals.end()).

hsmhsm added inline comments.May 18 2021, 2:52 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Probably we land up with code like the one below. We still need to copy to LocalVars, because findVariablesToLower() expects a std::vector return. And, based on earlier review comments, I have made changes to collectStaticLDSGlobals() to work with SmallVector. Additionally, look at the complexity of the code change in terms of understanding it. So I thought it is not worth it, since we still need to copy anyway.

StaticLDSGlobals.erase(
    std::remove_if(StaticLDSGlobals.begin(), StaticLDSGlobals.end(),
                   [&](const GlobalVariable *GV) {
                     GlobalVariable *GV2 = const_cast<GlobalVariable *>(GV);
                     if (std::none_of(GV2->user_begin(), GV2->user_end(), [&](User *U) {
                           return userRequiresLowering(UsedList, U);
                         })) {
                       return true;
                     }
                     return false;
                   }),
    StaticLDSGlobals.end());

std::vector<llvm::GlobalVariable *> LocalVars(StaticLDSGlobals.begin(),
                                              StaticLDSGlobals.end());
hsmhsm added inline comments.May 18 2021, 3:01 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

Additionally, there is also the added complexity of handling the "constness" of "GlobalVariable *".

foad added inline comments.May 18 2021, 3:24 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236
if (std::none_of(GV2->user_begin(), GV2->user_end(), [&](User *U) {
      return userRequiresLowering(UsedList, U);
    })) {
  return true;
}
return false;

This should just return std::none_of(...);

Additionally, there is also a added complexity of handling "constness" of "GlobalVariable *"

Then fix userRequiresLowering first? Something like: https://reviews.llvm.org/differential/diff/346095/

hsmhsm added inline comments.May 18 2021, 3:35 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
236

This should just return std::none_of(...);

Agree

Then fix userRequiresLowering first? Something like: https://reviews.llvm.org/differential/diff/346095/

I assume it is not as simple as that, and I guess it will force me to make changes within AMDGPULowerModuleLDSPass.cpp, which I want to avoid at this point. These kinds of changes are better implemented as a separate code-refactoring patch. But nevertheless, if you think it is a must, then I will do it.

rampitec added inline comments.May 18 2021, 11:09 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

Example:

%sunkaddr4 = getelementptr inbounds i8, i8* bitcast ({ %union, [2000 x i8] }* @global_pointer to i8*), i64 %sunkaddr3
llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

PromoteAlloca is probably not an issue; it will just add a new allocation past that struct. It should already be properly aligned, as we are creating it ourselves. The current waste of LDS by the AMDGPULowerModuleLDSPass is an issue though. Ideally it would be nice to come up with a single solution for functions and kernels, which I guess we will continue discussing...

hsmhsm added inline comments.May 19 2021, 1:00 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

If I understand it correctly, @global_pointer is used within a constant expression - a bitcast - and this constant expression is used within an instruction - a getelementptr.

And this case is neatly handled in the code. I am not sure how relevant it is to the "Operator" that you brought up. I still believe there is no need to check for "Operator".

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

I still do not understand if we can really handle it within AMDGPULowerModuleLDSPass. Expecting a single struct variable, and only a single struct variable, within every kernel is too idealistic an expectation at this point. And I am not sure why you are insisting on it.

Strictly speaking, this pass does not update the align attribute of LDS globals; it only fixes the alignment and allocates memory accordingly. So the lit tests should check the allocation boundaries and the total allocated size, and that is what I did here. If you are not happy with the MIR result, then I can check it on the final assembly.

To answer some of your earlier comments:

The test checking MIR is not any better than a test checking instructions. In fact it is worse, because it is using the unstable MIR interface. It still does not check the actual layout in any transparent way; it checks the offsets. You can check the offsets in the produced ISA.

I can change it to check instructions instead of MIR. Based on the sorting algorithm implemented, we know which LDS global should be allocated at which address, and how much LDS should be allocated in total. That is what I did here.

It does not help that the checks are autogenerated; it obscures the relevant checks. For that sort of test you need to check the offsets, nothing more. It also depends on the assumption that stores will not be rescheduled, and there is no such guarantee in this case. To work around it you can store different values (e.g. 1, 2, 3, 4, etc.), check that in the ISA, and mark the stores volatile.

Let me see, what is missing here.

One thing it does not check is that we do not allocate more LDS than used.

I am checking for it as in "SDAG-HSA: amdhsa_group_segment_fixed_size 31"

It does not check the situation with multiple kernels using intersecting LDS variable sets.

I will try to improve the test cases.

I was looking at a test which will really check memory layout like that is possible with AMDGPULowerModuleLDSPass

Checking which LDS global is allocated at which address is not memory layout? Could you please give an example of what you are expecting?

In fact I am probably seconding Matt: we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.

I doubt if we can quickly achieve it considering all the mess happening right now.

hsmhsm updated this revision to Diff 346361.May 19 2021, 1:42 AM

Rebased to latest main.

hsmhsm updated this revision to Diff 346415.May 19 2021, 5:37 AM

Update lit tests as below:

[1] Check instructions instead of MIR
[2] Add test to check LDS used with multiple kernels
[3] Add test to check that llvm.amdgcn.module.lds is allocated at address 0

hsmhsm updated this revision to Diff 346417.May 19 2021, 5:43 AM

Fixed comment within lit test.

hsmhsm updated this revision to Diff 346429.May 19 2021, 6:18 AM

Within the lit tests, make sure that different values are stored to the LDS globals so that it is a bit easier to track which memory is allocated for which LDS global.

rampitec added inline comments.May 19 2021, 10:31 AM
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
61

OK, I was unable to exploit it; it is indeed cast to Constant.

llvm/test/CodeGen/AMDGPU/GlobalISel/lds-allocation.ll
35

One remaining problem with these tests is that they now depend on the register allocation and the actual register names. That will be difficult to maintain.

llvm/test/CodeGen/AMDGPU/lds-allocation.ll
36

I was looking at a test which will really check memory layout like that is possible with AMDGPULowerModuleLDSPass

Checking which LDS global is allocated at which address is not memory layout? Could you please give an example of what you are expecting?

I was thinking about a test just showing the IR globals after the transformation, but I guess since this is still just lowering, the IR is not updated and such a test is not possible.

In fact I am probably second to Matt we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.

I doubt if we can quickly achieve it considering all the mess happening right now.

Given that it does not solve minimal per-kernel allocation, as we discussed, yes, we cannot just use it right away.

hsmhsm abandoned this revision.May 21 2021, 2:33 AM

Since we are planning different solutions, I am abandoning this patch.