In order to increase the probability of aligned accessing of LDS memory
operations, sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.
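As a rough illustration of the sort-then-allocate idea described in the summary (a hedged sketch, not the patch itself; the helper name and the exact ordering criteria are assumptions for illustration):

```cpp
// Hypothetical sketch: sort LDS globals and assign offsets in that order.
// Names and the descending alignment-then-size order are illustrative only.
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/Support/Alignment.h"
#include <algorithm>
#include <vector>
using namespace llvm;

static uint64_t allocateLDS(std::vector<GlobalVariable *> &LDSGlobals,
                            const DataLayout &DL) {
  // Place the most strictly aligned (and, within that, the largest) globals
  // first so that later, smaller globals fill the remaining space.
  std::stable_sort(LDSGlobals.begin(), LDSGlobals.end(),
                   [&DL](const GlobalVariable *A, const GlobalVariable *B) {
                     uint64_t AlignA = A->getAlign().valueOrOne().value();
                     uint64_t AlignB = B->getAlign().valueOrOne().value();
                     if (AlignA != AlignB)
                       return AlignA > AlignB;
                     uint64_t SizeA = DL.getTypeAllocSize(A->getValueType());
                     uint64_t SizeB = DL.getTypeAllocSize(B->getValueType());
                     return SizeA > SizeB;
                   });

  uint64_t Offset = 0;
  for (GlobalVariable *GV : LDSGlobals) {
    Offset = alignTo(Offset, GV->getAlign().valueOrOne());
    // In the real backend the offset would be recorded for this global here.
    Offset += DL.getTypeAllocSize(GV->getValueType());
  }
  return Offset; // total LDS size used
}
```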
Event Timeline
As far as I understand, all kernels in the module will use all the LDS globals irrespective of whether they actually need them? We may run out of LDS.
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp:51
llvm::find() and you do not need to include algorithm then.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h:41
I do not think you can keep per-module stuff in the TM.
Why do we need another pass if we always just produce llvm.amdgcn.module.lds such that codegen will always allocate one single global?
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h:41
You absolutely cannot do this here
This seems to put all variables into one struct. I was suggesting a struct type and instance per kernel, populated by the subset of variables that can be accessed by the kernel and the functions called by it. Strictly speaking this does do that, but the subset can usually be smaller than the set of all variables in the module.
Right, I think we need a minimal (or as small as we can if not minimal) set per kernel.
As I have mentioned in the email:
(1) Which means, to further improve the pass "LowerModuleLDS" implemented by Jon, we need to run this pass twice: (a) before PromoteAlloca and (b) after PromoteAlloca, and I am not sure if we will face any hurdle doing this.
(2) Carefully implement logic to traverse the call graph to find all the LDS globals reachable from each kernel (a rough sketch of such a traversal appears below, after this list).
(3) Create a separate struct type per kernel based on its reachable LDS set.
(4) Create an instance of the respective struct type within each kernel.
(5) Replace the uses of LDS within non-kernel functions with their respective struct-member counterparts, which would be different for different kernels.
Item (2) is problematic in the presence of the call graph. But we can probably make a conservative attempt and see if it works.
Item (5) is the one I have no idea how to do. @JonChesterfield do you have any idea in mind w.r.t. this?
All in all, it is going to take time.
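As a minimal sketch of item (2), assuming a simple worklist walk over direct calls only; the helper name, the use of address space 3 to identify LDS, and the operand handling are all assumptions for illustration, and indirect calls (or uses buried in constant expressions with offsets) would need a conservative fallback:

```cpp
// Hypothetical sketch: collect LDS globals reachable from a kernel by walking
// its direct callees. Not the actual pass; indirect calls are not handled.
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/InstrTypes.h"
using namespace llvm;

static void collectReachableLDS(Function &Kernel,
                                DenseSet<GlobalVariable *> &LDSSet) {
  SmallVector<Function *, 8> Worklist{&Kernel};
  DenseSet<Function *> Visited;
  while (!Worklist.empty()) {
    Function *F = Worklist.pop_back_val();
    if (!Visited.insert(F).second || F->isDeclaration())
      continue;
    for (Instruction &I : instructions(*F)) {
      // Record every LDS (address space 3) global referenced by this function.
      for (Value *Op : I.operand_values())
        if (auto *GV = dyn_cast<GlobalVariable>(Op->stripPointerCasts()))
          if (GV->getAddressSpace() == 3)
            LDSSet.insert(GV);
      // Follow direct calls to keep walking the call graph.
      if (auto *CB = dyn_cast<CallBase>(&I))
        if (Function *Callee = CB->getCalledFunction())
          Worklist.push_back(Callee);
    }
  }
}
```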
[AMDGPU] Sort LDS globals based on thier size and alignment.
Typo "their" :-)
In order to increase the probability of aligned accessing of LDS memory
operations,
Are you expecting this to affect instruction selection? Or just to increase the probability that the selected instructions will go fast because they will see an aligned address at run time?
sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.
Sorting seems like an odd way to do this, because it *might* happen to over-align a global, but only by chance depending on what other globals get sorted before it.
Instead how about explicitly over-aligning all globals with something like:
```
for each lds global:
    if size > 8:
        // we might want to use a b96 or b128 load
        alignment = max(alignment, 16)
    else if size > 4:
        // we might want to use a b64 load
        alignment = max(alignment, 8)
    else if size > 2:
        // we might want to use a b32 load
        alignment = max(alignment, 4)
    else if size > 1:
        // we might want to use a b16 load
        alignment = max(alignment, 2)
```
(You could still sort them after doing that, but the only reason would be to try to minimise wasted space due to alignment gaps.)
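A minimal C++ sketch of that suggestion, assuming it runs as an IR-level loop over the module's globals and that address space 3 identifies LDS; the function name is illustrative, not an existing API:

```cpp
// Hypothetical sketch of the suggested over-alignment heuristic; not the
// actual patch. Bumps each LDS global's alignment so wider loads stay aligned.
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Alignment.h"
#include <algorithm>
using namespace llvm;

static void overAlignLDSGlobals(Module &M) {
  const DataLayout &DL = M.getDataLayout();
  for (GlobalVariable &GV : M.globals()) {
    if (GV.getAddressSpace() != 3) // 3 is the AMDGPU local (LDS) address space
      continue;
    uint64_t Size = DL.getTypeAllocSize(GV.getValueType());
    Align A = GV.getAlign().valueOrOne();
    if (Size > 8)
      A = std::max(A, Align(16)); // we might want to use a b96 or b128 load
    else if (Size > 4)
      A = std::max(A, Align(8));  // we might want to use a b64 load
    else if (Size > 2)
      A = std::max(A, Align(4));  // we might want to use a b32 load
    else if (Size > 1)
      A = std::max(A, Align(2));  // we might want to use a b16 load
    GV.setAlignment(A);
  }
}
```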
It is to increase the probability that the selected instructions will go fast because they will see an aligned address at run time.
sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.
Sorting seems like an odd way to do this, because it *might* happen to over-align a global, but only by chance depending on what other globals get sorted before it.
Lately, I am actually of the same opinion - sorting will definitely help us reduce space overhead due to padding, but may not help with alignment.
Instead how about explicitly over-aligning all globals with something like:
```
for each lds global:
    if size > 8:
        // we might want to use a b96 or b128 load
        alignment = max(alignment, 16)
    else if size > 4:
        // we might want to use a b64 load
        alignment = max(alignment, 8)
    else if size > 2:
        // we might want to use a b32 load
        alignment = max(alignment, 4)
    else if size > 1:
        // we might want to use a b16 load
        alignment = max(alignment, 2)
```
(You could still sort them after doing that, but the only reason would be to try to minimise wasted space due to alignment gaps.)
Looks like a good suggestion to me.
I will abandon this patch. I am working on another patch which is implemented within llc (AMDGPUMachineFunction). Will post that new patch soon.