This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Sort LDS globals based on their size and alignment.
AbandonedPublic

Authored by hsmhsm on May 12 2021, 9:36 AM.

Details

Summary

In order to increase the probability of aligned access for LDS memory
operations, sort LDS globals based on their size and alignment, and then
allocate memory in that sorted order.
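
As a rough illustration of the intent (a sketch only, not the patch's actual code; the helper name is made up for this note), the sort might look something like this, assuming the allocator already has the list of LDS globals:

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"

using namespace llvm;

// Illustrative helper: lay out the most strictly aligned globals first and,
// for equal alignment, the largest globals first, so that later (smaller)
// globals fill the remaining gaps.
static void sortLDSGlobals(SmallVectorImpl<GlobalVariable *> &LDSGlobals,
                           const DataLayout &DL) {
  llvm::stable_sort(LDSGlobals, [&DL](const GlobalVariable *A,
                                      const GlobalVariable *B) {
    Align AlignA =
        DL.getValueOrABITypeAlignment(A->getAlign(), A->getValueType());
    Align AlignB =
        DL.getValueOrABITypeAlignment(B->getAlign(), B->getValueType());
    if (AlignA != AlignB)
      return AlignA > AlignB; // higher alignment first
    uint64_t SizeA = DL.getTypeAllocSize(A->getValueType());
    uint64_t SizeB = DL.getTypeAllocSize(B->getValueType());
    return SizeA > SizeB; // larger size first
  });
}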

Diff Detail

Event Timeline

hsmhsm created this revision.May 12 2021, 9:36 AM
hsmhsm requested review of this revision.May 12 2021, 9:36 AM
Herald added a project: Restricted Project. · May 12 2021, 9:36 AM

As far as I understand, all kernels in the module will use all the LDS globals irrespective of whether they actually need them? We may run out of LDS.

llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp
51

Use llvm::find(), and then you do not need to include <algorithm>.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
41

I do not think you can keep per-module stuff in the TM.

Why do we need another pass if we always just produce llvm.amdgcn.module.lds such that codegen will always allocate one single global?

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
41

You absolutely cannot do this here

This seems to put all variables into one struct. I was suggesting a struct type and instance per kernel, populated by the subset of variables that can be accessed by the kernel and the functions called by it. Strictly speaking this does do that, but the subset can usually be smaller than the set of all variables in the module.

This seems to put all variables into one struct. I was suggesting a struct type and instance per kernel, populated by the subset of variables that can be accessed by the kernel and the functions called by it. Strictly speaking this does do that, but the subset can usually be smaller than the set of all variables in the module.

Right, I think we need a minimal (or as small as we can if not minimal) set per kernel.
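
For illustration only, a per-kernel struct along these lines might be created roughly as follows (the helper and the naming scheme are hypothetical, not something an existing pass already does):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Hypothetical helper: build one struct type and one LDS instance per kernel,
// containing only the LDS globals reachable from that kernel.
static GlobalVariable *
buildKernelLDSStruct(Module &M, StringRef KernelName,
                     ArrayRef<GlobalVariable *> ReachableLDS) {
  SmallVector<Type *, 8> Fields;
  for (GlobalVariable *GV : ReachableLDS)
    Fields.push_back(GV->getValueType());

  std::string TypeName =
      ("llvm.amdgcn.kernel." + KernelName + ".lds.t").str();
  StructType *StructTy = StructType::create(M.getContext(), Fields, TypeName);

  // Address space 3 is LDS on AMDGPU.
  return new GlobalVariable(M, StructTy, /*isConstant=*/false,
                            GlobalValue::InternalLinkage,
                            UndefValue::get(StructTy),
                            "llvm.amdgcn.kernel." + KernelName + ".lds",
                            /*InsertBefore=*/nullptr,
                            GlobalValue::NotThreadLocal,
                            /*AddressSpace=*/3);
}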

As far as I understand, all kernels in the module will use all the LDS globals irrespective of whether they actually need them? We may run out of LDS.

Hmm, I did not realize it. This patch is totally bad then.

This seems to put all variables into one struct. I was suggesting a struct type and instance per kernel, populated by the subset of variables that can be accessed by the kernel and the functions called by it. Strictly speaking this does do that, but the subset can usually be smaller than the set of all variables in the module.

Right, I think we need a minimal (or as small as we can if not minimal) set per kernel.

As I have mentioned in the email -

(1) To further improve the pass "LowerModuleLDS" implemented by Jon, we need to run this pass twice: (a) before PromoteAlloca and (b) after PromoteAlloca, and I am not sure if we will face any hurdle doing this.
(2) Carefully implement logic to traverse the call graph to find all reachable LDS globals.
(3) Create a separate struct type per kernel based on its reachable LDS set.
(4) Create an instance of the respective struct type for each kernel.
(5) Replace uses of LDS within non-kernel functions by their respective struct-member counterparts, which would be different for different kernels.

Item (2) is problematic in the presence of a call graph, but we can probably make a conservative attempt and see if it works (a rough sketch of such a traversal follows below).
Item (5) I have no idea how to do. @JonChesterfield, do you have any idea in mind w.r.t. this?
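
For item (2), a conservative traversal might look roughly like the sketch below. It assumes the standard llvm::CallGraph analysis and only follows direct calls; the helper name is made up, and a real implementation would also have to look through nested constant expressions and handle indirect calls conservatively:

#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/CallGraph.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Hypothetical helper: collect LDS (addrspace(3)) globals used by the kernel
// or by any function reachable from it through direct calls.
static void collectReachableLDS(const Function &Kernel, const CallGraph &CG,
                                DenseSet<const GlobalVariable *> &ReachableLDS) {
  DenseSet<const Function *> Visited;
  SmallVector<const Function *, 8> Worklist = {&Kernel};
  while (!Worklist.empty()) {
    const Function *F = Worklist.pop_back_val();
    if (!Visited.insert(F).second)
      continue;
    // Record every LDS global referenced by an instruction operand in F.
    for (const BasicBlock &BB : *F)
      for (const Instruction &I : BB)
        for (const Value *Op : I.operands())
          if (auto *GV = dyn_cast<GlobalVariable>(Op->stripPointerCasts()))
            if (GV->getAddressSpace() == 3) // LDS address space on AMDGPU
              ReachableLDS.insert(GV);
    // Follow direct callees recorded in the call graph.
    for (const auto &Edge : *CG[F])
      if (const Function *Callee = Edge.second->getFunction())
        Worklist.push_back(Callee);
  }
}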

All in all, it is going to take time.

foad added a comment.May 13 2021, 1:36 AM

[AMDGPU] Sort LDS globals based on thier size and alignment.

Typo "their" :-)

In order to increase the probability of aligned accessing of LDS memory
operations,

Are you expecting this to affect instruction selection? Or just to increase the probability that the selected instructions will go fast because they will see an aligned address at run time?

sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.

Sorting seems like an odd way to do this, because it *might* happen to over-align a global, but only by chance depending on what other globals get sorted before it.

Instead how about explicitly over-aligning all globals with something like:

for each lds global:
  if size > 8:
    // we might want to use a b96 or b128 load
    alignment = max(alignment, 16)
  else if size > 4:
    // we might want to use a b64 load
    alignment = max(alignment, 8)
  else if size > 2:
    // we might want to use a b32 load
    alignment = max(alignment, 4)
  else if size > 1:
    // we might want to use a b16 load
    alignment = max(alignment, 2)

(You could still sort them after doing that, but the only reason would be to try to minimise wasted space due to alignment gaps.)
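
In IR terms, that rule might be applied per global with something like the sketch below (where exactly this should run is not settled here, and the helper name is illustrative):

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/Support/Alignment.h"

using namespace llvm;

// Hypothetical helper: bump an LDS global's alignment so that the widest
// load/store we might want to form (b16/b32/b64/b96/b128) sees a suitably
// aligned address, mirroring the thresholds in the pseudocode above.
static void overAlignLDSGlobal(GlobalVariable &GV, const DataLayout &DL) {
  if (GV.getAddressSpace() != 3) // only LDS (addrspace 3 on AMDGPU)
    return;
  uint64_t Size = DL.getTypeAllocSize(GV.getValueType());
  Align Alignment = GV.getAlign().valueOrOne();
  if (Size > 8)
    Alignment = std::max(Alignment, Align(16)); // allow b96/b128 access
  else if (Size > 4)
    Alignment = std::max(Alignment, Align(8));  // allow b64 access
  else if (Size > 2)
    Alignment = std::max(Alignment, Align(4));  // allow b32 access
  else if (Size > 1)
    Alignment = std::max(Alignment, Align(2));  // allow b16 access
  GV.setAlignment(Alignment);
}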

[AMDGPU] Sort LDS globals based on thier size and alignment.

Typo "their" :-)

In order to increase the probability of aligned accessing of LDS memory
operations,

Are you expecting this to affect instruction selection? Or just to increase the probability that the selected instructions will go fast because they will see an aligned address at run time?

It is to increase the probability that the selected instructions will go fast because they will see an aligned address at run time.

sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.

Sorting seems like an odd way to do this, because it *might* happen to over-align a global, but only by chance depending on what other globals get sorted before it.

Lately, I am actually of the same opinion - sorting will definitely help us reduce the space overhead due to padding, but may not help with alignment.

Instead how about explicitly over-aligning all globals with something like:

for each lds global:
  if size > 8:
    // we might want to use a b96 or b128 load
    alignment = max(alignment, 16)
  else if size > 4:
    // we might want to use a b64 load
    alignment = max(alignment, 8)
  else if size > 2:
    // we might want to use a b32 load
    alignment = max(alignment, 4)
  else if size > 1:
    // we might want to use a b16 load
    alignment = max(alignment, 2)

(You could still sort them after doing that, but the only reason would be to try to minimise wasted space due to alignment gaps.)

Looks like a good suggestion to me.

hsmhsm abandoned this revision.May 13 2021, 6:44 AM

I will abandon this patch; I am working on yet another patch, implemented within llc (AMDGPUMachineFunction). I will post that new patch soon.