Page MenuHomePhabricator

[AMDGPU] Replace uses of LDS globals within non-kernel functions by pointers.
AbandonedPublic

Authored by hsmhsm on Apr 26 2021, 10:25 AM.

Details

Summary

The pass - "Lower Module LDS" supports use of LDS globals within non-kernel
functions by lowering LDS globals as follows. It packs within non-kernel used
LDS globals into a struct type, and creates an instance of that struct type
within every kernel at "address zero".

However, the pass - "Lower Module LDS" sometime wastes LDS memory depending
on the pattern of LDS globals use within the module.

Hence the current pass makes an attempt to aid the pass - "Lower Module LDS"
for efficient LDS memory usage. The idea behind current pass is as follows:

  • Instead of directly packing LDS globals into the struct as struct members, create global LDS pointers correspoding those LDS globals.
  • Initialize those global LDS pointers with their respective LDS globals.
  • Replace all the non-kernel function scope use of those original LDS globals by their respective pointer counter-parts.
  • Then the pass "Lower Module LDS" by the virtue of its implementation idea, lands-up packing only LDS pointers as struct members, which substentially reduces unnecessary LDS memory usage.

Diff Detail

Event Timeline

hsmhsm created this revision.Apr 26 2021, 10:25 AM
hsmhsm requested review of this revision.Apr 26 2021, 10:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2021, 10:25 AM
hsmhsm updated this revision to Diff 342231.May 2 2021, 4:17 AM

Rebase to latest upstream main.

I can't work out from the implementation, or the comments, what the condition is for deciding whether to transform an LDS variable. Exactly which cases is this intended to transform?

Emitting stores in the entry block is probably not sufficient to ensure the pointers have the right value when later read from. The stores will look dead and can be eliminated or moved across the function calls. One of the attractions of implementing LDS initializers in the back end is ensuring that such transforms cannot break the lowering. Alternatively, the lower module pass + backend coupling could be updated so that an object can be used instead of zero, at which point the IR passes will natively understand the connection.

In order to exercise this code at runtime, we need executable code that uses LDS from functions. I think that will be necessary to be confident that the transform proposed here works correctly. Possibly also a microbenchmark to determine which cases this makes faster as well.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
84

This is an optimisation. Instead of a fatal error, it can always leave the variable unchanged. Therefore it should not abort compilation.

I can't work out from the implementation, or the comments, what the condition is for deciding whether to transform an LDS variable. Exactly which cases is this intended to transform?

I in-fact implemented what you advocated-for in the abondened patch. that is,

for each LDS variable:

if should-transform
  create 16 bit integer in LDS
  initialize that global with (constexpr) address of variable
  replace all uses of variable with a (constexpr) access through new pointer

where
should-transform:
if (sizeof) < 8ish return false
if used by instruction in indirectly called function return false
if only used by kernels return false
probably other exclusions
return true

Emitting stores in the entry block is probably not sufficient to ensure the pointers have the right value when later read from. The stores will look dead and can be eliminated or moved across the function calls. One of the attractions of implementing LDS initializers in the back end is ensuring that such transforms cannot break the lowering. Alternatively, the lower module pass + backend coupling could be updated so that an object can be used instead of zero, at which point the IR passes will natively understand the connection.

In that case, I will abandon this patch, and stop spending any time on it. I hope that you will continue to strenthen your patch accordingly.

In order to exercise this code at runtime, we need executable code that uses LDS from functions. I think that will be necessary to be confident that the transform proposed here works correctly. Possibly also a microbenchmark to determine which cases this makes faster as well.

When we do not still have a common agreement on the implementation itself, threre is no point in microbenchmarking, and following the same anology your original patch needs it in the first place. So, I just stop working on it anymore by assuming that you will take care of updating your original patch when you feel that the optimal LDS usage required.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
84

call to reportReplaceLDSError() is being made, when the pass cannot proceed further for one or the other reason (internal error situation). And, it is perfectly valid for any optimiation pass to abonden and report an error when it faces some internal error kind of situation.

arsenm added inline comments.May 10 2021, 2:46 PM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
270–272

This is a weird place for pass documentation, just put this in the file header.

The wording is also weird. Current pass doesn't mean anything in this context

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
12

No quotes around address zero?

17

s/current pass/this pass/

26

Typo substentially

28

Add a pseudocode example?

55–59

Don't see much point in this. If it's worth reporting a specific error code, it's worth adding a DiagnosticInfo type for it (but I don't think it is)

69–72

Shouldn't just have a generic error

74–80

I think you shouldn't error here, and just handle these assuming the external call could access the variable, wherever it may end up

84

This should also go through DiagnosticInfo rather than report_fatal_error

88–91

There's already an is_contained

105

getInstructions is overly generic. How about convertConstExprsToInstructions?

125–126

It feels wrong that you would need to explicitly test this, and this case would be naturally handled by your walk through the constantexpr

135

I'd prefer to avoid recursion to analyze constantxeprs

215

Don't see why you need to store this, can just loop through and handle each kernel as it appears

222

ValueMap? You do have value replacements occuring

226

DenseMap?

300–301

This is just what happens on first map access anyway

334–335

cast instead of assert on dyn_cast

363

Should not be using ptrtoint, keep everything as i16 and use GEP to index off the base

436

"Current threshold" part unnecessary

436–438

I don't understand why there would be a size threshold here

442–445

Directly return .empty()?

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
137–140

This function shouldn't exist, just use DL.getTypeAllocSize

hsmhsm abandoned this revision.Tue, May 25, 11:21 PM

We have taken a decision to implement pointer replacement algorithm within LowerModuleLDS pass itself instead of as a seperate pass. Hence abandoning this patch. Will be uploading new patch shortly.