This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Lower kernel LDS into a sorted structure
ClosedPublic

Authored by rampitec on May 21 2021, 4:22 PM.

Diff Detail

Event Timeline

rampitec created this revision.May 21 2021, 4:22 PM
rampitec requested review of this revision.May 21 2021, 4:22 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 21 2021, 4:22 PM
Herald added a subscriber: wdng. · View Herald Transcript

Couple of minor comments above, overall approach looks sound to me

llvm/lib/IR/Value.cpp
535

since this hazard affects all the targets, we should probably fix it outside of the amdgpu specific LDS patch

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
37

Is this safe (enough) for space? Wonder if it's another place for a heap allocated worklist

rampitec added inline comments.May 21 2021, 4:54 PM
llvm/lib/IR/Value.cpp
535

I can make a separate patch, but the testcase to exploit it here. It needs to see a constant use to fail.

hsmhsm added inline comments.May 21 2021, 6:52 PM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
39

For me, this logic looks incomplete in general sense. Constants are constants throughout compilation phase, meaning, once a constant is created, it is reused in multiple contexts when required. So this is possible case.

void kern_0() {

Inst_1(Const_1(Const_3(LDS)));

}

void kern_1() {

Inst_1(Const_2(Const_3(LDS)));

}

Here Inst_1 and Inst_2 are two different instances, even if they are exactly same, but Const_3 is one-and-the same for both kern_0() and kern_1(). Will this case works here?

47

This is fine, but combining the testing of two different requirements for both module lowering and kernel lowering within same function looks maintenance headache to me. May be it is useful to separate them into two different functions, and findVariablesToLower() call one of them, based on F. But it is upto you. I am fine either way.

As I understand this patch - changes to Value::replaceUsesWithIf() is a fundamental change, so it is better that some one from non AMDGPU team also review it.

Crux of the whole logic within the patch is from two functions - (1) shouldLowerLDSToStruct() and (2) isUsedOnlyFromFunction(). Any bug here will affect the whole functionality of the pass. So we need to be extra careful while reviweing these two functions.

hsmhsm added inline comments.May 21 2021, 9:56 PM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
232

There is some problem here - because of this, llvm -lit tests including your own newly added tests are failing.

After changing

Twine VarName(F ? "llvm.amdgcn.kernel." + Twine(F->getName() + ".lds") : "llvm.amdgcn.module.lds");

To

Twine VarName(F ? Twine("llvm.amdgcn.kernel.") + Twine(F->getName()) + Twine(".lds") : "llvm.amdgcn.module.lds");

it works.

hsmhsm added inline comments.May 21 2021, 10:55 PM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
255

We end-up erasing and re-constructing "llvm.used" as many times as number of kernels which have undergone LDS lowering within them, plus for a very first module level lowering. Not sure about the side-effect it. Also may not worth, erasing and re-constructing "llvm.used" so many times in my openion. But it is left to you.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
39

Further, I verified that below test-case misses lowering because of above incomplete logic, but lowering is required here.

@kern = addrspace(3) global float undef, align 4

define amdgpu_kernel void @k0() {
  %ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
  ret void
}

define amdgpu_kernel void @k1() {
  %ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast (float addrspace(3)* @kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
  ret void
}
foad added a subscriber: foad.May 21 2021, 11:55 PM
hsmhsm added inline comments.May 22 2021, 9:14 AM
llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll
44

@kern is not lowered here, and same below at line 46.

llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll
10

@var1 is kernel lowered. so it should have 0 use, and got erased. Why it is still there? and we should have CHECK-NOT test for @var1.

hsmhsm added inline comments.May 22 2021, 6:38 PM
llvm/lib/IR/Value.cpp
549

I do not understand, why are we not considering "GlobalVlaue" here. It is upto caller who calls Value::replaceUsesWithIf() to decide on it, and this caller is the one who decide on the logic of the callback of ShouldReplace(). If this caller decide to replace "this" by "New" even in global scope (which is probably an initializer to global), Value::replaceUsesWithIf() should honor it.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
256

Further I have my own doubt about the why we actually doing this - removeFromUsedLists().

I think, it is problamatic even for moduleLDS lower().

When we erase and reconstruct llvm.used/llvm.compiler.used, are we sure that we do not land-up with stale references to (LDS) globals which were added to llvm.used/llvm.compiler.used as used?

As I understand it, the story with "usedList" is as below in this pass:

"When an LDS is used only within llvm.used/llvm.compiler.used, but *nowhere else*, then, we don't want to lower such LDS, that we take care while collecting the globals (for lowering) at the begining. On the other hand, when LDS is used within llvm.used/llvm.compiler.used, but also in other scope (function scope for module lowering), and (kernel scope for kernel lowering), then we want to lower such LDS. That means, we endup replacing LDS within lvm.used/llvm.compiler.used by its struct offset."

If above is what we want to achieve, then why do we need to have call to removeFromUsedLists()?

As far as I understand, call to removeFromUsedLists() would cause problems for kernel lowering because of repeated erasing and re-constructing of llvm.used/llvm.compiler.used.

foad added inline comments.May 24 2021, 2:21 AM
llvm/include/llvm/IR/Value.h
313–314

I guess you need to update the "does not support ... constant users" here?

321–322

... and here.

rampitec updated this revision to Diff 347461.May 24 2021, 11:33 AM
rampitec marked 3 inline comments as done.

Fixed tidy and twine issues.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
232

clang-tidy also warns about local twine variable. Switched to std::string.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
37

It goes up the users and bails on a first non ConstantExpr use. The call stack is as deep as much nested ConstantExprs we have. To run into a problem we would need to have a ConstExpr of a form like gep(gep(gep(gep(...)))) nested really deep. If you still feel I need to switch to a worklist I will do it, it just looks unnecessary here to me.

rampitec updated this revision to Diff 347498.May 24 2021, 2:22 PM
rampitec marked 2 inline comments as done.

Addressed some review comments. Added test. Looking into llvm.used processing.

llvm/lib/IR/Value.cpp
549

This is simply a Constant handling code borrowed from replaceAllUsesWith(). See line 514 above.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
39

Right. This is exactly the case checked in the lower-kernel-lds-constexpr.ll, k0 and k1. Two kernels use the same constant and that cannot be lowered because we cannot replace the constant in one kernel without affecting another one. Handling of this case would require a transformation replacing a constant with an instruction. That is possible but not trivial and not handled in this pass.

At the same time lowering is NOT required. A safe action for kernel LDS lowering is to bail and not touch anything. This is optimization and not required for correct codegen.

39

For me, this logic looks incomplete in general sense. Constants are constants throughout compilation phase, meaning, once a constant is created, it is reused in multiple contexts when required. So this is possible case.

void kern_0() {

Inst_1(Const_1(Const_3(LDS)));

}

void kern_1() {

Inst_1(Const_2(Const_3(LDS)));

}

Here Inst_1 and Inst_2 are two different instances, even if they are exactly same, but Const_3 is one-and-the same for both kern_0() and kern_1(). Will this case works here?

Yes, this is the situation handled by handleOperandChange(), that is why a normal setUse() cannot be used in the first place. Added test k3 to the lower-kernel-lds-constexpr.ll.

47

This is fine, but combining the testing of two different requirements for both module lowering and kernel lowering within same function looks maintenance headache to me. May be it is useful to separate them into two different functions, and findVariablesToLower() call one of them, based on F. But it is upto you. I am fine either way.

I did not want to clone practically identical code. I hope after a series of changes there will be no module LDS at all, it will all go into a per-kernel allocation.

llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll
44

What do you mean? @kern is lowered. This is input IR, the output does not have it.

rampitec marked an inline comment as done.
rampitec added inline comments.
llvm/lib/IR/Value.cpp
535

since this hazard affects all the targets, we should probably fix it outside of the amdgpu specific LDS patch

D103051

rampitec updated this revision to Diff 347522.May 24 2021, 3:39 PM
rampitec marked an inline comment as done.

Handled a case when a variable would not be removed even if lowered.

llvm/test/CodeGen/AMDGPU/lower-module-lds-inactive.ll
10

There was a dangling use of constantexpr bitcast from llvm.used, so use list was not empty. Added a call to C->removeDeadConstantUsers() while destroying llvm.used. Negative check is also added.

rampitec updated this revision to Diff 347529.May 24 2021, 4:23 PM
rampitec marked 2 inline comments as done.

Remove erased variables from local UsedList set.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
256

Actually we do not just keep removing and recreating llvm.used, it is a converging process. We are only removing variables from there along the route when we replace them. We have to do it before we actually replace a variable because after replaceAllUsesWith() the llvm.used would contain a useless reference to the new structure (and verifier rightfully complaints about that spume).

For the kernel lowering an additional reason to drop it before the replacement is to have isUsedOnlyFromFunction() not to be stopped by a GlobalVariable use. Note that shouldLowerLDSToStruct() has a check that UsedList contains V->stripPointerCasts(), so llvm.used does not prevent the collection. isUsedOnlyFromFunction() does not need this handling because reference is already dropped by the removeFromUsedLists() call here.

For the case where a variable only used by an llvm.used we would not do anything at all since the variable simply will not get into the collection from findVariablesToLower().

We shall also consider the reason why an LDS global may appear in an used list in the first place. The use by a function creates an indirect use by a kernel (at least by an indirectly called function), I do not see any other reason to add the variable to a used list. For a kernel that is simply not needed because a use is direct and immediately visible. In reality this path is very unlikely to be taken in a kernel case, although shall work correctly nonetheless.

Moreover, when we get rid of module LDS, and will have only kernel LDS, plus will manage all the cases all LDS global will be completely replaced and lowered and unconditionally erased from used list. At that point we could just replace this logic with a bulk drop at the top of the runOnModule().

However, while exploring all of that I have realized there is a bug in the implementation, I have to erase a variable from local set UsedList when erasing so we do not have dangling pointers there. Fixed.

hsmhsm added inline comments.May 24 2021, 9:24 PM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
39

Why do we need a transformation replacing a constant with an instruction? I am not sure.

My understanding is this -

In the above example, defintely @kern is suppose to be lowered both within kernels @k0 and @k1. But since the struct offset for @kern would be different within both the kernels (assuming they also use other LDS globals), the current lowering flow - that is handling one kernel at a time is the limitation here as I understand it.

Instead let's assume below flow, then it is possible to achieve it. But, only problem is, this flow is overkill since we need to do some heavy work here which is bit against LLVM infrastructure.

1.  ConsolidatedGlobals = nullptr

2.  FOR (each kernel K) DO
        LDSGlobals = Collect LDS globals to be lowered
        Create struct type and struct instance for K, and save it.
        Set union LDSGlobals with ConsolidatedGlobals
     ENDFOR

3.   Call removeFromUsedList for ConsolidatedGlobals

4.   Construct an indexing table which given an LDS and a Kernel, it returns the corresponding struct offset.
 
5.   FOR (each lds global GV from ConsolidatedGlobals) DO
         FOR (each kernel K) DO
              Offset = Get offset for the pair (GV, K) from indexing table.
              GEF = Create a GEF constant expression using Offset.
              KernelUses = Collect all uses of GV within kernel K.
              FOR (each use U from KernelUses) DO
                  Replace GV within U by GEF
              ENDFOR
         ENDFOR
      ENDFOR

6.   FOR (each global GV from ConsolidatedGlobals) DO
         Erase GV from module.
      ENDFOR
llvm/test/CodeGen/AMDGPU/lower-module-lds-constantexpr.ll
44

My bad, I misunderstood it, please ignore it.

rampitec added inline comments.May 25 2021, 2:00 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
39

A constant is uniqued. That means being used from two different kernels it is still the same constant. To lower it differently in different kernels we need to get rid of that modality and duplicate it. One way is to construct a new constant, another is to construct an instruction(s).

The only reason now it is a constant is because it was simpler to implement, one do not need to find a proper insertion point. With kernels and uniqued constants that doesn't work anymore.

hsmhsm added inline comments.May 25 2021, 5:38 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
39

Oh! yes, get it.

hsmhsm accepted this revision.May 25 2021, 5:42 AM

Assuming that we have known limitations with handling of same constant expression appearing in two different kernels, this patch looks okay to me.

This revision is now accepted and ready to land.May 25 2021, 5:42 AM
This revision was landed with ongoing or failed builds.May 25 2021, 11:29 AM
This revision was automatically updated to reflect the committed changes.