This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix missing lowering of LDS used in global scope.
ClosedPublic

Authored by hsmhsm on May 31 2021, 10:18 PM.

Diff Detail

Event Timeline

hsmhsm created this revision.May 31 2021, 10:18 PM
hsmhsm requested review of this revision.May 31 2021, 10:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 31 2021, 10:18 PM
hsmhsm updated this revision to Diff 348889.May 31 2021, 11:26 PM

Fixed clang-tidy warning.

hsmhsm updated this revision to Diff 348893.May 31 2021, 11:54 PM

Added new lit test-case - lower-kernel-lds-global-uses.ll

hsmhsm updated this revision to Diff 348895.Jun 1 2021, 12:15 AM

Updated the logic within shouldLowerLDSToStruct().

Is this the proposed fix for the missing lowering of the LDS in the following?

@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8

@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @lds to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64* addrspace(1)* @gptr to i8 addrspace(1)*) to i8*)], section "llvm.metadata"

define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}
hsmhsm added a comment.Jun 1 2021, 7:03 AM

Is this the proposed fix for the missing lowering of the LDS in the following?

@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8

@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @lds to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64* addrspace(1)* @gptr to i8 addrspace(1)*) to i8*)], section "llvm.metadata"

define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}

yes

For that test, I note that commenting out the llvm.used statement results in the variable being lowered. I.e. the following works fine:

@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8

define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}

It's therefore likely that the missed handling is related to skipping uses from the .used list. This change seems to be a complete rewrite of the algorithm in shouldLowerLDSToStruct. I'm not sure we have the test coverage to be confident that a different algorithm will work, can we fix the .used oversight directly instead?

hsmhsm added a comment.Jun 1 2021, 7:36 AM

For that test, I note that commenting out the llvm.used statement results in the variable being lowered. I.e. the following works fine:

@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8

define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}

It's therefore likely that the missed handling is related to skipping uses from the .used list. This change seems to be a complete rewrite of the algorithm in shouldLowerLDSToStruct. I'm not sure we have the test coverage to be confident that a different algorithm will work, can we fix the .used oversight directly instead?

If the current logic of shouldLowerLDSToStruct() is more robust compere to earlier ones, then, why cannot we use it, instead of patching-up earlier ones?

I spent dedicated time to understand the shortcomings of shouldLowerLDSToStruct() and tried to overcome it in this patch. Please take a look at newly added tests.

hsmhsm updated this revision to Diff 349453.Jun 2 2021, 10:15 PM

Rebased.

rampitec added inline comments.Jun 3 2021, 1:23 PM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
64

.equals(). I don't think == does what you want.

122–123

It seems to miss the case when a same constant used by multiple kernels if I am not missing something. We could fix it by converting constant to an instruction but not doing it yet. You even had a utility function for that in some other patch.

130

.equals().

hsmhsm updated this revision to Diff 349766.Jun 3 2021, 9:53 PM

Fixed review comments by Stas.

hsmhsm marked 3 inline comments as done.Jun 3 2021, 10:01 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
122–123

That is correct - The intention of this patch is to fix bug within shouldLowerLDSToStruct() by making sure that it will properly analyse the uses of LDS to decide if it is to be lowered or not (for both module lds lowering and kernel lds lowering), and also to make the code bit programmer freindly for understanding it.

I implemented a separate patch to handle constants users of LDS for kernel lds lowering (not yet uploaded). I was trying resolve a minor bug where in some cases, LDS was not getting erased from the module. But, now looking at your new patch https://reviews.llvm.org/D103655, looks like I was missing the statement - GV->removeDeadConstantUsers();

But, anyway, let's review the patch https://reviews.llvm.org/D103655.

rampitec added inline comments.Jun 3 2021, 10:26 PM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
122–123

Yeah, it took me a good couple of hours to come up with this statement to fix that.

hsmhsm updated this revision to Diff 349783.Jun 4 2021, 1:05 AM
hsmhsm marked an inline comment as done.

Rebased.

If the current logic of shouldLowerLDSToStruct() is more robust compere to earlier ones, then, why cannot we use it, instead of patching-up earlier ones?

By current, you mean the code in this patch? It's not obvious to me that this patch is correct. I'm suspicious of branchy code with lots of comments. It looks like other people are already reviewing it, so hopefully they'll catch any other shortcomings.

foad added inline comments.Jun 4 2021, 6:24 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
64

No, == should be fine. Both == and equals take two StringRefs and do a full string comparison on them.

64

Don't need braces around a single line.

73

A GlobalVariable is a Constant, so the first isa<> check is redundant.

rampitec added inline comments.Jun 4 2021, 11:58 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
64

I do not see StringRef::operator==() in the documentation?

hsmhsm updated this revision to Diff 350107.Jun 6 2021, 8:03 AM

Fixed review comments by @foad.

hsmhsm marked 4 inline comments as done.Jun 6 2021, 8:08 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
64

@rampitec Here is the StringRef::operator==() - https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/ADT/StringRef.h#L897

It infact internally calls .equals() only.

Anyway, I think, it is fine either way except that .equals() is bit verbose, but, I kept .equals() only for now.

hsmhsm updated this revision to Diff 350272.Jun 7 2021, 6:54 AM
hsmhsm marked an inline comment as done.

Rebased.

This probably will need rebase on top of D103655 which presumably shall simplify shouldLowerLDSToStruct() logic?

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
64

Thanks!

hsmhsm updated this revision to Diff 350511.Jun 7 2021, 11:52 PM

Rebased, and tried my best to minimize the code changes within shouldLowerLDSToStruct()
based on recent patch https://reviews.llvm.org/D103655.

hsmhsm marked an inline comment as done.Jun 7 2021, 11:53 PM

This probably will need rebase on top of D103655 which presumably shall simplify shouldLowerLDSToStruct() logic?

Done.

hsmhsm updated this revision to Diff 350512.Jun 7 2021, 11:57 PM

Fixed incorrect and incomplete comment.

hsmhsm updated this revision to Diff 350517.Jun 8 2021, 12:42 AM

UsedList is no longer required to collect "to be lowered" lds.

hsmhsm updated this revision to Diff 350527.Jun 8 2021, 1:17 AM

If the user of LDS is neither global variable nor instruction, then it should be a constant.

hsmhsm updated this revision to Diff 350547.Jun 8 2021, 2:48 AM

explicit "continue" statement at the end of loop is redundant.

Can you add tests for GlobalAlias? It is not immediately clear if these are handled properly.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
76

You do not need to cast here, just append U->users().

98–104

stripPointerCasts() still makes sense.

124

No need to cast, you are adding all users anyway. You might want an assert(isa<Constant>(V)) though.

134

How can you get here? You should have inspected all the users by this time and met Instruction uses if any.

hsmhsm updated this revision to Diff 350775.Jun 8 2021, 9:33 PM

Fixed few review comments by Stas.

hsmhsm marked 4 inline comments as done.EditedJun 8 2021, 9:46 PM

Can you add tests for GlobalAlias? It is not immediately clear if these are handled properly.

I have not yet verified the GlobalAlias. Let me look into it.

GlobalAlias is handled and new lit test is added. Please have a look at lower-module-lds-global-alias.ll.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
98–104

No, as I understand it, we should be very carefull when using stripPointerCasts(), otherwise, it will lead to unexpected and surprising behavior.

For exmaple, in this case, let's try below simple example. Here, only user of "@lds.1" is bitcast inst within kernel. Now, when we are analysing this user, we apply stripPointerCasts(), which will return back "@lds.1" itself, and since it is a global variable, the IF condition satisfies, and we incorrectly land within IF block, where we did not want, which leads to incorrect transormation.

@lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1

define amdgpu_kernel void @k0() {
  %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
  store i8 1, i8 addrspace(3)* %bc, align 1
  ret void
}
134

Please note that the global(s) that we are trying to inspect here are not lds (GV) itself, but, they are the ones which use lds (GV) as an initializer in the global scope or they are special llvm.used. We collect such globals which use lds (GV), and decide whether to lower or not based on the use of those globals.

hsmhsm updated this revision to Diff 350790.Jun 8 2021, 11:33 PM
hsmhsm marked 2 inline comments as done.

Added new lit-test for GlobalAlias - lower-module-lds-global-alias.ll.

rampitec added inline comments.Jun 9 2021, 10:20 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
98–104

Ack.

134

Please note that the global(s) that we are trying to inspect here are not lds (GV) itself, but, they are the ones which use lds (GV) as an initializer in the global scope or they are special llvm.used. We collect such globals which use lds (GV), and decide whether to lower or not based on the use of those globals.

Ack. Makes sense.

rampitec accepted this revision.Jun 9 2021, 10:27 AM

Looks good to me now.

This revision is now accepted and ready to land.Jun 9 2021, 10:27 AM
This revision was landed with ongoing or failed builds.Jun 9 2021, 8:10 PM
This revision was automatically updated to reflect the committed changes.
foad added inline comments.Jun 10 2021, 1:28 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
65

For future reference, SetVector can be useful in cases like this, so that you don't need to declare both a set and a vector and keep them in sync.

134

Could "return true" early here if any G has instruction users, or use std::any_of which will return early for you.

hsmhsm added inline comments.Jun 10 2021, 2:39 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
134

Correct. Since it does not break any semantics, I will keep it for now, and make suggested code change when I submit any other related patch in near future (we definetely need few more patches sooner than later).

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
125

After this patch, all users are added. Before it, only those that were not already in the Visited set were added. On the face of it, this seems to discard the guard against cycles. This patch also leaves the Visited set dead (values are inserted into it, but never read).

I don't see any discussion in this patch about removing that guard and it's strange that the set was left in place if that was intentional. Is there a reason cycles cannot occur here?

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
125

We might be OK - uses can definitely be cyclic, e.g.
uint64_t * __attribute__((address_space(3))) loc = (void*)&loc;
in which case this logic does indeed loop forever, but I don't think it's possible to construct an expression that hits that which gets past the isa<UndefValue> check.