This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Target/AMDGPU/
    AMDGPU.h
    AMDGPULowerModuleLDSPass.cpp
    AMDGPUReplaceLDSUseWithPointer.cpp
    AMDGPUTargetMachine.cpp
    CMakeLists.txt
llvm/lib/Target/AMDGPU/Utils/
    AMDGPULDSUtils.h
    AMDGPULDSUtils.cpp
llvm/test/CodeGen/AMDGPU/
    llc-pipeline.ll
    replace-lds-by-ptr-call-diamond-shape.ll
    replace-lds-by-ptr-call-selected_functions.ll
    replace-lds-by-ptr-ignore-global-scope-use.ll
    replace-lds-by-ptr-ignore-inline-asm-call.ll
    replace-lds-by-ptr-ignore-kernel-only-used-lds.ll
    replace-lds-by-ptr-ignore-not-reachable-lds.ll
    replace-lds-by-ptr-ignore-small-lds.ll
    replace-lds-by-ptr-indirect-call-diamond-shape.ll
    replace-lds-by-ptr-indirect-call-selected_functions.ll
    replace-lds-by-ptr-indirect-call-signature-match.ll
    replace-lds-by-ptr-lds-offsets.ll
    replace-lds-by-ptr-use-multiple-lds.ll
    replace-lds-by-ptr-use-same-lds.ll
    replace-lds-by-ptr-use-within-const-expr1.ll
    replace-lds-by-ptr-use-within-const-expr2.ll
    replace-lds-by-ptr-use-within-phi-inst.ll

Differential D103225

[AMDGPU] Replace non-kernel function uses of LDS globals by pointers.
ClosedPublic

Authored by hsmhsm on May 26 2021, 10:14 PM.

Details

Reviewers

JonChesterfield
arsenm
b-sumner
t-tye
rampitec
ronlieb

Commits

rG80fd5fa5269c: [AMDGPU] Replace non-kernel function uses of LDS globals by pointers.

Summary

The main motivation for replacing LDS uses within non-kernel functions by
pointers is to *avoid* having the subsequent LDS lowering pass pack the LDS
globals themselves (assume large LDS) directly into a struct type, which would
otherwise allocate a huge struct instance within every kernel.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hsmhsm created this revision.May 26 2021, 10:14 PM

Herald added subscribers: foad, kerbowa, hiraditya and 7 others. · View Herald TranscriptMay 26 2021, 10:14 PM

hsmhsm requested review of this revision.May 26 2021, 10:14 PM

Herald added a project: Restricted Project. · View Herald TranscriptMay 26 2021, 10:14 PM

Herald added subscribers: llvm-commits, wdng. · View Herald Transcript

I am planning to take up the handling of *bit-casted function pointers* (while collecting reachable callees for kernels) as a separate patch on top of this one.

The reasons are:

(1) *Not* handling *bit-casted function pointers* does not break any functionality. When we fail to collect reachable callees because of this, the lowering phase directly lowers the LDS instead of pointers, so nothing breaks from a semantics perspective.
(2) I want this base patch to be upstreamed first, before handling corner cases. It will also help us mitigate the LDS memory overhead to a greater extent when the ModuleLDSLowering code ships in the coming ROCm releases.

Harbormaster completed remote builds in B106434: Diff 348162.May 26 2021, 10:31 PM

Fixed a few clang-tidy warnings.

hsmhsm added a reviewer: ronlieb.May 27 2021, 12:24 AM

Harbormaster completed remote builds in B106447: Diff 348176.May 27 2021, 12:27 AM

Fix pre-merge check failure notified at https://reviews.llvm.org/harbormaster/unit/view/755898/

Herald added a subscriber: nikic. · View Herald TranscriptMay 27 2021, 1:13 AM

Harbormaster completed remote builds in B106449: Diff 348180.May 27 2021, 1:57 AM

foad added inline comments.May 27 2021, 2:34 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
47	You say "all reachable ... nodes" but then you only iterate over an SCC, which is different, and will miss some reachable nodes. If you really want all reachable nodes then use depth_first_ext which will do it all for you. (See EliminateUnreachableBlocks in BasicBlockUtils.cpp for an example of how to use it.)

hsmhsm added inline comments.May 27 2021, 4:46 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
47	The comment is a bit vague. What I actually meant is: "In a call graph, for a given (caller) node, collect all reachable callee nodes". Here, reachable callee nodes include those called within the callees of the caller, and so on. For example, say f1 calls f2, f2 calls f3, and f3 calls f4. Then the reachable callee set for f1 is {f2, f3, f4}. So, given f1, won't the SCC iterator collect {f2, f3, f4} for me? What are the cases that it cannot handle?

foad added inline comments.May 27 2021, 5:01 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
47	See https://en.wikipedia.org/wiki/Strongly_connected_component f4 will only be in the same SCC as f1 if there is a path from f1 to f4 and back to f1 again.

hsmhsm added inline comments.May 27 2021, 5:45 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
47	But that is fine, it does not matter. What we are fundamentally doing here is this: we collect all SCCs reachable from the given kernel, then we collect all the nodes within those SCCs. Actually, I am not sure I understand your point. Let me think about it further, in case I am missing something obvious.

hsmhsm added inline comments.May 27 2021, 6:09 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
47	Further, I experimented with the graph that you mentioned:

f1 calls f2
f1 calls f4
f2 calls f3
f4 calls f1

For this graph, yes, we end up with three SCCs and we collect the nodes from all three SCCs. Hence we have collected all reachable callees from f1.

Start node: f1
SCC1: f3
SCC2: f2
SCC3: f4 f1
Reachable nodes from f1: f3 f2 f4 f1
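
The distinction discussed in this thread can be illustrated with a small, generic sketch. It is not the patch's code and does not use LLVM's scc_iterator; the function names and the two example graphs are hypothetical. It computes the set of nodes reachable from a start node by depth-first search, and separately lists the SCCs found by Tarjan's algorithm when the traversal starts at that node. Under these assumptions, the SCC that contains f1 is generally smaller than the reachable set, while the union of all SCCs visited from f1 equals it.

```python
def reachable_from(graph, start):
    """Iterative depth-first search: all nodes reachable from start (start included)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


def sccs_from(graph, start):
    """Tarjan's algorithm started at `start`; SCCs are returned callees-first."""
    index, low, on_stack, stack, result = {}, {}, set(), [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of an SCC: pop the whole component off the stack.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            result.append(component)

    visit(start)
    return result


# The graph from the comment above: f1 -> f2, f1 -> f4, f2 -> f3, f4 -> f1.
CYCLE_GRAPH = {"f1": ["f2", "f4"], "f2": ["f3"], "f3": [], "f4": ["f1"]}
# The earlier chain example: f1 -> f2 -> f3 -> f4.
CHAIN_GRAPH = {"f1": ["f2"], "f2": ["f3"], "f3": ["f4"], "f4": []}

if __name__ == "__main__":
    print(sccs_from(CYCLE_GRAPH, "f1"))       # three SCCs: {f3}, {f2}, {f1, f4}
    print(sccs_from(CHAIN_GRAPH, "f1"))       # four singleton SCCs
    print(reachable_from(CHAIN_GRAPH, "f1"))  # all four functions
```

In the chain graph the SCC containing f1 is just {f1}, which is the case the SCC definition warns about; collecting the nodes of every SCC visited from f1 still yields {f1, f2, f3, f4}.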

Note the failure: LLVM.CodeGen/AMDGPU::llc-pipeline.ll

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
353	It is not strictly required. I would just add it from TM.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
123	auto UsedList. Fewer updates if the return type changes.
136	It is a local variable, no need to clear it. It will be destroyed.
141	Require is too strong a word here. Not beneficial.
152	return !LDSToNonKernels[GV].empty();
169	Technically it can be less. Some ASICs have 32K.
263	Can you give these more readable names? Entry, Entry2... are not descriptive names.
310	That's a copy of the set; you could use a reference.

Fixed a few of the review comments by Stas.

hsmhsm marked 8 inline comments as done.May 30 2021, 9:45 PM

In D103225#2785585, @rampitec wrote:

Note the failure: LLVM.CodeGen/AMDGPU::llc-pipeline.ll

Yes, I am monitoring it. I do not yet understand why it is failing and need to look into it further.

In D103225#2785687, @rampitec wrote:

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

Harbormaster completed remote builds in B106864: Diff 348724.May 30 2021, 10:19 PM

Rebased.

Harbormaster completed remote builds in B106915: Diff 348799.May 31 2021, 8:21 AM

Rebased.

Harbormaster completed remote builds in B107014: Diff 348935.Jun 1 2021, 5:05 AM

Fixed clang-tidy warning.

Harbormaster completed remote builds in B107019: Diff 348943.Jun 1 2021, 5:58 AM

Rebased.

Harbormaster completed remote builds in B107075: Diff 349021.Jun 1 2021, 12:26 PM

In D103225#2785687, @rampitec wrote:

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

I do not remember, but we can probably omit it since it is a single store to read-only memory. Anyway, a confirmation from @t-tye would be nice.

arsenm added inline comments.Jun 1 2021, 2:58 PM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
29	Typo runnning
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
148–151	Why not check this first?
201	The compiler should never introduce ptrtoint. This should offset from a base address with getelementptr
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
246	This should really be a generic utility function

Fixed a few of the review comments by Matt.

Herald added a subscriber: dexonsmith. · View Herald TranscriptJun 1 2021, 10:03 PM

hsmhsm marked 3 inline comments as done.Jun 1 2021, 10:09 PM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
201	Actually, I am wondering how to do it. "inttoptr" can be replaced with getelementptr. On the other hand, how do I achieve it for "ptrtoint"? getelementptr gives a pointer value, not an integer value, and what I want here is an integer value (a pointer converted to an integer). So, how do I get an integer value from getelementptr?

Harbormaster completed remote builds in B107178: Diff 349175.Jun 1 2021, 10:39 PM

Fixed missing case of lds use within phi instruction.

Harbormaster completed remote builds in B107268: Diff 349301.Jun 2 2021, 10:08 AM

When constant users become empty, drop references from constant.

Harbormaster completed remote builds in B107279: Diff 349318.Jun 2 2021, 11:00 AM

Reverted previous fix (drop references from constant) and rebased.

Harbormaster completed remote builds in B107291: Diff 349333.Jun 2 2021, 12:01 PM

Rebased.

Harbormaster completed remote builds in B107381: Diff 349451.Jun 2 2021, 10:22 PM

rampitec added inline comments.Jun 3 2021, 1:32 PM

llvm/include/llvm/IR/ReplaceConstant.h
31 ↗	(On Diff #349451)	Should this utility be reviewed and submitted separately since it is common code?

rampitec mentioned this in D103655: [AMDGPU] Handle constant LDS uses from different kernels.Jun 3 2021, 4:09 PM

hsmhsm mentioned this in D103661: [IR] Add utility to convert constant expression operands (of an instruction) to instructions..Jun 3 2021, 6:28 PM

hsmhsm marked an inline comment as done.Jun 3 2021, 6:30 PM

hsmhsm added inline comments.

llvm/include/llvm/IR/ReplaceConstant.h
31 ↗	(On Diff #349451)	Pushed new patch at https://reviews.llvm.org/D103661.

hsmhsm added a parent revision: D103661: [IR] Add utility to convert constant expression operands (of an instruction) to instructions..Jun 3 2021, 9:40 PM

Rebased.

Harbormaster completed remote builds in B107907: Diff 350176.Jun 6 2021, 11:04 PM

JonChesterfield added inline comments.Jun 7 2021, 6:16 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
153	Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels. Such a variable will still be allocated in all the kernels so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit.

In D103225#2792240, @rampitec wrote:

In D103225#2785687, @rampitec wrote:

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

I do not remember, but we can probably omit it since it is a single store to read-only memory. Anyway, a confirmation from @t-tye would be nice.

For the record, the agreed way is to do a store from lane 0 of each wave and follow with a wave barrier.

hsmhsm mentioned this in rG3af5f3e69247: [IR] Add utility to convert constant expression operands (of an instruction) to….Jun 7 2021, 2:53 PM

In D103225#2803733, @rampitec wrote:

In D103225#2792240, @rampitec wrote:

In D103225#2785687, @rampitec wrote:

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

I do not remember, but we can probably omit it since it is a single store to read-only memory. Anyway, a confirmation from @t-tye would be nice.

For the record, the agreed way is to do a store from lane 0 of each wave and follow with a wave barrier.

I agree with @rampitec, although we did suggest measuring to confirm that the work-group barrier and a single wave's lane 0 is not faster than multiple waves' lane 0 and a wave barrier.

We also observed that using a wave barrier is UB in the language memory model, although well defined in the AMDGPU hardware memory model. However, the current AMD GPU sync-scope definition implies the language rules and so would be an issue if any future atomic optimizations exploited that.

Rebased.

Harbormaster completed remote builds in B108127: Diff 350487.Jun 7 2021, 10:09 PM

Two approaches for limiting the stores to lane 0 of each wave:

(1) Write 1 to the exec mask, store, and write -1 to the exec mask. This works since the exec mask at the start of the wave, when this happens, is -1.
(2) Check for lane == 0 and branch. The lane can be computed by
    a) wave64: __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u))
    b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)

rampitec added inline comments.Jun 8 2021, 12:39 PM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
151	If it is empty it probably will not be returned by findVariablesToLower()?
153	Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels. Such a variable will still be allocated in all the kernels so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit. Sounds like a good TODO at the very least.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42	We had problems with null before. Instruction selection will happily allocate it overlapping with first non-null variable. That's how AMDGPULowerModuleLDSPass turned to use struct instead of null. Did you check we are not introducing this again in the final ISA?

In D103225#2805419, @b-sumner wrote:

Two approaches for limiting the stores to lane 0 of each wave:

(1) Write 1 to the exec mask, store, and write -1 to the exec mask. This works since the exec mask at the start of the wave, when this happens, is -1.

(2) Check for lane == 0 and branch. The lane can be computed by
    a) wave64: __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u))
    b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)

OK, let's see, which one is more feasible from the implementation point of view.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
151	That is true. But there is a catch. Consider the case below. Here "@lds.1" is used within non-kernel function scope, but it is also used as the initializer of the global @gptr.1. So we cannot pointer-replace it. In this case, collectNonKernelAccessorsOfLDS() returns an empty set.

@lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
@gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8

define void @f0() {
  %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
  store i8 1, i8 addrspace(3)* %bc, align 1
  ret void
}

define amdgpu_kernel void @k0() {
  call void @f0()
  ret void
}

153	It is on my TODO list. It needs some careful thought before this case can be handled properly. That is the reason I had not yet replied to Jon's comment.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42	The null represents the base address of LDS memory. My understanding is that, earlier, instruction selection went wrong because of a bug where we were not actually allocating any memory for llvm.amdgcn.module.lds, but were accessing it at address 0. Instruction selection was then allocating address 0 to some other LDS (since this address was not occupied), so there was an issue. Nevertheless, let me look into the ISA and see whether we are fine or whether null causes any issue.

hsmhsm marked an inline comment as done.Jun 9 2021, 7:25 AM

Let only lane 0 from each wave do lds pointer initialization, followed by a wave barrier.

In D103225#2807161, @hsmhsm wrote:

In D103225#2805419, @b-sumner wrote:

Two approaches for limiting the stores to lane 0 of each wave:

(1) Write 1 to the exec mask, store, and write -1 to the exec mask. This works since the exec mask at the start of the wave, when this happens, is -1.

(2) Check for lane == 0 and branch. The lane can be computed by
    a) wave64: __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u))
    b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)

OK, let's see, which one is more feasible from the implementation point of view.

Implemented approach (2). Here we actually do not need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, __builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)).

hsmhsm marked an inline comment as done.Jun 9 2021, 8:20 AM

hsmhsm added inline comments.

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll

Further, I looked into the ISA and it looks good to me. For example, consider the input code below.

@lds.1 = internal addrspace(3) global i32 undef, align 4
@lds.2 = internal addrspace(3) global i64 undef, align 4

define internal void @func_uses_lds() {
entry:
  store i32 7, i32 addrspace(3)* @lds.1
  store i64 31, i64 addrspace(3)* @lds.2
  ret void
}

define protected amdgpu_kernel void @kernel_reaches_lds() {
entry:
  call void @func_uses_lds()
  ret void
}

output from pointer-replacement is:

@lds.1 = internal addrspace(3) global i32 undef, align 4
@lds.2 = internal addrspace(3) global i64 undef, align 4
@lds.1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
@lds.2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

define internal void @func_uses_lds() {
entry:
  %0 = load i16, i16 addrspace(3)* @lds.2.ptr, align 2
  %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
  %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
  %3 = load i16, i16 addrspace(3)* @lds.1.ptr, align 2
  %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
  %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
  store i32 7, i32 addrspace(3)* %5, align 4
  store i64 31, i64 addrspace(3)* %2, align 4
  ret void
}

define protected amdgpu_kernel void @kernel_reaches_lds() {
entry:
  %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %1 = icmp eq i32 %0, 0
  br i1 %1, label %2, label %3

2:                                                ; preds = %entry
  store i16 ptrtoint (i64 addrspace(3)* @lds.2 to i16), i16 addrspace(3)* @lds.2.ptr, align 2
  store i16 ptrtoint (i32 addrspace(3)* @lds.1 to i16), i16 addrspace(3)* @lds.1.ptr, align 2
  br label %3

3:                                                ; preds = %entry, %2
  call void @llvm.amdgcn.wave.barrier()
  call void @func_uses_lds()
  ret void
}

output from module (kernel) lds lowering is:

%llvm.amdgcn.module.lds.t = type { i16, i16 }
%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t = type { i64, i32 }

@llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 2
@llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
@llvm.amdgcn.kernel.kernel_reaches_lds.lds = internal addrspace(3) global %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t undef, align 8

define internal void @func_uses_lds() {
entry:
  %0 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
  %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
  %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
  %3 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
  %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
  %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
  store i32 7, i32 addrspace(3)* %5, align 4
  store i64 31, i64 addrspace(3)* %2, align 4
  ret void
}

define protected amdgpu_kernel void @kernel_reaches_lds() {
entry:
  call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
  %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %1 = icmp eq i32 %0, 0
  br i1 %1, label %2, label %5

2:                                                ; preds = %entry
  %3 = ptrtoint i64 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 0) to i16
  store i16 %3, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
  %4 = ptrtoint i32 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 1) to i16
  store i16 %4, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
  br label %5

5:                                                ; preds = %entry, %2
  call void @llvm.amdgcn.wave.barrier()
  call void @func_uses_lds()
  ret void
}

what it fundamentally means is:

address 0 holds the address of @lds.1: we need to load the address of @lds.1 from address 0, and then store the value 7 to @lds.1.
address 0 + 2 (offset 2) holds the address of @lds.2: we need to load the address of @lds.2 from address 0 + 2, and then store the value 31 to @lds.2.

That is what happens in the ISA below, which is produced for the above input IR.

func_uses_lds:                          ; @func_uses_lds
; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        v_mov_b32_e32 v1, 0
        ds_read_i16 v2, v1
        ds_read_i16 v3, v1 offset:2
        v_mov_b32_e32 v4, 7
        v_mov_b32_e32 v0, 31
        s_waitcnt lgkmcnt(1)
        ds_write_b32 v2, v4
        s_waitcnt lgkmcnt(1)
        ds_write_b64 v3, v[0:1]
        s_waitcnt lgkmcnt(0)
        s_setpc_b64 s[30:31]
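
As a rough illustration of the indirection described above, the following sketch models LDS as a byte array. It is only a toy model under stated assumptions: the module struct { i16, i16 } sits at address 0 as discussed in this review, while the address chosen for the kernel struct { i64, i32 } (KERNEL_STRUCT_ADDR = 8) and all Python names are hypothetical and are not taken from the patch or from real allocator output.

```python
import struct

LDS_SIZE = 64                 # assumed size of the toy LDS, in bytes
MODULE_STRUCT_ADDR = 0        # llvm.amdgcn.module.lds: { i16, i16 } at address 0
KERNEL_STRUCT_ADDR = 8        # assumed address of the kernel struct { i64, i32 }
LDS2_ADDR = KERNEL_STRUCT_ADDR + 0   # field 0: the i64 (@lds.2)
LDS1_ADDR = KERNEL_STRUCT_ADDR + 8   # field 1: the i32 (@lds.1)


def kernel_prologue(lds):
    """Pointer initialization done by lane 0: store 16-bit addresses."""
    struct.pack_into("<H", lds, MODULE_STRUCT_ADDR + 0, LDS1_ADDR)  # slot 0 -> @lds.1
    struct.pack_into("<H", lds, MODULE_STRUCT_ADDR + 2, LDS2_ADDR)  # slot 1 -> @lds.2


def func_uses_lds(lds):
    """Non-kernel function: load each 16-bit address, then store through it."""
    (addr1,) = struct.unpack_from("<H", lds, MODULE_STRUCT_ADDR + 0)  # read at offset 0
    (addr2,) = struct.unpack_from("<H", lds, MODULE_STRUCT_ADDR + 2)  # read at offset 2
    struct.pack_into("<i", lds, addr1, 7)    # 32-bit store of 7 to @lds.1
    struct.pack_into("<q", lds, addr2, 31)   # 64-bit store of 31 to @lds.2


def run_kernel():
    lds = bytearray(LDS_SIZE)
    kernel_prologue(lds)
    func_uses_lds(lds)
    return lds


if __name__ == "__main__":
    memory = run_kernel()
    print(struct.unpack_from("<i", memory, LDS1_ADDR)[0])  # value stored to @lds.1
    print(struct.unpack_from("<q", memory, LDS2_ADDR)[0])  # value stored to @lds.2
```

The two 16-bit slots are the only data that every kernel has to carry; the i32 and i64 themselves live in the struct of the kernel that actually reaches func_uses_lds.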

Harbormaster completed remote builds in B108419: Diff 350901.Jun 9 2021, 8:46 AM

In D103225#2808121, @hsmhsm wrote:

Implemented approach (2). Here we actually do not need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, __builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)).

As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.

@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (__builtin_amdgcn_mbcnt_lo(~0u, 0u) + __builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
151	That is true. But there is a catch. Consider the case below. Here "@lds.1" is used within non-kernel function scope, but it is also used as the initializer of the global @gptr.1. So we cannot pointer-replace it. In this case, collectNonKernelAccessorsOfLDS() returns an empty set.

@lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
@gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8

define void @f0() {
  %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
  store i8 1, i8 addrspace(3)* %bc, align 1
  ret void
}

define amdgpu_kernel void @k0() {
  call void @f0()
  ret void
}

Shouldn't this be fixed by D103431?
153	You can leave a TODO comment here for now. Patch is already too big, so it will need a separate change anyway.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42	So the idea is llvm.amdgcn.module.lds will only contain converted pointers and since it is forced to be allocated at address zero these pointers will fall into the allocated space? We need to be super careful here because if anything else will be injected into the llvm.amdgcn.module.lds by the module lds pass it will all break. In fact there seems to be no work for module lds after this for the module, only for kernels. You probably can just create module structure right here. Then module lds should skip run on module (but not on kernels) if that variable already exists. It will also untangle dependency between passes. In addition you will resolve issue with null as you would just use that structure and Matt's concern about ptrtoint.

rampitec added inline comments.Jun 9 2021, 11:32 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
121	Do you really need this new map? You can get a pointer to block from KernelToLDSPointers by getting a parent of the first instruction.

In D103225#2808671, @rampitec wrote:

In D103225#2808121, @hsmhsm wrote:

Implemented approach (2). Here we actually do not need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, __builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)).

As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.

@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (__builtin_amdgcn_mbcnt_lo(~0u, 0u) + __builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.

Anyhow, manual says:

Example to compute each thread's position in 0..63:
v_mbcnt_lo_u32_b32 v0, -1, 0
v_mbcnt_hi_u32_b32 v0, -1, v0
// v0 now contains ThreadPosition

So it tells to nest it. Plus will fail on lane 32 as far as I understand.

rampitec added inline comments.Jun 9 2021, 12:28 PM

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16	This likely should not be i16. You could do it this way: @lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i8 addrspace(3)* undef, align 2 Then use bitcast to i8 addrspace(3)* instead of ptrtoint on store. But I still think that creating module lds right here is better.

In D103225#2808734, @rampitec wrote:

In D103225#2808671, @rampitec wrote:

In D103225#2808121, @hsmhsm wrote:

Implemented approach (2). Here we actually do not need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, __builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)).

As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.

Actually, in wave64 mode, for any thread >= 32, mbcnt_lo returns 32, not 0. Practically speaking we can do without mbcnt_hi, but maybe it is good and safe to have mbcnt_hi for wave64. I will make changes accordingly.

@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (__builtin_amdgcn_mbcnt_lo(~0u, 0u) + __builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.

Anyhow, manual says:
Example to compute each thread's position in 0..63:
v_mbcnt_lo_u32_b32 v0, -1, 0
v_mbcnt_hi_u32_b32 v0, -1, v0
// v0 now contains ThreadPosition
So it tells to nest it. Plus will fail on lane 32 as far as I understand.

No, it does not fail on lane 32. For lane 32, v_mbcnt_lo returns 32, and you pass this return value from v_mbcnt_lo to v_mbcnt_hi. v_mbcnt_hi will not add any additional count (on top of 32), hence it returns 32 again, which is the lane position for lane 32.
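
The lane arithmetic debated above can be checked with a small software model. This is a sketch, not hardware-verified code: it assumes each instruction counts the set bits of the mask that belong to lanes below the current lane (the low 32 lanes for mbcnt_lo, the high 32 lanes for mbcnt_hi) and adds the second operand, which is consistent with the manual example quoted in this thread. The Python function names and the explicit lane parameter are illustrative; on hardware the lane is implicit.

```python
MASK32 = 0xFFFFFFFF


def thread_mask(lane):
    """Bit i is set for every lane i strictly below `lane`."""
    return (1 << lane) - 1


def mbcnt_lo(mask, base, lane):
    """Model of v_mbcnt_lo_u32_b32: count mask bits among lanes 0..31 below `lane`."""
    low_bits = thread_mask(lane) & MASK32
    return (bin(mask & low_bits).count("1") + base) & MASK32


def mbcnt_hi(mask, base, lane):
    """Model of v_mbcnt_hi_u32_b32: count mask bits among lanes 32..63 below `lane`."""
    high_bits = (thread_mask(lane) >> 32) & MASK32
    return (bin(mask & high_bits).count("1") + base) & MASK32


def lane_id(lane):
    """Nested form from the manual example: hi(-1, lo(-1, 0))."""
    return mbcnt_hi(MASK32, mbcnt_lo(MASK32, 0, lane), lane)


if __name__ == "__main__":
    for lane in (0, 1, 31, 32, 40, 63):
        lo_only = mbcnt_lo(MASK32, 0, lane)
        summed = lo_only + mbcnt_hi(MASK32, 0, lane)
        print(lane, lo_only, lane_id(lane), summed)
```

Under this model mbcnt_lo alone saturates at 32 for lanes 32..63, so it is zero only for lane 0, and the nested form and the sum of the two calls both reproduce the lane position.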

Rebased.

hsmhsm marked an inline comment as done.Jun 9 2021, 9:21 PM

Harbormaster completed remote builds in B108535: Diff 351054.Jun 9 2021, 10:15 PM

Insert @llvm.amdgcn.mbcnt.hi() for wave64 mode.

practically speaking we can do without mbcnt_hi, but maybe it is good and safe to have mbcnt_hi for wave64

I am baffled by this. Why would you add an extra instruction when you have just explained that it is not required?

Harbormaster completed remote builds in B108556: Diff 351086.Jun 10 2021, 2:11 AM

hsmhsm marked 4 inline comments as done.Jun 10 2021, 2:31 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
121	I think we need it. Consider the transformed code below. Here we need to track the newly introduced block st, since this is where we need to insert the store instructions. And I am not sure how I can get it from KernelToLDSPointers.

define protected amdgpu_kernel void @k0() {
entry:
  %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %1 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %0)
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %st, label %ft

st:
  store i16 ptrtoint ([4 x i32] addrspace(3)* @lds to i16), i16 addrspace(3)* @lds.ptr, align 2
  br label %ft

ft:
  call void @llvm.amdgcn.wave.barrier()
  call void @f0()
  ret void
}

151	The patch https://reviews.llvm.org/D103431 fixed cases where a few globally used LDS were not getting lowered. Now, here is the basic idea. The LowerModuleLDS pass lowers LDS that are used in global scope, in non-kernel function scope, or in both. It packs the LDS directly into a struct, which we want to avoid since it wastes memory. So, in this pointer-replacement patch, we try to replace uses of LDS by pointers so that the LowerModuleLDS pass ends up packing pointers instead of the LDS themselves. However, we can only pointer-replace those LDS that are used within non-kernel function scope, not the ones that are used in global scope.

Now, in the above example, assume that we create a pointer to @lds.1 since it is used within non-kernel function scope, and replace its use within the function by the pointer. Then the LowerModuleLDS pass sees the use of @lds.1 (in global scope) and the use of the pointer within the function. Hence it lowers both the LDS and its pointer, meaning it packs both @lds.1 and its pointer into the struct, which is a waste. Hence this pass does not touch LDS that are used both in global scope and in non-kernel function scope, and leaves them to be lowered directly by the LowerModuleLDS pass, meaning they get packed directly into the struct. I assume such cases are not frequent in practice.

153	I have kept it as a TODO and will come back to it later in a separate patch.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16	The main goal of using i16 as the pointer type (with ptrtoint and GEP) is to save as much memory as possible. Using "i8 addrspace(3)*" means we are going to allocate the memory size of a pointer (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of the i16 type here. Also, I do not think it is a good idea to create the module LDS right here. If that were the case, we could have put this code within the module LDS pass itself instead of having a separate pass (I did that initially; later we decided that the pointer-replacement code should go in as a separate pass). This pass is already too big. Let's not change the design again and complicate things further, which would further delay code submission and lead to other serious problems.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42	As I mentioned in one of my earlier comments, we still need llvm.amdgcn.module.lds. We cannot remove it for now. The main reason is that, for a given use of LDS within some non-kernel function, we need to make sure that the appropriate LDS memory is accessed based on which kernel called that function. We have not yet solved this issue. The currently available solution is llvm.amdgcn.module.lds. But we should try to avoid wasting memory through llvm.amdgcn.module.lds, hence this pointer-replacement pass.

What this pass does:
(1) Create an i16 pointer "lds.ptr" to "lds", where "lds" is used in some non-kernel function.
(2) Within the kernel: lds.ptr = ptrtoint(lds)
(3) Within the non-kernel function where lds is used: lds.addr = GEP(null, lds.ptr) // where null is the base address of the LDS memory, and we use GEP here to avoid inttoptr. Then replace the use of "lds" by "lds.addr" after appropriate bitcasting.

What it means for the LowerModuleLDS pass:
(1) The use of "lds" is moved to the kernel (pointer initialization), hence it is properly allocated within the kernel as part of the kernel-specific struct and, more importantly, it is allocated only if there is a call from the kernel to the function. Otherwise there is no pointer initialization, hence no use, hence no allocation.
(2) It sees the use of "lds.ptr" within the non-kernel function, and hence it lowers this pointer by putting it within the module struct.

Apart from that, we still allocate memory for llvm.amdgcn.module.lds within every kernel, but now we are not wasting too much memory. In any case, within every kernel, llvm.amdgcn.module.lds is allocated at address 0 and all the other variables (belonging to the kernel) are allocated at non-zero addresses. So I do not see any problem here, unless I am missing some rare but serious corner case that conflicts with address 0.
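The store and load halves of this scheme can be modelled on the host with a flat byte array standing in for LDS. This is a toy sketch, not the pass's code: `ToyLds`, `storeOffset` and `loadAddress` are hypothetical names, and it assumes that every LDS offset fits in 16 bits, which the i16 pointer type presupposes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a workgroup's LDS: a flat byte array whose base address is 0,
// matching the "null" base used by the GEP in the comment above.
struct ToyLds {
  std::vector<std::uint8_t> bytes;
  explicit ToyLds(std::size_t size) : bytes(size, 0) {
    assert(size <= 65536 && "offsets must fit in 16 bits");
  }
};

// Kernel side: models `lds.ptr = ptrtoint(lds)` by writing the 16-bit offset
// of the LDS variable into the 2-byte slot that holds the pointer.
void storeOffset(ToyLds &lds, std::size_t slot, std::uint16_t target) {
  lds.bytes[slot] = static_cast<std::uint8_t>(target & 0xff);
  lds.bytes[slot + 1] = static_cast<std::uint8_t>(target >> 8);
}

// Non-kernel side: models `lds.addr = GEP(null, lds.ptr)` by loading the
// offset and adding it to base address 0 to recover the variable's address.
std::uint8_t *loadAddress(ToyLds &lds, std::size_t slot) {
  std::uint16_t offset = static_cast<std::uint16_t>(
      lds.bytes[slot] | (lds.bytes[slot + 1] << 8));
  return lds.bytes.data() + offset;
}
```

In this model the non-kernel function only ever touches the 2-byte slot directly, so a struct that packs the slot costs 2 bytes per LDS variable instead of the variable's full size.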

In D103225#2809859, @foad wrote:

practically speaking we can do without mbcnt_hi, but may be it is good and safe to have mbcnt_hi for wave64

I am baffled by this. Why would you add an extra instruction when you have just explained that it is not required?

Correct, but, I myself could not confidently prove that it is always safe to use only mbcnt_lo even for wave64 mode, hence had to insert mbcnt_hi for wave64 mode.

I believe that for the purposes of detecting lane 0 mbcnt_lo is sufficient.

In D103225#2809846, @hsmhsm wrote:

Insert @llvm.amdgcn.mbcnt.hi() for wave64 mode.

Actually I think you were right. mbcnt_lo will always return -1 for any lane higher than 31.

rampitec added inline comments.Jun 10 2021, 11:31 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
121	Ok.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16	> The main goal of using i16 as pointer (with ptrtoint and GEP) is to save memory as much as possible. Using "i8 addrspace(3)*" means we are going to allocate memory size for pointer (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of i16 type here.

Should not sizeof local pointer be 2? Checked the DL: p3:32:32. Sigh!

> Also, I do not think, it is a good idea to create module lds right here. If that was the case, we could have put this code within module LDS pass itself instead of having separate pass (I did it initially, later we decided that pointer-replacement code should go as separate pass). This pass is already too big, let's not change the design again and further complicate the stuff which will further delay code submit, and which leads to other serious problems.

My concern is that you assume these i16 pointers will start at null, but that might not be the case if you rely on the module LDS pass to pack them into the struct along with other LDS and let it sort it out. Can you add a test where you have these pointers and also another module LDS which has higher alignment and size, so that module LDS will pack them together? We could then inspect the output of the 2 passes and the ISA.

[1] Do not insert mbcnt_hi for wave64 mode, mbcnt_lo itself is enough to detect lane 0 even in wave64 mode.
[2] Add a new lit test as asked by Stas.

In D103225#2810460, @b-sumner wrote:

I believe that for the purposes of detecting lane 0 mbcnt_lo is sufficient.

Patch is again updated accordingly.

In D103225#2811100, @rampitec wrote:

In D103225#2809846, @hsmhsm wrote:

Insert @llvm.amdgcn.mbcnt.hi() for wave64 mode.

Actually I think you were right. mbcnt_lo will always return -1 for any lane higher than 31.

Now I again updated the patch - mbcnt_lo itself is enough to detect lane 0 even in wave64 mode.
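The lane-0 argument can be checked with a small host-side model. This is a sketch based on my reading of the instruction's documented behaviour (count the mask bits that belong to lower-numbered lanes, restricted to the low 32 lanes, then add the base value); it is an assumption rather than something taken from the patch, and `mbcntLo` and `isLaneZero` are hypothetical helper names.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Model of llvm.amdgcn.mbcnt.lo(mask, base) as evaluated by one lane: count
// the bits of `mask` that belong to lanes below `lane`, considering only the
// low 32 lanes, and add `base`.
std::uint32_t mbcntLo(std::uint32_t mask, std::uint32_t base, unsigned lane) {
  std::uint32_t below = lane >= 32 ? 0xffffffffu : ((1u << lane) - 1u);
  return base +
         static_cast<std::uint32_t>(std::bitset<32>(mask & below).count());
}

// Lane-0 test as discussed in the thread: mbcnt.lo(-1, 0) == 0.
bool isLaneZero(unsigned lane) { return mbcntLo(0xffffffffu, 0, lane) == 0; }
```

Under this model only lane 0 sees a count of zero; every other lane, including lanes 32 to 63 of a wave64, sees a nonzero count, so the comparison against zero does not need mbcnt_hi.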

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16	No, the concern you raise here does not arise at all. Please have a look at the new test case, replace-lds-by-ptr-lds-offsets.ll. I have given a detailed description within this test, and I hope it will clarify your doubt above.

Harbormaster completed remote builds in B109050: Diff 351785.Jun 14 2021, 12:38 AM

hsmhsm marked 7 inline comments as done.Jun 14 2021, 4:08 AM

rampitec added inline comments.Jun 14 2021, 12:11 PM

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16	Thanks, that helps. Tracked pointers and stores there, it seems to be consistent. Matt raised a concern that IR passes may refuse to produce a 'noalias' answer in presence of null, but that is another concern which we may be able to solve later. In particular any operation on such pointer is invariant and noalias in a given function, essentially it is just a single load, so we may just annotate it accordingly. Not required for this patch though.

LGTM. Please allow at least 24 hours for others to comment too before submitting. In the meanwhile PSDB and ePSDB runs will be useful.

This revision is now accepted and ready to land.Jun 14 2021, 12:36 PM

Closed by commit rG80fd5fa5269c: [AMDGPU] Replace non-kernel function uses of LDS globals by pointers. (authored by hsmhsm). Jun 20 2021, 11:23 PM

This revision was automatically updated to reflect the committed changes.

hsmhsm added a commit: rG80fd5fa5269c: [AMDGPU] Replace non-kernel function uses of LDS globals by pointers..

foad added inline comments.Jun 21 2021, 1:28 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
51	There is no need to find strongly connected components. This should use a df_iterator instead.
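A plain depth-first walk with a visited set is enough to collect reachable callees. The sketch below uses standard containers and a hypothetical `CallGraph` map rather than LLVM's `df_iterator` or `CallGraph` classes, so it illustrates the idea only and is not the code from the patch or from D104704.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call graph: caller name -> names of directly called functions.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Collect every function reachable from `root` with an iterative depth-first
// walk. A visited set is enough to handle recursion and diamond-shaped call
// graphs; no strongly connected components are computed.
std::set<std::string> reachableCallees(const CallGraph &graph,
                                       const std::string &root) {
  std::set<std::string> visited;
  std::vector<std::string> stack{root};
  while (!stack.empty()) {
    std::string fn = stack.back();
    stack.pop_back();
    auto it = graph.find(fn);
    if (it == graph.end())
      continue; // declaration or unknown function: no outgoing edges
    for (const std::string &callee : it->second)
      if (visited.insert(callee).second)
        stack.push_back(callee);
  }
  return visited;
}
```

Each function is pushed at most once, so the walk terminates on cyclic graphs. The root itself appears in the result only if something reachable calls it back.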

foad added inline comments.Jun 22 2021, 5:24 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
51	D104704

This may be the root cause of recent crashes. Reported to me offline, posting the stack trace here. @ronlieb may have a repro from different code.

llvm-project/llvm/include/llvm/ADT/Optional.h:96: T& llvm::optional_detail::OptionalStorage<T, <anonymous> >::getValue() & [with T = llvm::WeakTrackingVH; bool <anonymous> = false]: Assertion `hasVal' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.	Program arguments: $HOME/clang/bin/llc /tmp/zgemm-bb1d9d-gfx908-linked-b8cb9f.bc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -filetype=obj -o /tmp/zgemm-bb1d9d-gfx908-2e0957.o
1.	Running pass 'Replace within non-kernel function use of LDS with pointer' on module '/tmp/zgemm-bb1d9d-gfx908-linked-b8cb9f.bc'.
 #0 0x00007f1e8ed35ebf PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x00007f1e8ed33879 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f1e93a112d0 __restore_rt (/lib64/libpthread.so.0+0x132d0)
 #3 0x00007f1e8dec5520 raise (/lib64/libc.so.6+0x39520)
 #4 0x00007f1e8dec6b01 abort (/lib64/libc.so.6+0x3ab01)
 #5 0x00007f1e8debdb1a __assert_fail_base (/lib64/libc.so.6+0x31b1a)
 #6 0x00007f1e8debdb92 (/lib64/libc.so.6+0x31b92)
 #7 0x00007f1e8d753ad4 llvm::AMDGPU::CollectReachableCallees::collectReachableCallees(llvm::Function*) ($HOME/clang/lib/libLLVMAMDGPUUtils.so.13git+0x1fad4)
 #8 0x00007f1e8d753e27 llvm::AMDGPU::collectReachableCallees(llvm::Module&, llvm::DenseMap<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u>, llvm::DenseMapInfo<llvm::Function*>, llvm::detail::DenseMapPair<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u> > >&) ($HOME/clang/lib/libLLVMAMDGPUUtils.so.13git+0x1fe27)
 #9 0x00007f1e9cc2bf67 (anonymous namespace)::ReplaceLDSUseImpl::replaceLDSUse() AMDGPUReplaceLDSUseWithPointer.cpp:0:0
#10 0x00007f1e9cc2eec8 (anonymous namespace)::AMDGPUReplaceLDSUseWithPointer::runOnModule(llvm::Module&) AMDGPUReplaceLDSUseWithPointer.cpp:0:0
#11 0x00007f1e8f40f4e8 llvm::legacy::PassManagerImpl::run(llvm::Module&) ($HOME/clang/lib/libLLVMCore.so.13git+0x2174e8)
#12 0x0000000000416615 compileModule(char**, llvm::LLVMContext&) llc.cpp:0:0
#13 0x000000000040d596 main ($HOME/clang/bin/llc+0x40d596)
#14 0x00007f1e8deb034a __libc_start_main (/lib64/libc.so.6+0x2434a)

seeing this assert in trunk today in pass

llvm::AMDGPU::CollectReachableCallees::collectReachableCallees

rlieberm@r4:~/git/aomp12/aomp/test/smoke/helloworld$ /home/rlieberm/rocm/trunk_1.0bin/clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 helloworld.c -o helloworld
llc: /work/rlieberm/mono-repo/llvm-project/llvm/include/llvm/ADT/Optional.h:96: T& llvm::optional_detail::OptionalStorage<T, <anonymous> >::getValue() & [with T = llvm::WeakTrackingVH; bool <anonymous> = false]: Assertion `hasVal' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: /home/rlieberm/rocm/trunk_1.0bin/llc /tmp/helloworld-9e3621-gfx906-linked-3dda69.bc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -filetype=obj -o /tmp/helloworld-9e3621-gfx906-1499ac.o

1. Running pass 'Replace within non-kernel function use of LDS with pointer' on module '/tmp/helloworld-9e3621-gfx906-linked-3dda69.bc'.
 #0 0x000055e8771bf08f PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x000055e8771bc8dd SignalHandler(int) Signals.cpp:0:0
 #2 0x00007efe2440d980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
 #3 0x00007efe230befb7 raise /build/glibc-S7xCS9/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0
 #4 0x00007efe230c0921 abort /build/glibc-S7xCS9/glibc-2.27/stdlib/abort.c:81:0
 #5 0x00007efe230b048a __assert_fail_base /build/glibc-S7xCS9/glibc-2.27/assert/assert.c:89:0
 #6 0x00007efe230b0502 (/lib/x86_64-linux-gnu/libc.so.6+0x30502)
 #7 0x000055e8774c87c2 llvm::AMDGPU::CollectReachableCallees::collectReachableCallees(llvm::Function*) (/home/rlieberm/rocm/trunk_1.0bin/llc+0x20b27c2)
 #8 0x000055e8774c8b1f llvm::AMDGPU::collectReachableCallees(llvm::Module&, llvm::DenseMap<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u>, llvm::DenseMapInfo<llvm::Function*>, llvm::detail::DenseMapPair<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u> > >&) (/home/rlieberm/rocm/trunk_1.0bin/llc+0x20b2b1f)

instructions to build aomp from trunk and run test
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/tmp/trunk_1.0 \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DLLVM_CCACHE_BUILD=$OPTION_CCACHE \
  -DLLVM_ENABLE_RUNTIMES="openmp" \
  -DOPENMP_ENABLE_LIBOMPTARGET_HSA=1 \
  -DCLANG_DEFAULT_LINKER=lld \
  ../llvm -GNinja

ninja install

git clone http://github.com/rocm-developer-tools/aomp
cd aomp/test/smoke/helloworld
/tmp/trunk_1.0//bin/clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 helloworld.c -o helloworld

those clone instructions will yield a bunch of code, including some makefile machinery and the following C:

// helloworld.c
#include <stdio.h>
#include <omp.h>

int main(void) {
  int isHost = 1;

#pragma omp target map(tofrom: isHost)
  {
    isHost = omp_is_initial_device();
    printf("Hello world. %d\n", 100);
    for (int i =0; i<5; i++) {
      printf("Hello world. iteration %d\n", i);
    }
  }

  printf("Target region executed on the %s\n", isHost ? "host" : "device");

  return isHost;
}

I am going to revert this change as the author is unlikely to be available at present (early morning Saturday), in accordance with the guidelines at https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy. It'll reapply easily once the assert is run to ground.

It breaks a lot of the smoke tests in aomp. It's also been hit externally. It reproduces on a minimal openmp example (presumably because our runtime does things with shared memory). I don't see how this change could have passed epsdb, so reverting it also helps with the merge from trunk into rocm.

edit: doesn't revert cleanly. Considering whether to leave trunk broken or not.

Git revert is unclean, gone with D104962 instead. That changes the pass to disabled by default, changes the test header for these tests to re-enable it, deletes one test where that strategy failed.

I am seeing a run-time failure as well in addition to the compile-time failure reported earlier. Here's a test case, veccopy.c.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main()
{

int N = 100000;

int a[N];
int b[N];

int i;

for (i=0; i<N; i++){
  a[i]=0;
  b[i]=i;
}

#pragma omp target teams distribute parallel for map(from: a[0:N]) map(to: b[0:N])

{
  for (int j = 0; j< N; j++)
    a[j]=b[j];
}

int rc = 0;
for (i=0; i<N; i++)
  if (a[i] != b[i] ) {
    rc++;
    printf ("Wrong value: a[%d]=%d\n", i, a[i]);
  }

if (!rc){
  printf("Success\n");
  return EXIT_SUCCESS;
} else{
  printf("Failure\n");
  return EXIT_FAILURE;
}

}

Compile/run:
clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 veccopy.c -o veccopy
./veccopy
[GPU Memory Error] Addr: 0x7f60a4210000 Reason: Page not present or supervisor privilege.
Memory access fault by GPU node-1 (Agent handle: 0x168c5d0) on address 0x7f60a4210000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Adding -mllvm -amdgpu-enable-lds-replace-with-pointer=false gets rid of the problem.
Running with env var LIBOMPTARGET_KERNEL_TRACE=1
shows LDS went down from 16400B to 32B but it triggered the above problem.

W.r.t compile time failure, it might be because of the patch D104704 which is added on top of this patch.

W.r.t run time issues, I am surprised that our internal psdb job did not catch it. We really need to look into it.

psdb (our classic testing) has zero OpenMP tests.
The ePSDB does have OpenMP testing.

A Jenkins request to run ePSDB on a Phab patch would be needed to get ePSDB testing.

After the patch has landed in trunk, the twice-daily merges from trunk to amd-stg-open will run both a psdb and ePSDB to validate the set of patches which might get merged.
If the merge testing in psdb/ePSDB fails to pass, the merge does not land in amd-stg-open.
I think our merge process is about to try and digest the LDS patch.

Jon, Dhruva, I added your two tests to our smoke-fails this morning. I see both failing as you described with a trunk build. I see both passing in this morning's amd-stg-open with the pass enabled, and with the pass disabled.

In D103225#2842126, @hsmhsm wrote:

W.r.t compile time failure, it might be because of the patch D104704 which is added on top of this patch.

W.r.t run time issues, I am surprised that our internal psdb job did not catch it. We really need to look into it.

Given the call stack, there is a very good chance it is D104704. Can we try to run that w/o D104704 and with LDS pointer replacement enabled?

Here's a .ll testcase reduced from the openmp helloworld thing:

; ModuleID = 'z.ll'
source_filename = "llvm-link"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn-amd-amdhsa"

@usedSlotIdx = internal local_unnamed_addr addrspace(3) global i32 undef, align 4

define weak amdgpu_kernel void @__omp_offloading_10302_1a22e53_main_l7() {
entry:
  %nvptx_num_threads = call i32 @__kmpc_amdgcn_gpu_num_threads()
  ret void
}

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0

declare i32 @__kmpc_amdgcn_gpu_num_threads()

define internal void @__kmpc_kernel_init() {
entry:
  store i32 undef, i32 addrspace(3)* @usedSlotIdx, align 4, !tbaa !0
  ret void
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p5i8(i64 immarg, i8 addrspace(5)* nocapture) #1

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p5i8(i64 immarg, i8 addrspace(5)* nocapture) #1

attributes #0 = { nounwind readnone speculatable willreturn }
attributes #1 = { argmemonly nofree nosync nounwind willreturn }

!0 = !{!1, !1, i64 0}
!1 = !{!"int", !2, i64 0}
!2 = !{!"omnipotent char", !3, i64 0}
!3 = !{!"Simple C++ TBAA"}

Compile with: llc -march=amdgcn -mcpu=gfx906 -o /dev/null z.ll

It bisects to this patch D103225, not D104704.

hsmhsm mentioned this in D107329: [AMDGPU] Ignore call graph node which does not have function info..Aug 3 2021, 8:04 PM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPU.h	9 lines
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp	7 lines
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp	460 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	19 lines
llvm/lib/Target/AMDGPU/CMakeLists.txt	1 line
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h	20 lines
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp	185 lines
llvm/test/CodeGen/AMDGPU/llc-pipeline.ll	5 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll	88 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll	130 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-global-scope-use.ll	53 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-inline-asm-call.ll	30 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-kernel-only-used-lds.ll	25 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-not-reachable-lds.ll	28 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-small-lds.ll	31 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-diamond-shape.ll	95 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-selected_functions.ll	151 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-signature-match.ll	94 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-lds-offsets.ll	214 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-multiple-lds.ll	66 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-same-lds.ll	53 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr1.ll	54 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr2.ll	58 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-phi-inst.ll	93 lines

Diff 353276

llvm/lib/Target/AMDGPU/AMDGPU.h

(65 lines not shown)
 FunctionPass *createAMDGPUSimplifyLibCallsPass(const TargetMachine *);
 FunctionPass *createAMDGPUUseNativeCallsPass();
 FunctionPass *createAMDGPUCodeGenPreparePass();
 FunctionPass *createAMDGPULateCodeGenPreparePass();
 FunctionPass *createAMDGPUMachineCFGStructurizerPass();
 FunctionPass *createAMDGPUPropagateAttributesEarlyPass(const TargetMachine *);
 ModulePass *createAMDGPUPropagateAttributesLatePass(const TargetMachine *);
 FunctionPass *createAMDGPURewriteOutArgumentsPass();
+ModulePass *createAMDGPUReplaceLDSUseWithPointerPass();
 ModulePass *createAMDGPULowerModuleLDSPass();
 FunctionPass *createSIModeRegisterPass();

 struct AMDGPUSimplifyLibCallsPass : PassInfoMixin<AMDGPUSimplifyLibCallsPass> {
   AMDGPUSimplifyLibCallsPass(TargetMachine &TM) : TM(TM) {}
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);

 private:
(59 lines not shown)
 struct AMDGPUPropagateAttributesLatePass
     : PassInfoMixin<AMDGPUPropagateAttributesLatePass> {
   AMDGPUPropagateAttributesLatePass(TargetMachine &TM) : TM(TM) {}
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);

 private:
   TargetMachine &TM;
 };

+void initializeAMDGPUReplaceLDSUseWithPointerPass(PassRegistry &);
+extern char &AMDGPUReplaceLDSUseWithPointerID;
+
+struct AMDGPUReplaceLDSUseWithPointerPass
+    : PassInfoMixin<AMDGPUReplaceLDSUseWithPointerPass> {
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
 void initializeAMDGPULowerModuleLDSPass(PassRegistry &);
 extern char &AMDGPULowerModuleLDSID;

 struct AMDGPULowerModuleLDSPass : PassInfoMixin<AMDGPULowerModuleLDSPass> {
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 };

 void initializeAMDGPURewriteOutArgumentsPass(PassRegistry &);
(261 lines not shown)

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp

(18 lines not shown)
 // To reduce the memory overhead variables that are only used by kernels are
 // excluded from this transform. The analysis to determine whether a variable
 // is only used by a kernel is cheap and conservative so this may allocate
 // a variable in every kernel when it was not strictly necessary to do so.
 //
 // A possible future refinement is to specialise the structure per-kernel, so
 // that fields can be elided based on more expensive analysis.
 //
+// NOTE: Since this pass will directly pack LDS (assume large LDS) into a struct
+// type which would cause allocating huge memory for struct instance within
+// every kernel. Hence, before running this pass, it is advisable to run the
arsenm: Typo runnning
+// pass "amdgpu-replace-lds-use-with-pointer" which will replace LDS uses within
+// non-kernel functions by pointers and thereby minimizes the unnecessary per
+// kernel allocation of LDS memory.
+//
 //===----------------------------------------------------------------------===//

 #include "AMDGPU.h"
 #include "Utils/AMDGPUBaseInfo.h"
 #include "Utils/AMDGPULDSUtils.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/DerivedTypes.h"
(303 lines not shown)

 char &llvm::AMDGPULowerModuleLDSID = AMDGPULowerModuleLDS::ID;

 INITIALIZE_PASS(AMDGPULowerModuleLDS, DEBUG_TYPE,
                 "Lower uses of LDS variables from non-kernel functions", false,
                 false)

 ModulePass *llvm::createAMDGPULowerModuleLDSPass() {
   return new AMDGPULowerModuleLDS();
rampitec: It is not strictly required. I would just add it from TM.
 }

 PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M,
                                                 ModuleAnalysisManager &) {
   return AMDGPULowerModuleLDS().runOnModule(M) ? PreservedAnalyses::none()
                                                : PreservedAnalyses::all();
 }

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp

This file was added.

				//===-- AMDGPUReplaceLDSUseWithPointer.cpp --------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This pass replaces all the uses of LDS within non-kernel functions by
				// corresponding pointer counter-parts.
				//
				// The main motivation behind this pass is - to avoid subsequent LDS lowering
				// pass from directly packing LDS (assume large LDS) into a struct type which
				// would otherwise cause allocating huge memory for struct instance within every
				// kernel.
				//
				// Brief sketch of the algorithm implemented in this pass is as below:
				//
				// 1. Collect all the LDS defined in the module which qualify for pointer
				// replacement, say it is, LDSGlobals set.
				//
				// 2. Collect all the reachable callees for each kernel defined in the module,
				// say it is, KernelToCallees map.
				//
				// 3. FOR (each global GV from LDSGlobals set) DO
				//      LDSUsedNonKernels = Collect all non-kernel functions which use GV.
				//      FOR (each kernel K in KernelToCallees map) DO
				//        ReachableCallees = KernelToCallees[K]
				//        ReachableAndLDSUsedCallees =
				//            SetIntersect(LDSUsedNonKernels, ReachableCallees)
				//        IF (ReachableAndLDSUsedCallees is not empty) THEN
				//          Pointer = Create a pointer to point-to GV if not created.
				//          Initialize Pointer to point-to GV within kernel K.
				//        ENDIF
				//      ENDFOR
				//      Replace all uses of GV within non kernel functions by Pointer.
				//    ENDFOR
				//
				// LLVM IR example:
				//
				// Input IR:
				//
				// @lds = internal addrspace(3) global [4 x i32] undef, align 16
				//
				// define internal void @f0() {
				// entry:
				//   %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds,
				//          i32 0, i32 0
				//   ret void
				// }
				//
				// define protected amdgpu_kernel void @k0() {
				// entry:
				//   call void @f0()
				//   ret void
				// }
				//
				// Output IR:
				//
				// @lds = internal addrspace(3) global [4 x i32] undef, align 16
				// @lds.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				//
				// define internal void @f0() {
				// entry:
				//   %0 = load i16, i16 addrspace(3)* @lds.ptr, align 2
				//   %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				//   %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				//   %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2,
				//          i32 0, i32 0
				//   ret void
				// }
				//
				// define protected amdgpu_kernel void @k0() {
				// entry:
				//   store i16 ptrtoint ([4 x i32] addrspace(3)* @lds to i16),
				//         i16 addrspace(3)* @lds.ptr, align 2
				//   call void @f0()
				//   ret void
				// }
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "GCNSubtarget.h"
				#include "Utils/AMDGPUBaseInfo.h"
				#include "Utils/AMDGPULDSUtils.h"
				#include "llvm/ADT/DenseMap.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/ADT/SetOperations.h"
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/IR/Constants.h"
				#include "llvm/IR/DerivedTypes.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/InlineAsm.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/IntrinsicsAMDGPU.h"
				#include "llvm/IR/ReplaceConstant.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Pass.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Target/TargetMachine.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include "llvm/Transforms/Utils/ModuleUtils.h"
				#include <algorithm>
				#include <vector>

				#define DEBUG_TYPE "amdgpu-replace-lds-use-with-pointer"

				using namespace llvm;

				namespace {

				class ReplaceLDSUseImpl {
				Module &M;
				LLVMContext &Ctx;
				const DataLayout &DL;
				Constant *LDSMemBaseAddr;

				DenseMap<GlobalVariable *, GlobalVariable *> LDSToPointer;
				DenseMap<GlobalVariable *, SmallPtrSet<Function *, 8>> LDSToNonKernels;
				DenseMap<Function *, SmallPtrSet<Function *, 8>> KernelToCallees;
				rampitec: Do you really need this new map? You can get a pointer to block from KernelToLDSPointers by getting a parent of the first instruction.
				hsmhsm (author): I think we need it. Consider the transformed code below. Here we need to track the newly introduced block st, since this is where we need to insert store instructions, and I am not sure how I can get it from KernelToLDSPointers.
				    define protected amdgpu_kernel void @k0() {
				    entry:
				      %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				      %1 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %0)
				      %2 = icmp eq i32 %1, 0
				      br i1 %2, label %st, label %ft

				    st:
				      store i16 ptrtoint ([4 x i32] addrspace(3)* @lds to i16), i16 addrspace(3)* @lds.ptr, align 2
				      br label %ft

				    ft:
				      call void @llvm.amdgcn.wave.barrier()
				      call void @f0()
				      ret void
				    }
				rampitec: Ok.
				DenseMap<Function *, SmallPtrSet<GlobalVariable *, 8>> KernelToLDSPointers;
				DenseMap<Function *, BasicBlock *> KernelToInitBB;
				rampitec: auto UsedList. Less updates if return type will change.
				DenseMap<Function *, DenseMap<GlobalVariable *, Value *>>
				FunctionToLDSToReplaceInst;

				// Collect LDS which requires their uses to be replaced by pointer.
				std::vector<GlobalVariable *> collectLDSRequiringPointerReplace() {
				// Collect LDS which requires module lowering.
				std::vector<GlobalVariable *> LDSGlobals = AMDGPU::findVariablesToLower(M);

				// Remove LDS which don't qualify for replacement.
				LDSGlobals.erase(std::remove_if(LDSGlobals.begin(), LDSGlobals.end(),
				[&](GlobalVariable *GV) {
				return shouldIgnorePointerReplacement(GV);
				}),
				rampitec: It is local variable, no need to clean. It will be destroyed.
				LDSGlobals.end());

				return LDSGlobals;
				}

				rampitec: Require is too strong word here. Not beneficial.
				// Returns true if uses of given LDS global within non-kernel functions should
				// be keep as it is without pointer replacement.
				bool shouldIgnorePointerReplacement(GlobalVariable *GV) {
				// LDS whose size is very small and doesn`t exceed pointer size is not worth
				// replacing.
				if (DL.getTypeAllocSize(GV->getValueType()) <= 2)
				return true;

				// LDS which is not used from non-kernel function scope or it is used from
				// global scope does not qualify for replacement.
				arsenm: Why not check this first?
				rampitec: If it is empty it probably will not be returned by findVariablesToLower()?
				hsmhsm: That is true. But there is a catch. Consider the case below. Here @lds.1 is used within non-kernel function scope, but it is also used as the initializer of the global @gptr.1, so we cannot pointer-replace it. In this case, collectNonKernelAccessorsOfLDS() returns an empty set.

				@lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
				@gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8

				define void @f0() {
				  %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
				  store i8 1, i8 addrspace(3)* %bc, align 1
				  ret void
				}

				define amdgpu_kernel void @k0() {
				  call void @f0()
				  ret void
				}

				rampitec (replying to hsmhsm's example above): Should not it be fixed by D103431?
				hsmhsm: The patch https://reviews.llvm.org/D103431 fixed cases where a few globally used LDS were not getting lowered. Here is the basic idea. The LowerModuleLDS pass lowers LDS that are used in global scope, in non-kernel function scope, or in both, and it packs the LDS directly into a struct, which we want to avoid since it wastes memory. So in this pointer-replacement patch we replace uses of LDS by pointers, so that the LowerModuleLDS pass ends up packing pointers instead of the LDS themselves. However, we can only pointer-replace LDS that are used within non-kernel function scope, not the ones that are used in global scope.

				In the above example, assume that we create a pointer to @lds.1 because it is used within non-kernel function scope, and replace its use within the function by the pointer. The LowerModuleLDS pass then sees the use of @lds.1 (in global scope) and the use of the pointer within the function. Hence it lowers both, that is, it packs both @lds.1 and its pointer into the struct, which is a waste. Hence this pass does not touch LDS that are used both in global scope and in non-kernel function scope, and leaves them to be lowered directly by the LowerModuleLDS pass, that is, packed directly into the struct. I assume such cases are not frequent in practice.
				LDSToNonKernels[GV] = AMDGPU::collectNonKernelAccessorsOfLDS(GV);
				rampitec: return !LDSToNonKernels[GV].empty();
				return LDSToNonKernels[GV].empty();
				JonChesterfield: Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels. Such a variable will still be allocated in all the kernels, so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit.
				rampitec (replying to JonChesterfield's comment above): Sound like a good TODO at the very least.
				hsmhsm: It is on my TODO list. It requires some careful thought before handling this case. That is the reason I had not yet replied to Jon's comment.
				rampitec: You can leave a TODO comment here for now. Patch is already too big, so it will need a separate change anyway.
				hsmhsm: I have kept it as a TODO and will come to it later in a separate patch.

				// FIXME: When GV is used within all (or within most of the kernels), then
				// it does not make sense to create a pointer for it.
				}
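The trade-off discussed in the thread above (small LDS is not worth replacing, and an LDS used by every kernel gains nothing from a pointer) can be made concrete with a rough cost model. The sketch below is an assumption built from the review comments, not code from the patch: it supposes that direct packing charges the full variable to every kernel, while pointer replacement charges a 2-byte pointer to every kernel plus the variable itself only to kernels that reach it. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed cost of packing the variable directly into the module struct:
// every kernel in the module pays for the whole variable.
inline std::uint64_t directPackingCost(std::uint64_t NumKernels,
                                       std::uint64_t SizeInBytes) {
  return NumKernels * SizeInBytes;
}

// Assumed cost after pointer replacement: every kernel pays 2 bytes for the
// i16 pointer, and only the kernels that can reach a user of the variable
// (NumUsers) pay for the variable itself.
inline std::uint64_t pointerCost(std::uint64_t NumKernels,
                                 std::uint64_t NumUsers,
                                 std::uint64_t SizeInBytes) {
  assert(NumUsers <= NumKernels && "users are a subset of all kernels");
  const std::uint64_t PointerSize = 2;
  return NumKernels * PointerSize + NumUsers * SizeInBytes;
}
```

Under these assumptions, a 4096-byte variable reached by 1 of 10 kernels costs 4116 bytes with a pointer versus 40960 bytes packed directly, while the same variable reached by all 10 kernels costs 20 bytes more with the pointer than without it, which matches JonChesterfield's objection.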

				// Insert new global LDS pointer which points to LDS.
				GlobalVariable *createLDSPointer(GlobalVariable *GV) {
				// LDS pointer which points to LDS is already created? return it.
				auto PointerEntry = LDSToPointer.insert(std::make_pair(GV, nullptr));
				if (!PointerEntry.second)
				return PointerEntry.first->second;

				// We need to create new LDS pointer which points to LDS.
				//
				// Each CU owns at max 64K of LDS memory, so LDS address ranges from 0 to
				// 2^16 - 1. Hence 16 bit pointer is enough to hold the LDS address.
				rampitec: Technically it can be less. Some ASICs have 32K.
				auto *I16Ty = Type::getInt16Ty(Ctx);
				GlobalVariable *LDSPointer = new GlobalVariable(
				M, I16Ty, false, GlobalValue::InternalLinkage, UndefValue::get(I16Ty),
				GV->getName() + Twine(".ptr"), nullptr, GlobalVariable::NotThreadLocal,
				AMDGPUAS::LOCAL_ADDRESS);

				LDSPointer->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
				LDSPointer->setAlignment(AMDGPU::getAlign(DL, LDSPointer));

				// Mark that an associated LDS pointer is created for LDS.
				LDSToPointer[GV] = LDSPointer;

				return LDSPointer;
				}
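createLDSPointer() above, like activateLaneZero() and getReplacementInst() later in the patch, memoizes its result by inserting a placeholder and checking whether the insertion actually happened. A minimal standalone sketch of that pattern follows, assuming std::map in place of DenseMap and strings in place of IR objects; `PointerNameCache` is a hypothetical class used only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical cache keyed by variable name. The patch keys a DenseMap by
// GlobalVariable pointer instead.
class PointerNameCache {
  std::map<std::string, std::string> Cache;
  unsigned Created = 0;

public:
  // Return the cached pointer name for Name, creating it on first request.
  const std::string &getOrCreate(const std::string &Name) {
    // insert() leaves the map untouched if the key already exists; the bool
    // in the returned pair tells which of the two cases occurred.
    auto Entry = Cache.insert(std::make_pair(Name, std::string()));
    if (!Entry.second)
      return Entry.first->second; // created by an earlier call

    ++Created;
    Entry.first->second = Name + ".ptr"; // fill the placeholder in place
    return Entry.first->second;
  }

  unsigned numCreated() const { return Created; }
};
```

The sketch fills the placeholder through the iterator returned by insert(), whereas the patch performs a second lookup with operator[]. Both give the same result; the iterator form is only safe when nothing else modifies the map in between.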

				// Split entry basic block in such a way that only lane 0 of each wave does
				// the LDS pointer initialization, and return newly created basic block.
				BasicBlock *activateLaneZero(Function *K) {
				// If the entry basic block of kernel K is already split, then return
				// newly created basic block.
				auto BasicBlockEntry = KernelToInitBB.insert(std::make_pair(K, nullptr));
				if (!BasicBlockEntry.second)
				return BasicBlockEntry.first->second;

				// Split entry basic block of kernel K.
				auto *EI = &(*(K->getEntryBlock().getFirstInsertionPt()));
				IRBuilder<> Builder(EI);

				Value *Mbcnt =
				Builder.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_lo, {},
				{Builder.getInt32(-1), Builder.getInt32(0)});
				Value *Cond = Builder.CreateICmpEQ(Mbcnt, Builder.getInt32(0));
				arsenm: The compiler should never introduce ptrtoint. This should offset from a base address with getelementptr
				hsmhsm: Actually, I am wondering how to do it. "inttoptr" can be replaced with getelementptr. On the other hand, how do I achieve it for "ptrtoint"? getelementptr gives a pointer value, not an integer value, and what I want here is an integer value (the pointer converted to an integer). So, how do I get an integer value from getelementptr?
				Instruction *WB = cast<Instruction>(
				Builder.CreateIntrinsic(Intrinsic::amdgcn_wave_barrier, {}, {}));

				BasicBlock *NBB = SplitBlockAndInsertIfThen(Cond, WB, false)->getParent();

				// Mark that the entry basic block of kernel K is split.
				KernelToInitBB[K] = NBB;

				return NBB;
				}

				// Within given kernel, initialize given LDS pointer to point to given LDS.
				void initializeLDSPointer(Function *K, GlobalVariable *GV,
				GlobalVariable *LDSPointer) {
				// If LDS pointer is already initialized within K, then nothing to do.
				auto PointerEntry = KernelToLDSPointers.insert(
				std::make_pair(K, SmallPtrSet<GlobalVariable *, 8>()));
				if (!PointerEntry.second)
				if (PointerEntry.first->second.contains(LDSPointer))
				return;

				// Insert instructions at EI which initialize LDS pointer to point-to LDS
				// within kernel K.
				//
				// That is, convert pointer type of GV to i16, and then store this converted
				// i16 value within LDSPointer which is of type i16*.
				auto *EI = &(*(activateLaneZero(K)->getFirstInsertionPt()));
				IRBuilder<> Builder(EI);
				Builder.CreateStore(Builder.CreatePtrToInt(GV, Type::getInt16Ty(Ctx)),
				LDSPointer);

				// Mark that LDS pointer is initialized within kernel K.
				KernelToLDSPointers[K].insert(LDSPointer);
				}

				// We have created an LDS pointer for LDS, and initialized it to point-to LDS
				// within all relevant kernels. Now replace all the uses of LDS within
				// non-kernel functions by LDS pointer.
				void replaceLDSUseByPointer(GlobalVariable *GV, GlobalVariable *LDSPointer) {
				SmallVector<User *, 8> LDSUsers(GV->users());
				for (auto *U : LDSUsers) {
				// When `U` is a constant expression, it is possible that same constant
				// expression exists within multiple instructions, and within multiple
				// non-kernel functions. Collect all those non-kernel functions and all
				// those instructions within which `U` exist.
				auto FunctionToInsts =
				AMDGPU::getFunctionToInstsMap(U, false /*=CollectKernelInsts*/);

				for (auto FI = FunctionToInsts.begin(), FE = FunctionToInsts.end();
				FI != FE; ++FI) {
				Function *F = FI->first;
				auto &Insts = FI->second;
				for (auto *I : Insts) {
				// If `U` is a constant expression, then we need to break the
				// associated instruction into a set of separate instructions by
				// converting constant expressions into instructions.
				SmallPtrSet<Instruction *, 8> UserInsts;

				if (U == I) {
				// `U` is an instruction, conversion from constant expression to
				// set of instructions is not required.
				UserInsts.insert(I);
				rampitec: Can you name it somehow more readable? Entry, Entry2... Not speaking names.
				} else {
				// `U` is a constant expression, convert it into corresponding set
				// of instructions.
				auto *CE = cast<ConstantExpr>(U);
				convertConstantExprsToInstructions(I, CE, &UserInsts);
				}

				// Go through all the user instructions, if LDS exist within them as an
				// operand, then replace it by replace instruction.
				for (auto *II : UserInsts) {
				auto *ReplaceInst = getReplacementInst(F, GV, LDSPointer);
				II->replaceUsesOfWith(GV, ReplaceInst);
				}
				}
				}
				}
				}

				// Create a set of replacement instructions which together replace LDS within
				// non-kernel function F by accessing LDS indirectly using LDS pointer.
				Value *getReplacementInst(Function *F, GlobalVariable *GV,
				GlobalVariable *LDSPointer) {
				// If the instruction which replaces LDS within F is already created, then
				// return it.
				auto LDSEntry = FunctionToLDSToReplaceInst.insert(
				std::make_pair(F, DenseMap<GlobalVariable *, Value *>()));
				if (!LDSEntry.second) {
				auto ReplaceInstEntry =
				LDSEntry.first->second.insert(std::make_pair(GV, nullptr));
				if (!ReplaceInstEntry.second)
				return ReplaceInstEntry.first->second;
				}

				// Get the instruction insertion point within the beginning of the entry
				// block of current non-kernel function.
				auto *EI = &(*(F->getEntryBlock().getFirstInsertionPt()));
				IRBuilder<> Builder(EI);

				// Insert required set of instructions which replace LDS within F.
				auto *V = Builder.CreateBitCast(
				Builder.CreateGEP(
				LDSMemBaseAddr,
				Builder.CreateLoad(LDSPointer->getValueType(), LDSPointer)),
				GV->getType());

				// Mark that the replacement instruction which replace LDS within F is
				// created.
				rampitec: That's a copy of set, you could use reference.
				FunctionToLDSToReplaceInst[F][GV] = V;

				return V;
				}

				public:
				ReplaceLDSUseImpl(Module &M)
				: M(M), Ctx(M.getContext()), DL(M.getDataLayout()) {
				LDSMemBaseAddr = Constant::getIntegerValue(
				PointerType::get(Type::getInt8Ty(M.getContext()),
				AMDGPUAS::LOCAL_ADDRESS),
				APInt(32, 0));
				}

				// Entry-point function which interface ReplaceLDSUseImpl with outside of the
				// class.
				bool replaceLDSUse();

				private:
				// For a given LDS from collected LDS globals set, replace its non-kernel
				// function scope uses by pointer.
				bool replaceLDSUse(GlobalVariable *GV);
				};

				// For given LDS from collected LDS globals set, replace its non-kernel function
				// scope uses by pointer.
				bool ReplaceLDSUseImpl::replaceLDSUse(GlobalVariable *GV) {
				// Holds all those non-kernel functions within which LDS is being accessed.
				SmallPtrSet<Function *, 8> &LDSAccessors = LDSToNonKernels[GV];

				// The LDS pointer which points to LDS and replaces all the uses of LDS.
				GlobalVariable *LDSPointer = nullptr;

				// Traverse through each kernel K, check and if required, initialize the
				// LDS pointer to point to LDS within K.
				for (auto KI = KernelToCallees.begin(), KE = KernelToCallees.end(); KI != KE;
				++KI) {
				Function *K = KI->first;
				SmallPtrSet<Function *, 8> Callees = KI->second;

				// Compute reachable and LDS used callees for kernel K.
				set_intersect(Callees, LDSAccessors);

				// None of the LDS accessing non-kernel functions are reachable from
				// kernel K. Hence, no need to initialize LDS pointer within kernel K.
				if (Callees.empty())
				continue;

				// We have found reachable and LDS used callees for kernel K, and we need to
				// initialize LDS pointer within kernel K, and we need to replace LDS use
				// within those callees by LDS pointer.
				//
				// But, first check if LDS pointer is already created, if not create one.
				LDSPointer = createLDSPointer(GV);

				// Initialize LDS pointer to point to LDS within kernel K.
				initializeLDSPointer(K, GV, LDSPointer);
				}

				// We have not found reachable and LDS used callees for any of the kernels,
				// and hence we have not created LDS pointer.
				if (!LDSPointer)
				return false;

				// We have created an LDS pointer for LDS, and initialized it to point-to LDS
				// within all relevant kernels. Now replace all the uses of LDS within
				// non-kernel functions by LDS pointer.
				replaceLDSUseByPointer(GV, LDSPointer);

				return true;
				}

				// Entry-point function which interface ReplaceLDSUseImpl with outside of the
				// class.
				bool ReplaceLDSUseImpl::replaceLDSUse() {
				// Collect LDS which requires their uses to be replaced by pointer.
				std::vector<GlobalVariable *> LDSGlobals =
				collectLDSRequiringPointerReplace();

				// No LDS to pointer-replace. Nothing to do.
				if (LDSGlobals.empty())
				return false;

				// Collect reachable callee set for each kernel defined in the module.
				AMDGPU::collectReachableCallees(M, KernelToCallees);

				if (KernelToCallees.empty()) {
				// Either module does not have any kernel definitions, or none of the kernel
				// has a call to non-kernel functions, or we could not resolve any of the
				// call sites to proper non-kernel functions, because of the situations like
				// inline asm calls. Nothing to replace.
				return false;
				}

				// For every LDS from collected LDS globals set, replace its non-kernel
				// function scope use by pointer.
				bool Changed = false;
				for (auto *GV : LDSGlobals)
				Changed |= replaceLDSUse(GV);

				return Changed;
				}

				class AMDGPUReplaceLDSUseWithPointer : public ModulePass {
				public:
				static char ID;

				AMDGPUReplaceLDSUseWithPointer() : ModulePass(ID) {
				initializeAMDGPUReplaceLDSUseWithPointerPass(
				*PassRegistry::getPassRegistry());
				}

				bool runOnModule(Module &M) override;

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<TargetPassConfig>();
				}
				};

				} // namespace

				char AMDGPUReplaceLDSUseWithPointer::ID = 0;
				char &llvm::AMDGPUReplaceLDSUseWithPointerID =
				AMDGPUReplaceLDSUseWithPointer::ID;

				INITIALIZE_PASS_BEGIN(
				AMDGPUReplaceLDSUseWithPointer, DEBUG_TYPE,
				"Replace within non-kernel function use of LDS with pointer",
				false /*only look at the cfg*/, false /*analysis pass*/)
				INITIALIZE_PASS_DEPENDENCY(TargetPassConfig)
				INITIALIZE_PASS_END(
				AMDGPUReplaceLDSUseWithPointer, DEBUG_TYPE,
				"Replace within non-kernel function use of LDS with pointer",
				false /*only look at the cfg*/, false /*analysis pass*/)

				bool AMDGPUReplaceLDSUseWithPointer::runOnModule(Module &M) {
				ReplaceLDSUseImpl LDSUseReplacer{M};
				return LDSUseReplacer.replaceLDSUse();
				}

				ModulePass *llvm::createAMDGPUReplaceLDSUseWithPointerPass() {
				return new AMDGPUReplaceLDSUseWithPointer();
				}

				PreservedAnalyses
				AMDGPUReplaceLDSUseWithPointerPass::run(Module &M, ModuleAnalysisManager &AM) {
				ReplaceLDSUseImpl LDSUseReplacer{M};
				LDSUseReplacer.replaceLDSUse();
				return PreservedAnalyses::all();
				}

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

static cl::opt<bool> EnableScalarIRPasses(
cl::init(true),		cl::init(true),
cl::Hidden);		cl::Hidden);

static cl::opt<bool> EnableStructurizerWorkarounds(		static cl::opt<bool> EnableStructurizerWorkarounds(
"amdgpu-enable-structurizer-workarounds",		"amdgpu-enable-structurizer-workarounds",
cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),		cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),
cl::Hidden);		cl::Hidden);

		static cl::opt<bool> EnableLDSReplaceWithPointer(
		"amdgpu-enable-lds-replace-with-pointer",
		cl::desc("Enable LDS replace with pointer pass"), cl::init(true),
		cl::Hidden);

static cl::opt<bool, true> EnableLowerModuleLDS(		static cl::opt<bool, true> EnableLowerModuleLDS(
"amdgpu-enable-lower-module-lds", cl::desc("Enable lower module lds pass"),		"amdgpu-enable-lower-module-lds", cl::desc("Enable lower module lds pass"),
cl::location(AMDGPUTargetMachine::EnableLowerModuleLDS), cl::init(true),		cl::location(AMDGPUTargetMachine::EnableLowerModuleLDS), cl::init(true),
cl::Hidden);		cl::Hidden);

extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {		extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
// Register the target		// Register the target
RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());		RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());
extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeAMDGPUPreLegalizerCombinerPass(*PR);		initializeAMDGPUPreLegalizerCombinerPass(*PR);
initializeAMDGPURegBankCombinerPass(*PR);		initializeAMDGPURegBankCombinerPass(*PR);
initializeAMDGPUPromoteAllocaPass(*PR);		initializeAMDGPUPromoteAllocaPass(*PR);
initializeAMDGPUPromoteAllocaToVectorPass(*PR);		initializeAMDGPUPromoteAllocaToVectorPass(*PR);
initializeAMDGPUCodeGenPreparePass(*PR);		initializeAMDGPUCodeGenPreparePass(*PR);
initializeAMDGPULateCodeGenPreparePass(*PR);		initializeAMDGPULateCodeGenPreparePass(*PR);
initializeAMDGPUPropagateAttributesEarlyPass(*PR);		initializeAMDGPUPropagateAttributesEarlyPass(*PR);
initializeAMDGPUPropagateAttributesLatePass(*PR);		initializeAMDGPUPropagateAttributesLatePass(*PR);
		initializeAMDGPUReplaceLDSUseWithPointerPass(*PR);
initializeAMDGPULowerModuleLDSPass(*PR);		initializeAMDGPULowerModuleLDSPass(*PR);
initializeAMDGPURewriteOutArgumentsPass(*PR);		initializeAMDGPURewriteOutArgumentsPass(*PR);
initializeAMDGPUUnifyMetadataPass(*PR);		initializeAMDGPUUnifyMetadataPass(*PR);
initializeSIAnnotateControlFlowPass(*PR);		initializeSIAnnotateControlFlowPass(*PR);
initializeSIInsertHardClausesPass(*PR);		initializeSIInsertHardClausesPass(*PR);
initializeSIInsertWaitcntsPass(*PR);		initializeSIInsertWaitcntsPass(*PR);
initializeSIModeRegisterPass(*PR);		initializeSIModeRegisterPass(*PR);
initializeSIWholeQuadModePass(*PR);		initializeSIWholeQuadModePass(*PR);
PB.registerPipelineParsingCallback(
if (PassName == "amdgpu-printf-runtime-binding") {		if (PassName == "amdgpu-printf-runtime-binding") {
PM.addPass(AMDGPUPrintfRuntimeBindingPass());		PM.addPass(AMDGPUPrintfRuntimeBindingPass());
return true;		return true;
}		}
if (PassName == "amdgpu-always-inline") {		if (PassName == "amdgpu-always-inline") {
PM.addPass(AMDGPUAlwaysInlinePass());		PM.addPass(AMDGPUAlwaysInlinePass());
return true;		return true;
}		}
		if (PassName == "amdgpu-replace-lds-use-with-pointer") {
		PM.addPass(AMDGPUReplaceLDSUseWithPointerPass());
		return true;
		}
if (PassName == "amdgpu-lower-module-lds") {		if (PassName == "amdgpu-lower-module-lds") {
PM.addPass(AMDGPULowerModuleLDSPass());		PM.addPass(AMDGPULowerModuleLDSPass());
return true;		return true;
}		}
return false;		return false;
});		});
PB.registerPipelineParsingCallback(		PB.registerPipelineParsingCallback(
[this](StringRef PassName, FunctionPassManager &PM,		[this](StringRef PassName, FunctionPassManager &PM,
void AMDGPUPassConfig::addIRPasses() {
// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.		// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.
if (TM.getTargetTriple().getArch() == Triple::r600)		if (TM.getTargetTriple().getArch() == Triple::r600)
addPass(createR600OpenCLImageTypeLoweringPass());		addPass(createR600OpenCLImageTypeLoweringPass());

// Replace OpenCL enqueued block function pointers with global variables.		// Replace OpenCL enqueued block function pointers with global variables.
addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());		addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());

// Can increase LDS used by kernel so runs before PromoteAlloca		// Can increase LDS used by kernel so runs before PromoteAlloca
if (EnableLowerModuleLDS)		if (EnableLowerModuleLDS) {
		// The pass "amdgpu-replace-lds-use-with-pointer" need to be run before the
		// pass "amdgpu-lower-module-lds", and also it required to be run only if
		// "amdgpu-lower-module-lds" pass is enabled.
		if (EnableLDSReplaceWithPointer)
		addPass(createAMDGPUReplaceLDSUseWithPointerPass());

addPass(createAMDGPULowerModuleLDSPass());		addPass(createAMDGPULowerModuleLDSPass());
		}

if (TM.getOptLevel() > CodeGenOpt::None) {		if (TM.getOptLevel() > CodeGenOpt::None) {
addPass(createInferAddressSpacesPass());		addPass(createInferAddressSpacesPass());
addPass(createAMDGPUPromoteAlloca());		addPass(createAMDGPUPromoteAlloca());

if (EnableSROA)		if (EnableSROA)
addPass(createSROAPass());		addPass(createSROAPass());
if (EnableScalarIRPasses.getNumOccurrences()		if (EnableScalarIRPasses.getNumOccurrences()

llvm/lib/Target/AMDGPU/CMakeLists.txt

add_llvm_target(AMDGPUCodeGen
AMDGPUMIRFormatter.cpp		AMDGPUMIRFormatter.cpp
AMDGPUOpenCLEnqueuedBlockLowering.cpp		AMDGPUOpenCLEnqueuedBlockLowering.cpp
AMDGPUPostLegalizerCombiner.cpp		AMDGPUPostLegalizerCombiner.cpp
AMDGPUPreLegalizerCombiner.cpp		AMDGPUPreLegalizerCombiner.cpp
AMDGPUPromoteAlloca.cpp		AMDGPUPromoteAlloca.cpp
AMDGPUPropagateAttributes.cpp		AMDGPUPropagateAttributes.cpp
AMDGPURegBankCombiner.cpp		AMDGPURegBankCombiner.cpp
AMDGPURegisterBankInfo.cpp		AMDGPURegisterBankInfo.cpp
		AMDGPUReplaceLDSUseWithPointer.cpp
AMDGPURewriteOutArguments.cpp		AMDGPURewriteOutArguments.cpp
AMDGPUSubtarget.cpp		AMDGPUSubtarget.cpp
AMDGPUTargetMachine.cpp		AMDGPUTargetMachine.cpp
AMDGPUTargetObjectFile.cpp		AMDGPUTargetObjectFile.cpp
AMDGPUTargetTransformInfo.cpp		AMDGPUTargetTransformInfo.cpp
AMDGPUUnifyDivergentExitNodes.cpp		AMDGPUUnifyDivergentExitNodes.cpp
AMDGPUUnifyMetadata.cpp		AMDGPUUnifyMetadata.cpp
AMDGPUPerfHintAnalysis.cpp		AMDGPUPerfHintAnalysis.cpp

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h

	//===- AMDGPULDSUtils.h - LDS related helper functions -*- C++ -*----------===//			//===- AMDGPULDSUtils.h - LDS related helper functions -*- C++ -*----------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// AMDGPU LDS related helper utility functions.			// AMDGPU LDS related helper utility functions.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPULDSUTILS_H			#ifndef LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPULDSUTILS_H
	#define LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPULDSUTILS_H			#define LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPULDSUTILS_H

	#include "AMDGPU.h"			#include "AMDGPU.h"
				#include "llvm/ADT/DenseMap.h"
				#include "llvm/IR/Constants.h"

	namespace llvm {			namespace llvm {

	class ConstantExpr;			class ConstantExpr;

	namespace AMDGPU {			namespace AMDGPU {

				/// Collect reachable callees for each kernel defined in the module \p M and
				/// return collected callees at \p KernelToCallees.
				void collectReachableCallees(
				Module &M,
				DenseMap<Function *, SmallPtrSet<Function *, 8>> &KernelToCallees);

				/// For the given LDS global \p GV, visit all its users and collect all
				/// non-kernel functions within which \p GV is used and return collected list of
				/// such non-kernel functions.
				SmallPtrSet<Function *, 8> collectNonKernelAccessorsOfLDS(GlobalVariable *GV);

				/// Collect all the instructions where user \p U belongs to. \p U could be
				/// instruction itself or it could be a constant expression which is used within
				/// an instruction. If \p CollectKernelInsts is true, collect instructions only
				/// from kernels, otherwise collect instructions only from non-kernel functions.
				DenseMap<Function *, SmallPtrSet<Instruction *, 8>>
				getFunctionToInstsMap(User *U, bool CollectKernelInsts);

	bool isKernelCC(const Function *Func);			bool isKernelCC(const Function *Func);

	Align getAlign(DataLayout const &DL, const GlobalVariable *GV);			Align getAlign(DataLayout const &DL, const GlobalVariable *GV);

	/// \returns true if a given global variable \p GV (or its global users) appear			/// \returns true if a given global variable \p GV (or its global users) appear
	/// as an use within some instruction (either from kernel or from non-kernel).			/// as an use within some instruction (either from kernel or from non-kernel).
	bool hasUserInstruction(const GlobalValue *GV);			bool hasUserInstruction(const GlobalValue *GV);


llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp

	//===- AMDGPULDSUtils.cpp -------------------------------------------------===//			//===- AMDGPULDSUtils.cpp -------------------------------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// AMDGPU LDS related helper utility functions.			// AMDGPU LDS related helper utility functions.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "AMDGPULDSUtils.h"			#include "AMDGPULDSUtils.h"
	#include "Utils/AMDGPUBaseInfo.h"			#include "Utils/AMDGPUBaseInfo.h"
				#include "llvm/ADT/SCCIterator.h"
	#include "llvm/ADT/SetVector.h"			#include "llvm/ADT/SetVector.h"
				#include "llvm/Analysis/CallGraph.h"
	#include "llvm/IR/Constants.h"			#include "llvm/IR/Constants.h"
	#include "llvm/IR/ReplaceConstant.h"			#include "llvm/IR/ReplaceConstant.h"

	using namespace llvm;			using namespace llvm;

	namespace llvm {			namespace llvm {

	namespace AMDGPU {			namespace AMDGPU {

				// A helper class for collecting all reachable callees for each kernel defined
				// within the module.
				class CollectReachableCallees {
				Module &M;
				CallGraph CG;
				SmallPtrSet<CallGraphNode *, 8> AddressTakenFunctions;

				// Collect all address taken functions within the module.
				void collectAddressTakenFunctions() {
				auto *ECNode = CG.getExternalCallingNode();

				for (auto GI = ECNode->begin(), GE = ECNode->end(); GI != GE; ++GI) {
				auto *CGN = GI->second;
				auto *F = CGN->getFunction();
				if (!F || F->isDeclaration() || AMDGPU::isKernelCC(F))
				continue;
				AddressTakenFunctions.insert(CGN);
				}
				}

				// For a given caller node, collect all reachable callee nodes.
				foad: You say "all reachable ... nodes" but then you only iterate over an SCC, which is different and will miss some reachable nodes. If you really want all reachable nodes, use depth_first_ext, which will do it all for you. (See EliminateUnreachableBlocks in BasicBlockUtils.cpp for an example of how to use it.)
				hsmhsm (author): The comment is a bit vague. What I actually meant is: "In a call graph, for a given (caller) node, collect all reachable callee nodes". Here, reachable callee nodes include those called within the callees of the caller, and so on. For example, say f1 calls f2, f2 calls f3, and f3 calls f4. Then the reachable callee set for f1 is {f2, f3, f4}. So, given f1, won't the SCC iterator collect {f2, f3, f4} for me? What are the cases that it cannot handle?
				foad: See https://en.wikipedia.org/wiki/Strongly_connected_component. f4 will only be in the same SCC as f1 if there is a path from f1 to f4 and back to f1 again.
				hsmhsm (author): But that is fine; it does not matter. What we are fundamentally doing here is collecting all SCCs reachable from the given kernel and then collecting all the nodes within those SCCs. Actually, I am not sure I follow you here. Let me think about it further in case I am missing an obvious point.
				hsmhsm (author): Further, I experimented with the graph that you mentioned:
				    f1 calls f2
				    f1 calls f4
				    f2 calls f3
				    f4 calls f1
				For this graph, yes, we end up with three SCCs, and we collect the nodes from all three. Hence we have collected all callees reachable from f1.
				    Start node: f1
				    SCC1: f3
				    SCC2: f2
				    SCC3: f4 f1
				    Reachable nodes from f1: f3 f2 f4 f1
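The distinction discussed in this thread can be illustrated with a small standalone sketch. It does not use the LLVM CallGraph API; `CallGraphSketch`, `reachableCallees`, and `inSameSCC` are hypothetical names introduced only for this illustration. It shows that the set of callees reachable from a node is generally larger than that node's own SCC, while a depth-first walk over call edges yields the full reachable set directly.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy call graph (hypothetical type): caller name -> names of its callees.
using CallGraphSketch = std::map<std::string, std::vector<std::string>>;

// Collect every function reachable from Root by following call edges
// depth-first. Root itself appears in the result only if a cycle leads
// back to it.
std::set<std::string> reachableCallees(const CallGraphSketch &CG,
                                       const std::string &Root) {
  std::set<std::string> Reached;
  std::set<std::string> Visited;
  std::vector<std::string> Stack{Root};
  while (!Stack.empty()) {
    std::string Caller = Stack.back();
    Stack.pop_back();
    if (!Visited.insert(Caller).second)
      continue; // Already expanded this node.
    auto It = CG.find(Caller);
    if (It == CG.end())
      continue; // Leaf function: no outgoing call edges.
    for (const std::string &Callee : It->second) {
      Reached.insert(Callee);
      Stack.push_back(Callee);
    }
  }
  return Reached;
}

// Two distinct nodes share an SCC only if each is reachable from the other.
bool inSameSCC(const CallGraphSketch &CG, const std::string &A,
               const std::string &B) {
  if (A == B)
    return true;
  return reachableCallees(CG, A).count(B) != 0 &&
         reachableCallees(CG, B).count(A) != 0;
}
```

For the chain f1 -> f2 -> f3 -> f4, `reachableCallees` returns {f2, f3, f4} even though f1 and f4 are not in the same SCC. Walking all SCCs reachable from a start node and taking the union of their members gives the same set, which is why both approaches agree in the experiment above; the depth-first walk simply avoids computing the SCC partition.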
				SmallPtrSet<CallGraphNode *, 8> collectCGNodes(CallGraphNode *CGN) {
				SmallPtrSet<CallGraphNode *, 8> CGNodes;

				for (scc_iterator<CallGraphNode *> I = scc_begin(CGN); !I.isAtEnd(); ++I) {
				foad: There is no need to find strongly connected components. This should use a df_iterator instead.
				foad: D104704
				const std::vector<CallGraphNode *> &SCC = *I;
				assert(!SCC.empty() && "SCC with no functions?");
				for (auto *CGNode : SCC)
				CGNodes.insert(CGNode);
				}

				return CGNodes;
				}

				// For given kernel, collect all its reachable non-kernel functions.
				SmallPtrSet<Function *, 8> collectReachableCallees(Function *K) {
				SmallPtrSet<Function *, 8> ReachableCallees;

				// Call graph node which represents this kernel.
				auto *KCGN = CG[K];

				// Collect all reachable call graph nodes from the node representing this
				// kernel.
				SmallPtrSet<CallGraphNode *, 8> CGNodes = collectCGNodes(KCGN);

				// Go through collected reachable nodes, visit all their call sites, if the
				// call site is direct, add corresponding callee to reachable callee set, if
				// it is indirect, resolve the indirect call site to potential reachable
				// callees, add them to reachable callee set, and repeat the process for the
				// newly added potential callee nodes.
				//
				// FIXME: Need to handle bit-casted function pointers.
				//
				SmallVector<CallGraphNode *, 8> CGNStack(CGNodes.begin(), CGNodes.end());
				SmallPtrSet<CallGraphNode *, 8> VisitedCGNodes;
				while (!CGNStack.empty()) {
				auto *CGN = CGNStack.pop_back_val();

				if (!VisitedCGNodes.insert(CGN).second)
				continue;

				for (auto GI = CGN->begin(), GE = CGN->end(); GI != GE; ++GI) {
				auto *RCB = cast<CallBase>(GI->first.getValue());
				auto *RCGN = GI->second;

				if (auto *DCallee = RCGN->getFunction()) {
				ReachableCallees.insert(DCallee);
				} else if (RCB->isIndirectCall()) {
				auto *RCBFTy = RCB->getFunctionType();
				for (auto *ACGN : AddressTakenFunctions) {
				auto *ACallee = ACGN->getFunction();
				if (ACallee->getFunctionType() == RCBFTy) {
				ReachableCallees.insert(ACallee);
				SmallPtrSet<CallGraphNode *, 8> IGCNNodes = collectCGNodes(ACGN);
				for (auto *IGCN : IGCNNodes)
				CGNStack.push_back(IGCN);
				}
				}
				}
				}
				}

				return ReachableCallees;
				}

				public:
				explicit CollectReachableCallees(Module &M) : M(M), CG(CallGraph(M)) {
				// Collect address taken functions.
				collectAddressTakenFunctions();
				}

				void collectReachableCallees(
				DenseMap<Function *, SmallPtrSet<Function *, 8>> &KernelToCallees) {
				// Collect reachable callee set for each kernel defined in the module.
				for (Function &F : M.functions()) {
				if (!AMDGPU::isKernelCC(&F))
				continue;
				Function *K = &F;
				KernelToCallees[K] = collectReachableCallees(K);
				}
				}
				};

				void collectReachableCallees(
				Module &M,
				DenseMap<Function *, SmallPtrSet<Function *, 8>> &KernelToCallees) {
				CollectReachableCallees CRC{M};
				CRC.collectReachableCallees(KernelToCallees);
				}

				SmallPtrSet<Function *, 8> collectNonKernelAccessorsOfLDS(GlobalVariable *GV) {
				SmallPtrSet<Function *, 8> LDSAccessors;
				SmallVector<User *, 8> UserStack(GV->users());
				SmallPtrSet<User *, 8> VisitedUsers;

				while (!UserStack.empty()) {
				auto *U = UserStack.pop_back_val();

				// `U` is already visited? continue to next one.
				if (!VisitedUsers.insert(U).second)
				continue;

				// `U` is a global variable which is initialized with LDS. Ignore LDS.
				if (isa<GlobalValue>(U))
				return SmallPtrSet<Function *, 8>();

				// Recursively explore constant users.
				if (isa<Constant>(U)) {
				append_range(UserStack, U->users());
				continue;
				}

				// `U` should be an instruction, if it belongs to a non-kernel function F,
				// then collect F.
				Function *F = cast<Instruction>(U)->getFunction();
				if (!AMDGPU::isKernelCC(F))
				LDSAccessors.insert(F);
				}

				return LDSAccessors;
				}

				DenseMap<Function *, SmallPtrSet<Instruction *, 8>>
				getFunctionToInstsMap(User *U, bool CollectKernelInsts) {
				DenseMap<Function *, SmallPtrSet<Instruction *, 8>> FunctionToInsts;
				SmallVector<User *, 8> UserStack;
				SmallPtrSet<User *, 8> VisitedUsers;

				UserStack.push_back(U);

				while (!UserStack.empty()) {
				auto *UU = UserStack.pop_back_val();

				if (!VisitedUsers.insert(UU).second)
				continue;

				if (isa<GlobalValue>(UU))
				continue;

				if (isa<Constant>(UU)) {
				append_range(UserStack, UU->users());
				continue;
				}

				auto *I = cast<Instruction>(UU);
				Function *F = I->getFunction();
				if (CollectKernelInsts) {
				if (!AMDGPU::isKernelCC(F)) {
				continue;
				}
				} else {
				if (AMDGPU::isKernelCC(F)) {
				continue;
				}
				}

				FunctionToInsts.insert(std::make_pair(F, SmallPtrSet<Instruction *, 8>()));
				FunctionToInsts[F].insert(I);
				}

				return FunctionToInsts;
				}

	bool isKernelCC(const Function *Func) {			bool isKernelCC(const Function *Func) {
	return AMDGPU::isModuleEntryFunctionCC(Func->getCallingConv());			return AMDGPU::isModuleEntryFunctionCC(Func->getCallingConv());
	}			}

	Align getAlign(DataLayout const &DL, const GlobalVariable *GV) {			Align getAlign(DataLayout const &DL, const GlobalVariable *GV) {
	return DL.getValueOrABITypeAlignment(GV->getPointerAlignment(DL),			return DL.getValueOrABITypeAlignment(GV->getPointerAlignment(DL),
	GV->getValueType());			GV->getValueType());
	}			}
	[20 lines not shown]

	void replaceConstantUsesInFunction(ConstantExpr *C, const Function *F) {			void replaceConstantUsesInFunction(ConstantExpr *C, const Function *F) {
	SetVector<Instruction *> InstUsers;			SetVector<Instruction *> InstUsers;

	collectFunctionUses(C, F, InstUsers);			collectFunctionUses(C, F, InstUsers);
	for (Instruction *I : InstUsers) {			for (Instruction *I : InstUsers) {
	convertConstantExprsToInstructions(I, C);			convertConstantExprsToInstructions(I, C);
	}			}
	}			}
				arsenm: This should really be a generic utility function.

	bool hasUserInstruction(const GlobalValue *GV) {			bool hasUserInstruction(const GlobalValue *GV) {
	SmallPtrSet<const User *, 8> Visited;			SmallPtrSet<const User *, 8> Visited;
	SmallVector<const User *, 16> Stack(GV->users());			SmallVector<const User *, 16> Stack(GV->users());

	while (!Stack.empty()) {			while (!Stack.empty()) {
	const User *U = Stack.pop_back_val();			const User *U = Stack.pop_back_val();

	[120 lines not shown]

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll

	[36 lines not shown]
	; GCN-O0-NEXT: Expand Atomic instructions			; GCN-O0-NEXT: Expand Atomic instructions
	; GCN-O0-NEXT: AMDGPU Lower Intrinsics			; GCN-O0-NEXT: AMDGPU Lower Intrinsics
	; GCN-O0-NEXT: AMDGPU Inline All Functions			; GCN-O0-NEXT: AMDGPU Inline All Functions
	; GCN-O0-NEXT: CallGraph Construction			; GCN-O0-NEXT: CallGraph Construction
	; GCN-O0-NEXT: Call Graph SCC Pass Manager			; GCN-O0-NEXT: Call Graph SCC Pass Manager
	; GCN-O0-NEXT: Inliner for always_inline functions			; GCN-O0-NEXT: Inliner for always_inline functions
	; GCN-O0-NEXT: A No-Op Barrier Pass			; GCN-O0-NEXT: A No-Op Barrier Pass
	; GCN-O0-NEXT: Lower OpenCL enqueued blocks			; GCN-O0-NEXT: Lower OpenCL enqueued blocks
				; GCN-O0-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O0-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O0-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O0-NEXT: FunctionPass Manager			; GCN-O0-NEXT: FunctionPass Manager
	; GCN-O0-NEXT: Dominator Tree Construction			; GCN-O0-NEXT: Dominator Tree Construction
	; GCN-O0-NEXT: Post-Dominator Tree Construction			; GCN-O0-NEXT: Post-Dominator Tree Construction
	; GCN-O0-NEXT: Natural Loop Information			; GCN-O0-NEXT: Natural Loop Information
	; GCN-O0-NEXT: Legacy Divergence Analysis			; GCN-O0-NEXT: Legacy Divergence Analysis
	; GCN-O0-NEXT: AMDGPU IR optimizations			; GCN-O0-NEXT: AMDGPU IR optimizations
	; GCN-O0-NEXT: Lower Garbage Collection Instructions			; GCN-O0-NEXT: Lower Garbage Collection Instructions
	[134 lines not shown]
	; GCN-O1-NEXT: Expand Atomic instructions			; GCN-O1-NEXT: Expand Atomic instructions
	; GCN-O1-NEXT: AMDGPU Lower Intrinsics			; GCN-O1-NEXT: AMDGPU Lower Intrinsics
	; GCN-O1-NEXT: AMDGPU Inline All Functions			; GCN-O1-NEXT: AMDGPU Inline All Functions
	; GCN-O1-NEXT: CallGraph Construction			; GCN-O1-NEXT: CallGraph Construction
	; GCN-O1-NEXT: Call Graph SCC Pass Manager			; GCN-O1-NEXT: Call Graph SCC Pass Manager
	; GCN-O1-NEXT: Inliner for always_inline functions			; GCN-O1-NEXT: Inliner for always_inline functions
	; GCN-O1-NEXT: A No-Op Barrier Pass			; GCN-O1-NEXT: A No-Op Barrier Pass
	; GCN-O1-NEXT: Lower OpenCL enqueued blocks			; GCN-O1-NEXT: Lower OpenCL enqueued blocks
				; GCN-O1-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O1-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O1-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O1-NEXT: FunctionPass Manager			; GCN-O1-NEXT: FunctionPass Manager
	; GCN-O1-NEXT: Infer address spaces			; GCN-O1-NEXT: Infer address spaces
	; GCN-O1-NEXT: AMDGPU Promote Alloca			; GCN-O1-NEXT: AMDGPU Promote Alloca
	; GCN-O1-NEXT: Dominator Tree Construction			; GCN-O1-NEXT: Dominator Tree Construction
	; GCN-O1-NEXT: SROA			; GCN-O1-NEXT: SROA
	; GCN-O1-NEXT: Post-Dominator Tree Construction			; GCN-O1-NEXT: Post-Dominator Tree Construction
	; GCN-O1-NEXT: Natural Loop Information			; GCN-O1-NEXT: Natural Loop Information
	[230 lines not shown]
	; GCN-O1-OPTS-NEXT: Expand Atomic instructions			; GCN-O1-OPTS-NEXT: Expand Atomic instructions
	; GCN-O1-OPTS-NEXT: AMDGPU Lower Intrinsics			; GCN-O1-OPTS-NEXT: AMDGPU Lower Intrinsics
	; GCN-O1-OPTS-NEXT: AMDGPU Inline All Functions			; GCN-O1-OPTS-NEXT: AMDGPU Inline All Functions
	; GCN-O1-OPTS-NEXT: CallGraph Construction			; GCN-O1-OPTS-NEXT: CallGraph Construction
	; GCN-O1-OPTS-NEXT: Call Graph SCC Pass Manager			; GCN-O1-OPTS-NEXT: Call Graph SCC Pass Manager
	; GCN-O1-OPTS-NEXT: Inliner for always_inline functions			; GCN-O1-OPTS-NEXT: Inliner for always_inline functions
	; GCN-O1-OPTS-NEXT: A No-Op Barrier Pass			; GCN-O1-OPTS-NEXT: A No-Op Barrier Pass
	; GCN-O1-OPTS-NEXT: Lower OpenCL enqueued blocks			; GCN-O1-OPTS-NEXT: Lower OpenCL enqueued blocks
				; GCN-O1-OPTS-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O1-OPTS-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O1-OPTS-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O1-OPTS-NEXT: FunctionPass Manager			; GCN-O1-OPTS-NEXT: FunctionPass Manager
	; GCN-O1-OPTS-NEXT: Infer address spaces			; GCN-O1-OPTS-NEXT: Infer address spaces
	; GCN-O1-OPTS-NEXT: AMDGPU Promote Alloca			; GCN-O1-OPTS-NEXT: AMDGPU Promote Alloca
	; GCN-O1-OPTS-NEXT: Dominator Tree Construction			; GCN-O1-OPTS-NEXT: Dominator Tree Construction
	; GCN-O1-OPTS-NEXT: SROA			; GCN-O1-OPTS-NEXT: SROA
	; GCN-O1-OPTS-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O1-OPTS-NEXT: Basic Alias Analysis (stateless AA impl)
	; GCN-O1-OPTS-NEXT: Function Alias Analysis Results			; GCN-O1-OPTS-NEXT: Function Alias Analysis Results
	[263 lines not shown]
	; GCN-O2-NEXT: Expand Atomic instructions			; GCN-O2-NEXT: Expand Atomic instructions
	; GCN-O2-NEXT: AMDGPU Lower Intrinsics			; GCN-O2-NEXT: AMDGPU Lower Intrinsics
	; GCN-O2-NEXT: AMDGPU Inline All Functions			; GCN-O2-NEXT: AMDGPU Inline All Functions
	; GCN-O2-NEXT: CallGraph Construction			; GCN-O2-NEXT: CallGraph Construction
	; GCN-O2-NEXT: Call Graph SCC Pass Manager			; GCN-O2-NEXT: Call Graph SCC Pass Manager
	; GCN-O2-NEXT: Inliner for always_inline functions			; GCN-O2-NEXT: Inliner for always_inline functions
	; GCN-O2-NEXT: A No-Op Barrier Pass			; GCN-O2-NEXT: A No-Op Barrier Pass
	; GCN-O2-NEXT: Lower OpenCL enqueued blocks			; GCN-O2-NEXT: Lower OpenCL enqueued blocks
				; GCN-O2-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O2-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O2-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O2-NEXT: FunctionPass Manager			; GCN-O2-NEXT: FunctionPass Manager
	; GCN-O2-NEXT: Infer address spaces			; GCN-O2-NEXT: Infer address spaces
	; GCN-O2-NEXT: AMDGPU Promote Alloca			; GCN-O2-NEXT: AMDGPU Promote Alloca
	; GCN-O2-NEXT: Dominator Tree Construction			; GCN-O2-NEXT: Dominator Tree Construction
	; GCN-O2-NEXT: SROA			; GCN-O2-NEXT: SROA
	; GCN-O2-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O2-NEXT: Basic Alias Analysis (stateless AA impl)
	; GCN-O2-NEXT: Function Alias Analysis Results			; GCN-O2-NEXT: Function Alias Analysis Results
	[264 lines not shown]
	; GCN-O3-NEXT: Expand Atomic instructions			; GCN-O3-NEXT: Expand Atomic instructions
	; GCN-O3-NEXT: AMDGPU Lower Intrinsics			; GCN-O3-NEXT: AMDGPU Lower Intrinsics
	; GCN-O3-NEXT: AMDGPU Inline All Functions			; GCN-O3-NEXT: AMDGPU Inline All Functions
	; GCN-O3-NEXT: CallGraph Construction			; GCN-O3-NEXT: CallGraph Construction
	; GCN-O3-NEXT: Call Graph SCC Pass Manager			; GCN-O3-NEXT: Call Graph SCC Pass Manager
	; GCN-O3-NEXT: Inliner for always_inline functions			; GCN-O3-NEXT: Inliner for always_inline functions
	; GCN-O3-NEXT: A No-Op Barrier Pass			; GCN-O3-NEXT: A No-Op Barrier Pass
	; GCN-O3-NEXT: Lower OpenCL enqueued blocks			; GCN-O3-NEXT: Lower OpenCL enqueued blocks
				; GCN-O3-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O3-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O3-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O3-NEXT: FunctionPass Manager			; GCN-O3-NEXT: FunctionPass Manager
	; GCN-O3-NEXT: Infer address spaces			; GCN-O3-NEXT: Infer address spaces
	; GCN-O3-NEXT: AMDGPU Promote Alloca			; GCN-O3-NEXT: AMDGPU Promote Alloca
	; GCN-O3-NEXT: Dominator Tree Construction			; GCN-O3-NEXT: Dominator Tree Construction
	; GCN-O3-NEXT: SROA			; GCN-O3-NEXT: SROA
	; GCN-O3-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O3-NEXT: Basic Alias Analysis (stateless AA impl)
	; GCN-O3-NEXT: Function Alias Analysis Results			; GCN-O3-NEXT: Function Alias Analysis Results
	[258 lines not shown]

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; The lds global @lds_used_within_func is used within non-kernel function @func_uses_lds
				; which is reachable from kernel @kernel_reaches_lds, hence pointer replacement takes place
				; for @lds_used_within_func.
				;

				; Original LDS should exist.
				; CHECK: @lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4

				; Pointer should be created.
				; CHECK: @lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				rampitec: This likely should not be i16. You could do it this way:
				    @lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i8 addrspace(3)* undef, align 2
				Then use a bitcast to i8 addrspace(3)* instead of ptrtoint on the store. But I still think that creating the module LDS right here is better.
				hsmhsm (author): The main goal of using i16 as the pointer (with ptrtoint and GEP) is to save as much memory as possible. Using "i8 addrspace(3)*" means we are going to allocate memory the size of a pointer (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of the i16 type here. Also, I do not think it is a good idea to create the module LDS right here. If that were the case, we could have put this code within the module LDS pass itself instead of having a separate pass (I did that initially; later we decided that the pointer-replacement code should go in a separate pass). This pass is already too big. Let's not change the design again and further complicate things, which would further delay submitting the code and lead to other serious problems.
				rampitec:
				> The main goal of using i16 as the pointer (with ptrtoint and GEP) is to save as much memory as possible. Using "i8 addrspace(3)*" means we are going to allocate memory the size of a pointer (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of the i16 type here.
				Shouldn't the sizeof a local pointer be 2? Checked the DL: p3:32:32. Sigh!
				> Also, I do not think it is a good idea to create the module LDS right here. If that were the case, we could have put this code within the module LDS pass itself instead of having a separate pass (I did that initially; later we decided that the pointer-replacement code should go in a separate pass). This pass is already too big. Let's not change the design again and further complicate things, which would further delay submitting the code and lead to other serious problems.
				My concern is that you assume these i16 pointers will start at null, but that might not be the case if you rely on the module LDS pass to pack them into the struct along with other LDS and let it sort them. Can you add a test where you have these pointers and also another module LDS variable with higher alignment and size, so that the module LDS pass packs them together? We could then inspect the output of the two passes and the ISA.
				hsmhsm (author): No, the concern you raise here does not arise at all. Please have a look at the new test case, replace-lds-by-ptr-lds-offsets.ll. I have given a detailed description within this test, and I hope it will clarify your doubt above.
				rampitec: Thanks, that helps. I tracked the pointers and stores there, and it seems to be consistent. Matt raised a concern that IR passes may refuse to produce a 'noalias' answer in the presence of null, but that is another concern which we may be able to solve later. In particular, any operation on such a pointer is invariant and noalias in a given function; essentially it is just a single load, so we may just annotate it accordingly. Not required for this patch though.
				; Pointer replacement code should be added.
				define internal void @func_uses_lds() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_func.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_func, i32 0, i32 0
				ret void
				}

				; No change
				define internal void @func_does_not_use_lds_3() {
				; CHECK-LABEL: entry:
				; CHECK: call void @func_uses_lds()
				; CHECK: ret void
				entry:
				call void @func_uses_lds()
				ret void
				}

				; No change
				define internal void @func_does_not_use_lds_2() {
				; CHECK-LABEL: entry:
				; CHECK: call void @func_uses_lds()
				; CHECK: ret void
				entry:
				call void @func_uses_lds()
				ret void
				}

				; No change
				define internal void @func_does_not_use_lds_1() {
				; CHECK-LABEL: entry:
				; CHECK: call void @func_does_not_use_lds_2()
				; CHECK: call void @func_does_not_use_lds_3()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_2()
				call void @func_does_not_use_lds_3()
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_reaches_lds() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_func to i16), i16 addrspace(3)* @lds_used_within_func.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @func_does_not_use_lds_1()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_1()
				ret void
				}

				; No change here since this kernel does not reach @func_uses_lds which uses lds.
				define protected amdgpu_kernel void @kernel_does_not_reach_lds() {
				; CHECK-LABEL: entry:
				; CHECK: ret void
				entry:
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There are three lds globals defined here, and these three lds are used respectively within
				; three non-kernel functions. There are three kernels, which call two of the non-kernel functions.
				; Hence pointer replacement should take place for all three lds, and pointer initialization within
				; kernel should selectively happen depending on which lds is reachable from the kernel.
				;

				; Original LDS should exist.
				; CHECK: @lds_used_within_function_1 = internal addrspace(3) global [1 x i32] undef, align 4
				; CHECK: @lds_used_within_function_2 = internal addrspace(3) global [2 x i32] undef, align 4
				; CHECK: @lds_used_within_function_3 = internal addrspace(3) global [3 x i32] undef, align 4
				@lds_used_within_function_1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds_used_within_function_2 = internal addrspace(3) global [2 x i32] undef, align 4
				@lds_used_within_function_3 = internal addrspace(3) global [3 x i32] undef, align 4

				; Pointers should be created.
				; CHECK: @lds_used_within_function_1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_3.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define internal void @function_3() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [3 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [3 x i32], [3 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [3 x i32], [3 x i32] addrspace(3)* @lds_used_within_function_3, i32 0, i32 0
				ret void
				}

				; Pointer replacement code should be added.
				define internal void @function_2() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [2 x i32] addrspace(3)*
				rampitec: We had problems with null before. Instruction selection will happily allocate it overlapping with the first non-null variable. That is how AMDGPULowerModuleLDSPass came to use a struct instead of null. Did you check that we are not introducing this again in the final ISA?
				hsmhsm (author): The null represents the base address of LDS memory. My understanding is that, earlier, instruction selection went wrong because of a bug where we were not actually allocating any memory for llvm.amdgcn.module.lds, yet we were accessing it at address 0. Instruction selection was then allocating address 0 to some other LDS (since this address was not occupied), so there was an issue. Nevertheless, let me look into the ISA and see whether we are fine or whether null causes any issue.
				hsmhsm (author): Further, I looked into the ISA, and it looks good to me. For example, consider the input code below.

				    @lds.1 = internal addrspace(3) global i32 undef, align 4
				    @lds.2 = internal addrspace(3) global i64 undef, align 4

				    define internal void @func_uses_lds() {
				    entry:
				      store i32 7, i32 addrspace(3)* @lds.1
				      store i64 31, i64 addrspace(3)* @lds.2
				      ret void
				    }

				    define protected amdgpu_kernel void @kernel_reaches_lds() {
				    entry:
				      call void @func_uses_lds()
				      ret void
				    }

				The output from pointer replacement is:

				    @lds.1 = internal addrspace(3) global i32 undef, align 4
				    @lds.2 = internal addrspace(3) global i64 undef, align 4
				    @lds.1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				    @lds.2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				    define internal void @func_uses_lds() {
				    entry:
				      %0 = load i16, i16 addrspace(3)* @lds.2.ptr, align 2
				      %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				      %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
				      %3 = load i16, i16 addrspace(3)* @lds.1.ptr, align 2
				      %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
				      %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
				      store i32 7, i32 addrspace(3)* %5, align 4
				      store i64 31, i64 addrspace(3)* %2, align 4
				      ret void
				    }

				    define protected amdgpu_kernel void @kernel_reaches_lds() {
				    entry:
				      %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				      %1 = icmp eq i32 %0, 0
				      br i1 %1, label %2, label %3

				    2: ; preds = %entry
				      store i16 ptrtoint (i64 addrspace(3)* @lds.2 to i16), i16 addrspace(3)* @lds.2.ptr, align 2
				      store i16 ptrtoint (i32 addrspace(3)* @lds.1 to i16), i16 addrspace(3)* @lds.1.ptr, align 2
				      br label %3

				    3: ; preds = %entry, %2
				      call void @llvm.amdgcn.wave.barrier()
				      call void @func_uses_lds()
				      ret void
				    }

				The output from module (kernel) LDS lowering is:

				    %llvm.amdgcn.module.lds.t = type { i16, i16 }
				    %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t = type { i64, i32 }

				    @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 2
				    @llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				    @llvm.amdgcn.kernel.kernel_reaches_lds.lds = internal addrspace(3) global %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t undef, align 8

				    define internal void @func_uses_lds() {
				    entry:
				      %0 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
				      %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				      %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
				      %3 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
				      %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
				      %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
				      store i32 7, i32 addrspace(3)* %5, align 4
				      store i64 31, i64 addrspace(3)* %2, align 4
				      ret void
				    }

				    define protected amdgpu_kernel void @kernel_reaches_lds() {
				    entry:
				      call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				      %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				      %1 = icmp eq i32 %0, 0
				      br i1 %1, label %2, label %5

				    2: ; preds = %entry
				      %3 = ptrtoint i64 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 0) to i16
				      store i16 %3, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
				      %4 = ptrtoint i32 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 1) to i16
				      store i16 %4, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
				      br label %5

				    5: ; preds = %entry, %2
				      call void @llvm.amdgcn.wave.barrier()
				      call void @func_uses_lds()
				      ret void
				    }

				What it fundamentally means is:
				    Address 0 holds the address of @lds.1. We load the address of @lds.1 from address 0 and then store the value 7 to @lds.1.
				    Address 0 + 2 (offset 2) holds the address of @lds.2. We load the address of @lds.2 from address 0 + 2 and then store the value 31 to @lds.2.
				That is what happens in the ISA below, which is produced for the above input IR.

				    func_uses_lds: ; @func_uses_lds
				    ; %bb.0: ; %entry
				      s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				      v_mov_b32_e32 v1, 0
				      ds_read_i16 v2, v1
				      ds_read_i16 v3, v1 offset:2
				      v_mov_b32_e32 v4, 7
				      v_mov_b32_e32 v0, 31
				      s_waitcnt lgkmcnt(1)
				      ds_write_b32 v2, v4
				      s_waitcnt lgkmcnt(1)
				      ds_write_b64 v3, v[0:1]
				      s_waitcnt lgkmcnt(0)
				      s_setpc_b64 s[30:31]
				rampitecUnsubmitted Done Reply Inline Actions So the idea is llvm.amdgcn.module.lds will only contain converted pointers and since it is forced to be allocated at address zero these pointers will fall into the allocated space? We need to be super careful here because if anything else will be injected into the llvm.amdgcn.module.lds by the module lds pass it will all break. In fact there seems to be no work for module lds after this for the module, only for kernels. You probably can just create module structure right here. Then module lds should skip run on module (but not on kernels) if that variable already exists. It will also untangle dependency between passes. In addition you will resolve issue with null as you would just use that structure and Matt's concern about ptrtoint. rampitec: So the idea is llvm.amdgcn.module.lds will only contain converted pointers and since it is…
hsmhsm: As I mentioned in one of my earlier comments, we still need llvm.amdgcn.module.lds. We cannot remove it for now.

The main reason is that, for a given use of LDS within some non-kernel function, we need to make sure that the appropriate memory for the LDS is accessed, based on which kernel called that function. We have not yet solved this issue. The currently available solution is llvm.amdgcn.module.lds. But we should attempt to avoid wasting memory on llvm.amdgcn.module.lds, hence this pointer replacement pass.

What this pass does is below:

(1) Create an i16 pointer "lds.ptr" to "lds", where "lds" is used in some non-kernel function.
(2) Within the kernel: lds.ptr = ptrtoint(lds)
(3) Within the non-kernel function where lds is used: lds.addr = GEP(null, lds.ptr) // where null is the base address of the LDS memory, and we use GEP here to avoid inttoptr. Then replace the use of "lds" by "lds.addr" after appropriately bitcasting.

What it means to the LowerModuleLDS pass is below:

(1) The use of "lds" is moved to the kernel (pointer initialization), hence it is properly allocated within the kernel as a part of the kernel-specific struct, and more importantly it is allocated only if there is a call from the kernel to the function. Otherwise there is no pointer initialization, hence no usage, and hence no allocation.
(2) It sees the use of "lds.ptr" within the non-kernel function, and hence it lowers this pointer by putting it within the module struct.

Otherwise, we still allocate memory for llvm.amdgcn.module.lds within every kernel. But now we are not wasting too much memory.

In any case, within every kernel, llvm.amdgcn.module.lds is allocated at address 0. All the other variables (belonging to the kernel) are allocated at some non-zero addresses. In this way, I do not see any problem here, unless I am missing some corner but serious case of conflicting with address 0.
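The three steps described above can be modelled on the host as a rough sketch. This is not code from the patch: `LdsModel`, `kernelInitPointers`, `funcUsesLds`, and the two slot constants are hypothetical names, and the LDS memory of one workgroup is assumed to be a flat byte array addressed by 16-bit offsets, with the module struct (holding only the pointer slots) placed at address 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side model of one workgroup's LDS memory.
// Nothing here is an LLVM or AMDGPU API.
struct LdsModel {
  std::vector<std::uint8_t> bytes;
  explicit LdsModel(std::size_t size) : bytes(size, 0) {}

  template <typename T> void store(std::uint16_t addr, T value) {
    assert(addr + sizeof(T) <= bytes.size());
    std::memcpy(bytes.data() + addr, &value, sizeof(T));
  }

  template <typename T> T load(std::uint16_t addr) const {
    assert(addr + sizeof(T) <= bytes.size());
    T value;
    std::memcpy(&value, bytes.data() + addr, sizeof(T));
    return value;
  }
};

// Assumed layout, mirroring the example in the thread: the module struct
// sits at address 0 and contains only the 16-bit pointer slots.
constexpr std::uint16_t kLds1PtrSlot = 0; // slot holding the address of @lds.1
constexpr std::uint16_t kLds2PtrSlot = 2; // slot holding the address of @lds.2

// Step (2), kernel side: lds.ptr = ptrtoint(lds).
void kernelInitPointers(LdsModel &lds, std::uint16_t lds1Addr,
                        std::uint16_t lds2Addr) {
  lds.store<std::uint16_t>(kLds1PtrSlot, lds1Addr);
  lds.store<std::uint16_t>(kLds2PtrSlot, lds2Addr);
}

// Step (3), non-kernel side: lds.addr = GEP(null, lds.ptr), followed by the
// original accesses (store 7 to @lds.1, store 31 to @lds.2).
void funcUsesLds(LdsModel &lds) {
  std::uint16_t lds1Addr = lds.load<std::uint16_t>(kLds1PtrSlot);
  std::uint16_t lds2Addr = lds.load<std::uint16_t>(kLds2PtrSlot);
  lds.store<std::uint32_t>(lds1Addr, 7);
  lds.store<std::uint64_t>(lds2Addr, 31);
}
```

The non-kernel function never names @lds.1 or @lds.2 directly; it only reads the slots at fixed addresses, so each kernel is free to place the real variables wherever its own kernel struct ends up.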
				; CHECK: %gep = getelementptr inbounds [2 x i32], [2 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [2 x i32], [2 x i32] addrspace(3)* @lds_used_within_function_2, i32 0, i32 0
				ret void
				}

				; Pointer replacement code should be added.
				define internal void @function_1() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [1 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds_used_within_function_1, i32 0, i32 0
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_calls_function_3_and_1() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([3 x i32] addrspace(3)* @lds_used_within_function_3 to i16), i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: store i16 ptrtoint ([1 x i32] addrspace(3)* @lds_used_within_function_1 to i16), i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @function_3()
				; CHECK: call void @function_1()
				; CHECK: ret void
				entry:
				call void @function_3()
				call void @function_1()
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_calls_function_2_and_3() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([3 x i32] addrspace(3)* @lds_used_within_function_3 to i16), i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: store i16 ptrtoint ([2 x i32] addrspace(3)* @lds_used_within_function_2 to i16), i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @function_2()
				; CHECK: call void @function_3()
				; CHECK: ret void
				entry:
				call void @function_2()
				call void @function_3()
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_calls_function_1_and_2() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([2 x i32] addrspace(3)* @lds_used_within_function_2 to i16), i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: store i16 ptrtoint ([1 x i32] addrspace(3)* @lds_used_within_function_1 to i16), i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @function_1()
				; CHECK: call void @function_2()
				; CHECK: ret void
				entry:
				call void @function_1()
				call void @function_2()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-global-scope-use.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; None of the lds are pointer-replaced since they are all used in global scope in one way or another.
				;

				; CHECK: @lds = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @lds.1 = addrspace(3) global i16 undef, align 2
				; CHECK: @lds.2 = addrspace(3) global i32 undef, align 4
				; CHECK: @lds.3 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
				@lds = internal addrspace(3) global [4 x i32] undef, align 4
				@lds.1 = addrspace(3) global i16 undef, align 2
				@lds.2 = addrspace(3) global i32 undef, align 4
				@lds.3 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1

				; CHECK: @global_var = addrspace(1) global float* addrspacecast (float addrspace(3)* bitcast ([4 x i32] addrspace(3)* @lds to float addrspace(3)*) to float*), align 8
				; CHECK: @llvm.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (i16 addrspace(3)* @lds.1 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				; CHECK: @llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (i32 addrspace(3)* @lds.2 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				; CHECK: @alias.to.lds.3 = alias [1 x i8], [1 x i8] addrspace(3)* @lds.3
				@global_var = addrspace(1) global float* addrspacecast ([4 x i32] addrspace(3)* @lds to float*), align 8
				@llvm.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (i16 addrspace(3)* @lds.1 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				@llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (i32 addrspace(3)* @lds.2 to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				@alias.to.lds.3 = alias [1 x i8], [1 x i8] addrspace(3)* @lds.3

				; CHECK-NOT: @lds.ptr
				; CHECK-NOT: @lds.1.ptr
				; CHECK-NOT: @lds.2.ptr
				; CHECK-NOT: @lds.3.ptr

				define void @f0() {
				; CHECK-LABEL: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds, i32 0, i32 0
				; CHECK: %ld1 = load i16, i16 addrspace(3)* @lds.1
				; CHECK: %ld2 = load i32, i32 addrspace(3)* @lds.2
				; CHECK: %gep2 = getelementptr inbounds [1 x i8], [1 x i8] addrspace(3)* @lds.3, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds, i32 0, i32 0
				%ld1 = load i16, i16 addrspace(3)* @lds.1
				%ld2 = load i32, i32 addrspace(3)* @lds.2
				%gep2 = getelementptr inbounds [1 x i8], [1 x i8] addrspace(3)* @lds.3, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: call void @f0()
				; CHECK: ret void
				entry:
				call void @f0()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-inline-asm-call.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; We do not know what to do with inline asm call, we ignore it, hence pointer replacement for
				; @used_only_within_func does not take place.
				;

				; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
				@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4

				; CHECK-NOT: @used_only_within_func.ptr

				define void @f0(i32 %x) {
				; CHECK-LABEL: entry:
				; CHECK: store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_func, i32 0, i32 0) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_func, i32 0, i32 0) to i32*) to i64)) to i32*), align 4
				; CHECK: ret void
				entry:
				store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

				define amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: call i32 asm "s_mov_b32 $0, 0", "=s"()
				; CHECK: ret void
				entry:
				call i32 asm "s_mov_b32 $0, 0", "=s"()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-kernel-only-used-lds.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION ;
				;
				; LDS global @used_only_within_kern is used only within kernel @k0, hence pointer replacement
				; does not take place for @used_only_within_kern.
				;

				; CHECK: @used_only_within_kern = addrspace(3) global [4 x i32] undef, align 4
				@used_only_within_kern = addrspace(3) global [4 x i32] undef, align 4

				; CHECK-NOT: @used_only_within_kern.ptr

				define amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: %ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_kern, i32 0, i32 0) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_kern, i32 0, i32 0) to i32*) to i64)) to i32*), align 4
				; CHECK: %mul = mul i32 %ld, 2
				; CHECK: store i32 %mul, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_kern, i32 0, i32 0) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_kern, i32 0, i32 0) to i32*) to i64)) to i32*), align 4
				; CHECK: ret void
				entry:
				%ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				%mul = mul i32 %ld, 2
				store i32 %mul, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-not-reachable-lds.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION ;
				;
				; LDS global @not-reachable-lds is used within non-kernel function @f0, but @f0 is not
				; reachable from kernel @k, hence pointer replacement does not take place.
				;

				; CHECK: @not-reachable-lds = internal addrspace(3) global [4 x i32] undef, align 4
				@not-reachable-lds = internal addrspace(3) global [4 x i32] undef, align 4

				; CHECK-NOT: @not-reachable-lds.ptr

				define internal void @f0() {
				; CHECK-LABEL: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @not-reachable-lds, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @not-reachable-lds, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: ret void
				entry:
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-small-lds.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION ;
				;
				; LDS global @small_lds is used within non-kernel function @f0, and @f0 is reachable
				; from kernel @k0, but since @small_lds is too small for pointer replacement, pointer
				; replacement does not take place.
				;

				; CHECK: @small_lds = addrspace(3) global i8 undef, align 1
				@small_lds = addrspace(3) global i8 undef, align 1

				; CHECK-NOT: @small_lds.ptr

				define void @f0() {
				; CHECK-LABEL: entry:
				; CHECK: store i8 1, i8 addrspace(3)* @small_lds, align 1
				; CHECK: ret void
				entry:
				store i8 1, i8 addrspace(3)* @small_lds, align 1
				ret void
				}

				define amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: call void @f0()
				; CHECK: ret void
				entry:
				call void @f0()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-diamond-shape.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; The lds global @lds_used_within_func is used within non-kernel function @func_uses_lds
				; which is indirectly reachable from kernel @kernel_reaches_lds, hence pointer replacement
				; takes place for @lds_used_within_func.

				; Original LDS should exist.
				; CHECK: @lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4

				; Function pointer should exist as it is.
				; CHECK: @ptr_to_func = internal local_unnamed_addr externally_initialized global void ()* @func_uses_lds, align 8
				@ptr_to_func = internal local_unnamed_addr externally_initialized global void ()* @func_uses_lds, align 8

				; Pointer should be created.
				; CHECK: @lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define internal void @func_uses_lds() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_func.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_func, i32 0, i32 0
				ret void
				}

				; No change
				define internal void @func_does_not_use_lds_3() {
				; CHECK-LABEL: entry:
				; CHECK: %fptr = load void ()*, void ()** @ptr_to_func, align 8
				; CHECK: call void %fptr()
				; CHECK: ret void
				entry:
				%fptr = load void ()*, void ()** @ptr_to_func, align 8
				call void %fptr()
				ret void
				}

				; No change
				define internal void @func_does_not_use_lds_2() {
				; CHECK-LABEL: entry:
				; CHECK: %fptr = load void ()*, void ()** @ptr_to_func, align 8
				; CHECK: call void %fptr()
				; CHECK: ret void
				entry:
				%fptr = load void ()*, void ()** @ptr_to_func, align 8
				call void %fptr()
				ret void
				}

				; No change
				define internal void @func_does_not_use_lds_1() {
				; CHECK-LABEL: entry:
				; CHECK: call void @func_does_not_use_lds_2()
				; CHECK: call void @func_does_not_use_lds_3()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_2()
				call void @func_does_not_use_lds_3()
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_reaches_lds() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_func to i16), i16 addrspace(3)* @lds_used_within_func.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @func_does_not_use_lds_1()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_1()
				ret void
				}

				; No change here since this kernel does not reach @func_uses_lds which uses lds.
				define protected amdgpu_kernel void @kernel_does_not_reach_lds() {
				; CHECK-LABEL: entry:
				; CHECK: ret void
				entry:
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-selected_functions.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There are three lds globals defined here, and these three lds are used respectively within
				; three non-kernel functions. There are three kernels, which indirectly call two of the
				; non-kernel functions. Hence pointer replacement should take place for all three lds, and
				; pointer initialization within kernel should selectively happen depending on which lds is
				; reachable from the kernel.
				;

				; Original LDS should exist.
				; CHECK: @lds_used_within_function_1 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @lds_used_within_function_2 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @lds_used_within_function_3 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_1 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_2 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_3 = internal addrspace(3) global [4 x i32] undef, align 4

				; Function pointers should exist.
				; CHECK: @ptr_to_func1 = internal local_unnamed_addr externally_initialized global void (float)* @function_1, align 8
				; CHECK: @ptr_to_func2 = internal local_unnamed_addr externally_initialized global void (i16)* @function_2, align 8
				; CHECK: @ptr_to_func3 = internal local_unnamed_addr externally_initialized global void (i8)* @function_3, align 8
				@ptr_to_func1 = internal local_unnamed_addr externally_initialized global void (float)* @function_1, align 8
				@ptr_to_func2 = internal local_unnamed_addr externally_initialized global void (i16)* @function_2, align 8
				@ptr_to_func3 = internal local_unnamed_addr externally_initialized global void (i8)* @function_3, align 8

				; Pointers should be created.
				; CHECK: @lds_used_within_function_1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_3.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define internal void @function_3(i8 %c) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_3, i32 0, i32 0
				ret void
				}

				; Pointer replacement code should be added.
				define internal void @function_2(i16 %i) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_2, i32 0, i32 0
				ret void
				}

				; Pointer replacement code should be added.
				define internal void @function_1(float %f) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_1, i32 0, i32 0
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_calls_function_3_and_1() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_3 to i16), i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_1 to i16), i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: %fptr3 = load void (i8)*, void (i8)** @ptr_to_func3, align 8
				; CHECK: %fptr1 = load void (float)*, void (float)** @ptr_to_func1, align 8
				; CHECK: call void %fptr3(i8 1)
				; CHECK: call void %fptr1(float 2.000000e+00)
				; CHECK: ret void
				entry:
				%fptr3 = load void (i8)*, void (i8)** @ptr_to_func3, align 8
				%fptr1 = load void (float)*, void (float)** @ptr_to_func1, align 8
				call void %fptr3(i8 1)
				call void %fptr1(float 2.0)
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_calls_function_2_and_3() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_3 to i16), i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_2 to i16), i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: %fptr2 = load void (i16)*, void (i16)** @ptr_to_func2, align 8
				; CHECK: %fptr3 = load void (i8)*, void (i8)** @ptr_to_func3, align 8
				; CHECK: call void %fptr2(i16 3)
				; CHECK: call void %fptr3(i8 4)
				; CHECK: ret void
				entry:
				%fptr2 = load void (i16)*, void (i16)** @ptr_to_func2, align 8
				%fptr3 = load void (i8)*, void (i8)** @ptr_to_func3, align 8
				call void %fptr2(i16 3)
				call void %fptr3(i8 4)
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_calls_function_1_and_2() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_2 to i16), i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_1 to i16), i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: %fptr1 = load void (float)*, void (float)** @ptr_to_func1, align 8
				; CHECK: %fptr2 = load void (i16)*, void (i16)** @ptr_to_func2, align 8
				; CHECK: call void %fptr1(float 5.000000e+00)
				; CHECK: call void %fptr2(i16 6)
				; CHECK: ret void
				entry:
				%fptr1 = load void (float)*, void (float)** @ptr_to_func1, align 8
				%fptr2 = load void (i16)*, void (i16)** @ptr_to_func2, align 8
				call void %fptr1(float 5.0)
				call void %fptr2(i16 6)
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-signature-match.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There are three lds globals defined here, and these three lds are used respectively within
				; three non-kernel functions. There is one kernel which indirectly calls one of the non-kernel
				; functions. But since all the three non-kernel functions have same signature, all three
				; non-kernel functions are resolved as potential callees for the indirect call-site. Hence we end up
				; with pointer replacement for all three lds globals.
				;
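
The resolution rule described in this test can be sketched as follows. This is only an illustration under the assumption that function types are compared as plain strings; `Candidate` and `potentialCallees` are hypothetical names and are not taken from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a non-kernel function that may be called
// indirectly.
struct Candidate {
  std::string name;      // function name, e.g. "function_1"
  std::string signature; // textual function type, e.g. "void (i16)"
};

// Every candidate whose signature equals the call-site signature is treated
// as a potential callee of the indirect call.
std::vector<std::string>
potentialCallees(const std::vector<Candidate> &functions,
                 const std::string &callSiteSignature) {
  assert(!callSiteSignature.empty());
  std::vector<std::string> callees;
  for (const Candidate &f : functions)
    if (f.signature == callSiteSignature)
      callees.push_back(f.name);
  return callees;
}
```

With three candidates that all have type `void (i16)`, a single indirect call of that type yields all three as potential callees, which is why the kernel in this test initializes all three pointers.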

				; Original LDS should exist.
				; CHECK: @lds_used_within_function_1 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @lds_used_within_function_2 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @lds_used_within_function_3 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_1 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_2 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_3 = internal addrspace(3) global [4 x i32] undef, align 4

				; Function pointers should exist.
				; CHECK: @ptr_to_func1 = internal local_unnamed_addr externally_initialized global void (i16)* @function_1, align 8
				; CHECK: @ptr_to_func2 = internal local_unnamed_addr externally_initialized global void (i16)* @function_2, align 8
				; CHECK: @ptr_to_func3 = internal local_unnamed_addr externally_initialized global void (i16)* @function_3, align 8
				@ptr_to_func1 = internal local_unnamed_addr externally_initialized global void (i16)* @function_1, align 8
				@ptr_to_func2 = internal local_unnamed_addr externally_initialized global void (i16)* @function_2, align 8
				@ptr_to_func3 = internal local_unnamed_addr externally_initialized global void (i16)* @function_3, align 8

				; Pointers should be created.
				; CHECK: @lds_used_within_function_1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_3.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define internal void @function_3(i16 %i) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_3, i32 0, i32 0
				ret void
				}

				; Pointer replacement code should be added.
				define internal void @function_2(i16 %i) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_2, i32 0, i32 0
				ret void
				}

				; Pointer replacement code should be added.
				define internal void @function_1(i16 %i) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_1, i32 0, i32 0
				ret void
				}

				; Pointer initialization code should be added
				define protected amdgpu_kernel void @kernel_indirectly_calls_function_1() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_3 to i16), i16 addrspace(3)* @lds_used_within_function_3.ptr, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_2 to i16), i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_1 to i16), i16 addrspace(3)* @lds_used_within_function_1.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: %fptr1 = load void (i16)*, void (i16)** @ptr_to_func1, align 8
				; CHECK: call void %fptr1(i16 6)
				; CHECK: ret void
				entry:
				%fptr1 = load void (i16)*, void (i16)** @ptr_to_func1, align 8
				call void %fptr1(i16 6)
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-lds-offsets.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck --check-prefix=POINTER-REPLACE %s
				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-lower-module-lds < %s | FileCheck --check-prefix=LOWER_LDS %s
				; RUN: llc -mtriple=amdgcn-- -mcpu=gfx900 < %s | FileCheck --check-prefix=GCN %s

				;
				; DESCRIPTION:
				;
				; 1. There are three lds defined - @lds.1, @lds.2 and @lds.3, which are of types i32, i64, and [2 x i64].
				; @lds.3 is aliased to @alias.to.lds.3
				; 2. @lds.1 is used in function @f1, and @lds.2 is used in function @f2, @alias.to.lds.3 is used in kernel @k1.

				; 3. Pointer-replacement pass replaces @lds.1 and @lds.2 by pointers @lds.1.ptr and @lds.2.ptr respectively.
				; However it does not touch @lds.3 since it is used in global scope (aliased).
				;
				; 4. LDS-lowering pass sees use of @lds.1.ptr in function @f1, use of @lds.2.ptr in function @f2, and use of
				; @lds.3 (via alias @alias.to.lds.3) in kernel @k1. Hence it module lowers these lds into struct instance
				; @llvm.amdgcn.module.lds.
				;
				; The struct member order is - [lds.3, lds.1.ptr, lds.2.ptr]. Since @llvm.amdgcn.module.lds itself is allocated
				; on address 0, lds.3 is allocated on address 0, lds.1.ptr is allocated on address 16, and lds.2.ptr is allocated
				; on address 18.
				;
				; Again LDS-lowering pass sees use of @lds.1 and @lds.2 in kernel. Hence it kernel lowers these lds into struct
				; instance @llvm.amdgcn.kernel.k1.lds.
				;
				; The struct member order is - [@lds.2, @lds.1]. By now, (16 + 2 + 2) 20 bytes of memory are already allocated,
				; so @lds.2 is allocated on address 24 since it needs to be allocated on an 8 byte boundary, and @lds.1 is
				; allocated on address 32.
				;
				; 5. Hence the final GCN ISA looks as below:
				;
				; Within kernel @k1:
				; address 24 is stored in address 18.
				; address 32 is stored in address 16
				;
				; Within function @f1:
				; address 32 is loaded from address 16
				;
				; Within function @f2:
				; address 24 is loaded from address 18
				;


				; POINTER-REPLACE: @lds.1 = addrspace(3) global i32 undef, align 4
				; POINTER-REPLACE: @lds.2 = addrspace(3) global i64 undef, align 8
				; POINTER-REPLACE: @lds.3 = addrspace(3) global [2 x i64] undef, align 16
				; POINTER-REPLACE: @lds.1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER-REPLACE: @lds.2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; POINTER-REPLACE: @alias.to.lds.3 = alias [2 x i64], [2 x i64] addrspace(3)* @lds.3


				; LOWER_LDS-NOT: @lds.1
				; LOWER_LDS-NOT: @lds.2
				; LOWER_LDS-NOT: @lds.3
				; LOWER_LDS: %llvm.amdgcn.module.lds.t = type { [2 x i64], i16, i16 }
				; LOWER_LDS: %llvm.amdgcn.kernel.k1.lds.t = type { i64, i32 }
				; LOWER_LDS: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 16
				; LOWER_LDS: @llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
				; LOWER_LDS: @llvm.amdgcn.kernel.k1.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k1.lds.t undef, align 8
				; LOWER_LDS: @alias.to.lds.3 = alias [2 x i64], getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0)

				@lds.1 = addrspace(3) global i32 undef, align 4
				@lds.2 = addrspace(3) global i64 undef, align 8
				@lds.3 = addrspace(3) global [2 x i64] undef, align 16
				@alias.to.lds.3 = alias [2 x i64], [2 x i64] addrspace(3)* @lds.3

				; POINTER-REPLACE-LABEL: @f1
				; POINTER-REPLACE: %1 = load i16, i16 addrspace(3)* @lds.1.ptr, align 2
				; POINTER-REPLACE: %2 = getelementptr i8, i8 addrspace(3)* null, i16 %1
				; POINTER-REPLACE: %3 = bitcast i8 addrspace(3)* %2 to i32 addrspace(3)*
				; POINTER-REPLACE: store i32 7, i32 addrspace(3)* %3, align 4
				; POINTER-REPLACE: ret void


				; LOWER_LDS-LABEL: @f1
				; LOWER_LDS: %1 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
				; LOWER_LDS: %2 = getelementptr i8, i8 addrspace(3)* null, i16 %1
				; LOWER_LDS: %3 = bitcast i8 addrspace(3)* %2 to i32 addrspace(3)*
				; LOWER_LDS: store i32 7, i32 addrspace(3)* %3, align 4
				; LOWER_LDS: ret void


				; GCN-LABEL: f1:
				; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN: v_mov_b32_e32 v0, 0
				; GCN: ds_read_i16 v0, v0 offset:16
				; GCN: v_mov_b32_e32 v1, 7
				; GCN: s_waitcnt lgkmcnt(0)
				; GCN: ds_write_b32 v0, v1
				; GCN: s_waitcnt lgkmcnt(0)
				; GCN: s_setpc_b64 s[30:31]
				define void @f1() {
				store i32 7, i32 addrspace(3)* @lds.1
				ret void
				}

				; POINTER-REPLACE-LABEL: @f2
				; POINTER-REPLACE: %1 = load i16, i16 addrspace(3)* @lds.2.ptr, align 2
				; POINTER-REPLACE: %2 = getelementptr i8, i8 addrspace(3)* null, i16 %1
				; POINTER-REPLACE: %3 = bitcast i8 addrspace(3)* %2 to i64 addrspace(3)*
				; POINTER-REPLACE: store i64 15, i64 addrspace(3)* %3, align 4
				; POINTER-REPLACE: ret void


				; LOWER_LDS-LABEL: @f2
				; LOWER_LDS: %1 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 2), align 2
				; LOWER_LDS: %2 = getelementptr i8, i8 addrspace(3)* null, i16 %1
				; LOWER_LDS: %3 = bitcast i8 addrspace(3)* %2 to i64 addrspace(3)*
				; LOWER_LDS: store i64 15, i64 addrspace(3)* %3, align 4
				; LOWER_LDS: ret void


				; GCN-LABEL: f2:
				; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN: v_mov_b32_e32 v1, 0
				; GCN: ds_read_i16 v2, v1 offset:18
				; GCN: v_mov_b32_e32 v0, 15
				; GCN: s_waitcnt lgkmcnt(0)
				; GCN: ds_write_b64 v2, v[0:1]
				; GCN: s_waitcnt lgkmcnt(0)
				; GCN: s_setpc_b64 s[30:31]
				define void @f2() {
				store i64 15, i64 addrspace(3)* @lds.2
				ret void
				}

				; POINTER-REPLACE-LABEL: @k1
				; POINTER-REPLACE: %1 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; POINTER-REPLACE: %2 = icmp eq i32 %1, 0
				; POINTER-REPLACE: br i1 %2, label %3, label %4
				;
				; POINTER-REPLACE-LABEL: 3:
				; POINTER-REPLACE: store i16 ptrtoint (i64 addrspace(3)* @lds.2 to i16), i16 addrspace(3)* @lds.2.ptr, align 2
				; POINTER-REPLACE: store i16 ptrtoint (i32 addrspace(3)* @lds.1 to i16), i16 addrspace(3)* @lds.1.ptr, align 2
				; POINTER-REPLACE: br label %4
				;
				; POINTER-REPLACE-LABEL: 4:
				; POINTER-REPLACE: call void @llvm.amdgcn.wave.barrier()
				; POINTER-REPLACE: %bc = bitcast [2 x i64] addrspace(3)* @alias.to.lds.3 to i8 addrspace(3)*
				; POINTER-REPLACE: store i8 3, i8 addrspace(3)* %bc, align 2
				; POINTER-REPLACE: call void @f1()
				; POINTER-REPLACE: call void @f2()
				; POINTER-REPLACE: ret void


				; LOWER_LDS-LABEL: @k1
				; LOWER_LDS: call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
				; LOWER_LDS: %1 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; LOWER_LDS: %2 = icmp eq i32 %1, 0
				; LOWER_LDS: br i1 %2, label %3, label %6
				;
				; LOWER_LDS-LABEL: 3:
				; LOWER_LDS: %4 = ptrtoint i64 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k1.lds.t, %llvm.amdgcn.kernel.k1.lds.t addrspace(3)* @llvm.amdgcn.kernel.k1.lds, i32 0, i32 0) to i16
				; LOWER_LDS: store i16 %4, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 2), align 2
				; LOWER_LDS: %5 = ptrtoint i32 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.k1.lds.t, %llvm.amdgcn.kernel.k1.lds.t addrspace(3)* @llvm.amdgcn.kernel.k1.lds, i32 0, i32 1) to i16
				; LOWER_LDS: store i16 %5, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
				; LOWER_LDS: br label %6
				;
				; LOWER_LDS-LABEL: 6:
				; LOWER_LDS: call void @llvm.amdgcn.wave.barrier()
				; LOWER_LDS: %bc = bitcast [2 x i64] addrspace(3)* @alias.to.lds.3 to i8 addrspace(3)*
				; LOWER_LDS: store i8 3, i8 addrspace(3)* %bc, align 2
				; LOWER_LDS: call void @f1()
				; LOWER_LDS: call void @f2()
				; LOWER_LDS: ret void


				; GCN-LABEL: k1:
				; GCN: s_mov_b32 s8, SCRATCH_RSRC_DWORD0
				; GCN: s_mov_b32 s9, SCRATCH_RSRC_DWORD1
				; GCN: s_mov_b32 s10, -1
				; GCN: s_mov_b32 s11, 0xe00000
				; GCN: s_add_u32 s8, s8, s1
				; GCN: v_mbcnt_lo_u32_b32 v0, -1, 0
				; GCN: s_addc_u32 s9, s9, 0
				; GCN: v_cmp_eq_u32_e32 vcc, 0, v0
				; GCN: s_mov_b32 s32, 0
				; GCN: s_and_saveexec_b64 s[0:1], vcc
				; GCN: s_cbranch_execz BB2_2
				; GCN: v_mov_b32_e32 v0, 24
				; GCN: v_mov_b32_e32 v1, 0
				; GCN: ds_write_b16 v1, v0 offset:18
				; GCN: v_mov_b32_e32 v0, 32
				; GCN: ds_write_b16 v1, v0 offset:16
				; GCN-LABEL: BB2_2:
				; GCN: s_or_b64 exec, exec, s[0:1]
				; GCN: s_getpc_b64 s[0:1]
				; GCN: s_add_u32 s0, s0, f1@gotpcrel32@lo+4
				; GCN: s_addc_u32 s1, s1, f1@gotpcrel32@hi+12
				; GCN: s_load_dwordx2 s[4:5], s[0:1], 0x0
				; GCN: s_mov_b64 s[0:1], s[8:9]
				; GCN: s_mov_b64 s[2:3], s[10:11]
				; GCN: v_mov_b32_e32 v0, alias.to.lds.3@abs32@lo
				; GCN: v_mov_b32_e32 v1, 3
				; ; wave barrier
				; GCN: ds_write_b8 v0, v1
				; GCN: s_waitcnt lgkmcnt(0)
				; GCN: s_swappc_b64 s[30:31], s[4:5]
				; GCN: s_getpc_b64 s[0:1]
				; GCN: s_add_u32 s0, s0, f2@gotpcrel32@lo+4
				; GCN: s_addc_u32 s1, s1, f2@gotpcrel32@hi+12
				; GCN: s_load_dwordx2 s[4:5], s[0:1], 0x0
				; GCN: s_mov_b64 s[0:1], s[8:9]
				; GCN: s_mov_b64 s[2:3], s[10:11]
				; GCN: s_waitcnt lgkmcnt(0)
				; GCN: s_swappc_b64 s[30:31], s[4:5]
				; GCN: s_endpgm
				define amdgpu_kernel void @k1() {
				%bc = bitcast [2 x i64] addrspace(3)* @alias.to.lds.3 to i8 addrspace(3)*
				store i8 3, i8 addrspace(3)* %bc, align 2
				call void @f1()
				call void @f2()
				ret void
				}
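
The address arithmetic in the DESCRIPTION of this test can be reproduced with a short sketch. It is not part of the patch. The helper names `align_to` and `layout` are hypothetical, and the sizes and alignments are taken from the test's globals. It also assumes, as the description does, that the kernel struct is placed directly after the 20 bytes used by the module struct, rounded up to the kernel struct's alignment.

```python
def align_to(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(base, members):
    """Place (name, size, align) members in order, starting at or after base.

    Returns (offsets, end): offsets maps each name to its absolute address,
    and end is the first byte past the last member (no tail padding).
    """
    struct_align = max(alignment for _, _, alignment in members)
    cursor = align_to(base, struct_align)
    offsets = {}
    for name, size, alignment in members:
        cursor = align_to(cursor, alignment)
        offsets[name] = cursor
        cursor += size
    return offsets, cursor


# Module struct { [2 x i64], i16, i16 }, assumed to start at address 0.
module_offsets, module_end = layout(
    0, [("lds.3", 16, 16), ("lds.1.ptr", 2, 2), ("lds.2.ptr", 2, 2)]
)

# Kernel struct { i64, i32 }, assumed to follow the module struct.
kernel_offsets, kernel_end = layout(
    module_end, [("lds.2", 8, 8), ("lds.1", 4, 4)]
)

print(module_offsets)
print(kernel_offsets)
```

Under these assumptions the pointer slots land at addresses 16 and 18, and the values stored into them are 32 and 24, which matches the `ds_write_b16` and `ds_read` offsets checked in the GCN lines.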

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-multiple-lds.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There are three lds globals defined here, and these three lds are used within a single
				; non-kernel function, and this non-kernel function is reachable from kernel. Hence pointer
				; replacement is required for all three lds globals.
				;

				; Original LDS should exist.
				; CHECK: @lds1 = internal addrspace(3) global [1 x i32] undef, align 4
				; CHECK: @lds2 = internal addrspace(3) global [2 x i32] undef, align 4
				; CHECK: @lds3 = internal addrspace(3) global [3 x i32] undef, align 4
				@lds1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds2 = internal addrspace(3) global [2 x i32] undef, align 4
				@lds3 = internal addrspace(3) global [3 x i32] undef, align 4

				; Pointers should be created.
				; CHECK: @lds1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds3.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define internal void @function() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds3.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [3 x i32] addrspace(3)*
				; CHECK: %3 = load i16, i16 addrspace(3)* @lds2.ptr, align 2
				; CHECK: %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
				; CHECK: %5 = bitcast i8 addrspace(3)* %4 to [2 x i32] addrspace(3)*
				; CHECK: %6 = load i16, i16 addrspace(3)* @lds1.ptr, align 2
				; CHECK: %7 = getelementptr i8, i8 addrspace(3)* null, i16 %6
				; CHECK: %8 = bitcast i8 addrspace(3)* %7 to [1 x i32] addrspace(3)*
				; CHECK: %gep1 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %8, i32 0, i32 0
				; CHECK: %gep2 = getelementptr inbounds [2 x i32], [2 x i32] addrspace(3)* %5, i32 0, i32 0
				; CHECK: %gep3 = getelementptr inbounds [3 x i32], [3 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep1 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds1, i32 0, i32 0
				%gep2 = getelementptr inbounds [2 x i32], [2 x i32] addrspace(3)* @lds2, i32 0, i32 0
				%gep3 = getelementptr inbounds [3 x i32], [3 x i32] addrspace(3)* @lds3, i32 0, i32 0
				ret void
				}

				; Pointer initialization code should be added.
				define protected amdgpu_kernel void @kernel() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([3 x i32] addrspace(3)* @lds3 to i16), i16 addrspace(3)* @lds3.ptr, align 2
				; CHECK: store i16 ptrtoint ([2 x i32] addrspace(3)* @lds2 to i16), i16 addrspace(3)* @lds2.ptr, align 2
				; CHECK: store i16 ptrtoint ([1 x i32] addrspace(3)* @lds1 to i16), i16 addrspace(3)* @lds1.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @function()
				; CHECK: ret void
				entry:
				call void @function()
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-same-lds.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There is one lds global defined here, and this lds is used within a single non-kernel
				; function multiple times, and this non-kernel function is reachable from kernel. Hence
				; pointer replacement takes place. But the important note is - store-to/load-from pointer
				; should happen only once irrespective of the number of uses.
				;

				; Original LDS should exist.
				; CHECK: @lds1 = internal addrspace(3) global [1 x i32] undef, align 4
				@lds1 = internal addrspace(3) global [1 x i32] undef, align 4

				; Pointers should be created.
				; CHECK: @lds1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define internal void @function() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds1.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [1 x i32] addrspace(3)*
				; CHECK: %gep1 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: %gep2 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: %gep3 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep1 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds1, i32 0, i32 0
				%gep2 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds1, i32 0, i32 0
				%gep3 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(3)* @lds1, i32 0, i32 0
				ret void
				}

				; Pointer initialization code should be added.
				define protected amdgpu_kernel void @kernel() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([1 x i32] addrspace(3)* @lds1 to i16), i16 addrspace(3)* @lds1.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @function()
				; CHECK: ret void
				entry:
				call void @function()
				ret void
				}
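
The "only once irrespective of the number of uses" property checked above can be illustrated with a small model. This is a sketch and not the pass's implementation. The function name `replace_lds_uses` and the tuple encoding of instructions are hypothetical, and the load, getelementptr and bitcast sequence is collapsed into a single pseudo-instruction named `load-from-pointer`.

```python
def replace_lds_uses(body, lds_globals):
    """Rewrite a function body so each LDS global is loaded via its pointer once.

    body is a list of (result, opcode, operands) tuples. The first use of an
    LDS global emits one pseudo-instruction into the prologue; every later
    use reuses the value produced by that instruction.
    """
    prologue = []     # loads emitted once, placed at function entry
    replacement = {}  # LDS global name -> value that stands in for it
    rewritten = []
    for result, opcode, operands in body:
        new_operands = []
        for operand in operands:
            if operand in lds_globals:
                if operand not in replacement:
                    value = "%" + operand[1:] + ".addr"
                    prologue.append((value, "load-from-pointer", [operand + ".ptr"]))
                    replacement[operand] = value
                operand = replacement[operand]
            new_operands.append(operand)
        rewritten.append((result, opcode, new_operands))
    return prologue + rewritten


# Three uses of the same global, as in @function above.
body = [
    ("%gep1", "getelementptr", ["@lds1", "0", "0"]),
    ("%gep2", "getelementptr", ["@lds1", "0", "0"]),
    ("%gep3", "getelementptr", ["@lds1", "0", "0"]),
]
new_body = replace_lds_uses(body, {"@lds1"})
loads = [inst for inst in new_body if inst[1] == "load-from-pointer"]
```

In this model `loads` contains a single entry and all three getelementptr instructions refer to the same replacement value, mirroring how the CHECK lines expect one load of `@lds1.ptr` followed by three uses of `%2`.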

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr1.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There is one lds global defined here, and this lds is used within a single non-kernel
				; function, as an operand of nested constant expression, and this non-kernel function is
				; reachable from kernel. Hence nested constant expression should be converted into a
				; series of instructions and pointer replacement should take place.
				;

				; Original LDS should exist.
				; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
				@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4

				; Pointers should be created.
				; CHECK: @used_only_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define void @f0(i32 %x) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @used_only_within_func.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i64
				; CHECK: %6 = add i64 %5, %5
				; CHECK: %7 = inttoptr i64 %6 to i32*
				; CHECK: store i32 %x, i32* %7, align 4
				; CHECK: ret void
				entry:
				store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

				; Pointer initialization code should be added.
				define amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @used_only_within_func to i16), i16 addrspace(3)* @used_only_within_func.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @f0(i32 0)
				; CHECK: ret void
				entry:
				call void @f0(i32 0)
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr2.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				; There is one lds global defined here, and this lds is used within a single non-kernel
				; function, as an operand of nested constant expression, and this non-kernel function is
				; reachable from kernel. Hence nested constant expression should be converted into a
				; series of instructions and pointer replacement should take place. But, the important
				; note is - only constant expression operands which use lds should be converted into
				; instructions, other constant expression operands which do not use lds should be left
				; untouched.
				;

				; Original LDS should exist.
				; CHECK: @lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4

				; Non-LDS global should exist as it is.
				; CHECK: @global_var = internal addrspace(1) global [4 x i32] undef, align 4
				@global_var = internal addrspace(1) global [4 x i32] undef, align 4

				; Pointer should be created.
				; CHECK: @lds_used_within_function.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define internal void @function() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 2
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i32
				; CHECK: %6 = add i32 %5, ptrtoint (i32 addrspace(1)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(1)* @global_var, i32 0, i32 2) to i32)
				; CHECK: ret void
				entry:
				%0 = add i32 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 2) to i32*) to i32), ptrtoint (i32 addrspace(1)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(1)* @global_var, i32 0, i32 2) to i32)
				ret void
				}

				; Pointer initialization code should be added.
				define protected amdgpu_kernel void @kernel() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %1 = icmp eq i32 %0, 0
				; CHECK: br i1 %1, label %2, label %3
				;
				; CHECK-LABEL: 2:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function to i16), i16 addrspace(3)* @lds_used_within_function.ptr, align 2
				; CHECK: br label %3
				;
				; CHECK-LABEL: 3:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @function()
				; CHECK: ret void
				entry:
				call void @function()
				ret void
				}
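
The selective conversion described in this test (only constant expression operands that reach an LDS global become instructions) can be sketched as a recursive walk. This is an illustration under assumptions, not the code from the patch. The names `uses_lds` and `lower`, the nested-tuple encoding of constant expressions, and the `%tN` temporaries are all hypothetical.

```python
def uses_lds(expr, lds_globals):
    """Return True if expr (a leaf string or an (opcode, *args) tuple) reaches an LDS global."""
    if isinstance(expr, str):
        return expr in lds_globals
    return any(uses_lds(arg, lds_globals) for arg in expr[1:])


def lower(expr, lds_globals, out):
    """Return an operand for expr, appending instructions to out only where needed."""
    if not uses_lds(expr, lds_globals):
        return expr  # no LDS involved: stays a constant expression
    if isinstance(expr, str):
        return expr  # bare LDS global: pointer replacement is a separate step
    opcode, *args = expr
    operands = [lower(arg, lds_globals, out) for arg in args]
    name = "%t" + str(len(out))
    out.append((name, opcode, operands))
    return name


# Mirrors the shape of the add in @function above.
global_side = ("ptrtoint", ("getelementptr", "@global_var", "0", "2"))
expr = (
    "add",
    ("ptrtoint", ("addrspacecast", ("getelementptr", "@lds", "0", "2"))),
    global_side,
)
out = []
result = lower(expr, {"@lds"}, out)
```

After the call, `out` holds the getelementptr, addrspacecast, ptrtoint and add in that order, while the `@global_var` operand of the add is still the original nested constant expression. That matches the CHECK lines, where the second operand of `%6 = add` remains a `ptrtoint (... getelementptr ...)` constant.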

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-phi-inst.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s | FileCheck %s

				; DESCRIPTION:
				;
				; Replace lds globals used within phi instruction.
				;

				; Original LDS should exist.
				; CHECK: @lds.1 = addrspace(3) global i32 undef, align 4
				; CHECK: @lds.2 = addrspace(3) global i32 undef, align 4
				@lds.1 = addrspace(3) global i32 undef, align 4
				@lds.2 = addrspace(3) global i32 undef, align 4

				; Pointers should be created.
				; CHECK: @lds.1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds.2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				define void @f0(i32 %arg) {
				; CHECK-LABEL: bb:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds.2.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to i32 addrspace(3)*
				; CHECK: %3 = load i16, i16 addrspace(3)* @lds.1.ptr, align 2
				; CHECK: %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
				; CHECK: %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
				; CHECK: %id = call i32 @llvm.amdgcn.workitem.id.x()
				; CHECK: %my.tmp = sub i32 %id, %arg
				; CHECK: br label %bb1
				bb:
				%id = call i32 @llvm.amdgcn.workitem.id.x()
				%my.tmp = sub i32 %id, %arg
				br label %bb1

				; CHECK-LABEL: bb1:
				; CHECK: %lsr.iv = phi i32 [ undef, %bb ], [ %my.tmp2, %Flow ]
				; CHECK: %6 = icmp ne i32 addrspace(3)* inttoptr (i32 4 to i32 addrspace(3)*), %5
				; CHECK: %lsr.iv.next = add i32 %lsr.iv, 1
				; CHECK: %cmp0 = icmp slt i32 %lsr.iv.next, 0
				; CHECK: br i1 %cmp0, label %bb4, label %Flow
				bb1:
				%lsr.iv = phi i32 [ undef, %bb ], [ %my.tmp2, %Flow ]
				%lsr.iv.next = add i32 %lsr.iv, 1
				%cmp0 = icmp slt i32 %lsr.iv.next, 0
				br i1 %cmp0, label %bb4, label %Flow

				; CHECK-LABEL: bb4:
				; CHECK: %load = load volatile i32, i32 addrspace(1)* undef, align 4
				; CHECK: %cmp1 = icmp sge i32 %my.tmp, %load
				; CHECK: br label %Flow
				bb4:
				%load = load volatile i32, i32 addrspace(1)* undef, align 4
				%cmp1 = icmp sge i32 %my.tmp, %load
				br label %Flow

				; CHECK-LABEL: Flow:
				; CHECK: %my.tmp2 = phi i32 [ %lsr.iv.next, %bb4 ], [ undef, %bb1 ]
				; CHECK: %my.tmp3 = phi i32 addrspace(3)* [ %2, %bb4 ], [ %5, %bb1 ]
				; CHECK: %my.tmp4 = phi i1 [ %cmp1, %bb4 ], [ %6, %bb1 ]
				; CHECK: br i1 %my.tmp4, label %bb9, label %bb1
				Flow:
				%my.tmp2 = phi i32 [ %lsr.iv.next, %bb4 ], [ undef, %bb1 ]
				%my.tmp3 = phi i32 addrspace(3)* [ @lds.2, %bb4 ], [ @lds.1, %bb1 ]
				%my.tmp4 = phi i1 [ %cmp1, %bb4 ], [ icmp ne (i32 addrspace(3)* inttoptr (i32 4 to i32 addrspace(3)*), i32 addrspace(3)* @lds.1), %bb1 ]
				br i1 %my.tmp4, label %bb9, label %bb1

				; CHECK-LABEL: bb9:
				; CHECK: store volatile i32 7, i32 addrspace(3)* undef, align 4
				; CHECK: ret void
				bb9:
				store volatile i32 7, i32 addrspace(3)* undef
				ret void
				}

				; CHECK-LABEL: @k0
				; CHECK: %1 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %2 = icmp eq i32 %1, 0
				; CHECK: br i1 %2, label %3, label %4
				;
				; CHECK-LABEL: 3:
				; CHECK: store i16 ptrtoint (i32 addrspace(3)* @lds.2 to i16), i16 addrspace(3)* @lds.2.ptr, align 2
				; CHECK: store i16 ptrtoint (i32 addrspace(3)* @lds.1 to i16), i16 addrspace(3)* @lds.1.ptr, align 2
				; CHECK: br label %4
				;
				; CHECK-LABEL: 4:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: call void @f0(i32 %arg)
				; CHECK: ret void
				define amdgpu_kernel void @k0(i32 %arg) {
				call void @f0(i32 %arg)
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x()

Revision Contents

Diff 353276
