This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Replace non-kernel function uses of LDS globals by pointers.
ClosedPublic

Authored by hsmhsm on May 26 2021, 10:14 PM.

Details

Summary

The main motivation behind pointer replacement of LDS uses within non-kernel
functions is to *avoid* having the subsequent LDS lowering pass directly pack
LDS (assume a large LDS) into a struct type, which would otherwise allocate
huge memory for the struct instance within every kernel.

Diff Detail

Event Timeline

hsmhsm created this revision.May 26 2021, 10:14 PM
hsmhsm requested review of this revision.May 26 2021, 10:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 26 2021, 10:14 PM

I am planning to take up handling of *bit-casted function pointers* (while collecting reachable callees for kernels) as a separate patch on top of this one.

The reasons for this are:

(1) *Not* handling *bit-casted function pointers* does not break any functionality - when we fail to collect reachable callees because of this, the lowering phase directly lowers LDS instead of pointers, so nothing breaks from a semantics perspective.
(2) I want this base patch to be upstreamed first before handling corner cases. It will also help us mitigate LDS memory overhead to a greater extent when the ModuleLDSLowering code is shipped in coming ROCm releases.

hsmhsm updated this revision to Diff 348176.May 27 2021, 12:05 AM

Fixed a few clang-tidy warnings.

foad added inline comments.May 27 2021, 2:34 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
44

You say "all reachable ... nodes" but then you only iterate over an SCC, which is different, and will miss some reachable nodes.

If you really want all reachable nodes then use depth_first_ext which will do it all for you. (See EliminateUnreachableBlocks in BasicBlockUtils.cpp for an example of how to use it.)

hsmhsm added inline comments.May 27 2021, 4:46 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
44

The comment is a bit vague. What I actually meant is: "In a call graph, for a given (caller) node, collect all reachable *callee* nodes".

Here, "reachable callee nodes" includes those called within the callees of the caller, and so on. For example, say f1 calls f2, f2 calls f3, and f3 calls f4. Then the reachable callee set for f1 is {f2, f3, f4}.

So, given f1, won't the SCC iterator collect {f2, f3, f4} for me?

What are the cases that it cannot handle?

foad added inline comments.May 27 2021, 5:01 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
44

See https://en.wikipedia.org/wiki/Strongly_connected_component

f4 will only be in the same SCC as f1 if there is a path from f1 to f4 and back to f1 again.

hsmhsm added inline comments.May 27 2021, 5:45 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
44

But that is fine; it does not matter. What we are fundamentally doing here is: we collect all SCCs reachable from a given kernel, then we collect all the nodes within those collected SCCs.

Actually, I am not sure I get you here. Let me think about it further, in case I am missing an obvious point.

hsmhsm added inline comments.May 27 2021, 6:09 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
44

Further, I experimented with the graph that you mentioned:

f1 calls f2
f1 calls f4

f2 calls f3

f4 calls f1

For this graph, yes, we end up with three SCCs and we collect the nodes from all three SCCs. Hence we have collected all callees reachable from f1.

Start node: f1

SCC1:
        f3
SCC2:
        f2
SCC3:
        f4
        f1

Reachable nodes from f1:
        f3
        f2
        f4
        f1
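As a side note, the distinction being discussed here (SCC membership vs. reachability) can be modeled in a few lines of Python. This is an illustrative sketch only, not LLVM code; the graph is the example from this thread:

```python
# Tiny model of the call graph discussed above: f1->f2, f1->f4, f2->f3, f4->f1.
graph = {"f1": ["f2", "f4"], "f2": ["f3"], "f3": [], "f4": ["f1"]}

def reachable(g, start):
    # Plain depth-first search: everything transitively callable from `start`.
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(g[n])
    return seen

def scc_of(g, node):
    # A node m is in the same SCC as `node` only if each reaches the other.
    return {m for m in g if node in reachable(g, m) and m in reachable(g, node)}

print(sorted(reachable(graph, "f1")))  # ['f1', 'f2', 'f3', 'f4']
print(sorted(scc_of(graph, "f1")))     # ['f1', 'f4']
```

Note that f2 and f3 are reachable from f1 but are not in f1's SCC, which is why iterating a single SCC can miss reachable callees, while collecting the nodes of every reachable SCC (as the patch does) finds all of them.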

Note the failure: LLVM.CodeGen/AMDGPU::llc-pipeline.ll

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
312

It is not strictly required. I would just add it from TM.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
123

auto UsedList. Fewer updates if the return type changes.

136

It is a local variable; no need to clean it up. It will be destroyed.

141

"Require" is too strong a word here. It is just not beneficial.

152

return !LDSToNonKernels[GV].empty();

169

Technically it can be less. Some ASICs have 32K.

263

Can you name these somehow more readably? Entry, Entry2... are not speaking names.

310

That's a copy of the set; you could use a reference.

hsmhsm updated this revision to Diff 348724.May 30 2021, 9:43 PM

Fixed a few of the review comments by Stas.

hsmhsm marked 8 inline comments as done.May 30 2021, 9:45 PM

Note the failure: LLVM.CodeGen/AMDGPU::llc-pipeline.ll

Yes, I am monitoring it. I do not yet understand why it is failing; I need to look into it further.

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

hsmhsm updated this revision to Diff 348799.May 31 2021, 7:40 AM

Rebased.

hsmhsm updated this revision to Diff 348935.Jun 1 2021, 4:33 AM

Rebased.

hsmhsm updated this revision to Diff 348943.Jun 1 2021, 5:10 AM

Fixed clang-tidy warning.

hsmhsm updated this revision to Diff 349021.Jun 1 2021, 10:53 AM

Rebased.

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

I do not remember, but probably we can omit it since it is a single store to read-only memory. Anyway, a confirmation from @t-tye would be nice.

arsenm added inline comments.Jun 1 2021, 2:58 PM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
29

Typo runnning

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
148–151

Why not check this first?

201

The compiler should never introduce ptrtoint. This should offset from a base address with getelementptr

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
243

This should really be a generic utility function

hsmhsm updated this revision to Diff 349175.Jun 1 2021, 10:03 PM

Fixed a few of the review comments by Matt.

hsmhsm marked 3 inline comments as done.Jun 1 2021, 10:09 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
201

Actually I am wondering how to do it. "inttoptr" can be replaced with getelementptr. On the other hand, how do I achieve it for "ptrtoint"?

getelementptr gives a pointer value, not an integer value; what I want here is an integer value (the pointer converted to an integer). So, how do I get an integer value from getelementptr?

hsmhsm updated this revision to Diff 349301.Jun 2 2021, 9:32 AM

Fixed missing case of lds use within phi instruction.

hsmhsm updated this revision to Diff 349318.Jun 2 2021, 10:28 AM

When a constant's user list becomes empty, drop references from the constant.

hsmhsm updated this revision to Diff 349333.Jun 2 2021, 11:17 AM

Reverted previous fix (drop references from constant) and rebased.

hsmhsm updated this revision to Diff 349451.Jun 2 2021, 9:50 PM

Rebased.

rampitec added inline comments.Jun 3 2021, 1:32 PM
llvm/include/llvm/IR/ReplaceConstant.h
33

Should this utility be reviewed and submitted separately, since it is common code?

hsmhsm marked an inline comment as done.Jun 3 2021, 6:30 PM
hsmhsm added inline comments.
llvm/include/llvm/IR/ReplaceConstant.h
33
hsmhsm updated this revision to Diff 350176.Jun 6 2021, 11:04 PM
hsmhsm marked an inline comment as done.

Rebased.

JonChesterfield added inline comments.Jun 7 2021, 6:16 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
152

Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels.

Such a variable will still be allocated in all the kernels so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit.

rampitec added a subscriber: tra.Jun 7 2021, 1:31 PM

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

I do not remember, but probably we can omit it since it is a single store to read-only memory. Anyway, a confirmation from @t-tye would be nice.

For the record, the agreed way is to do a store from lane 0 of each wave and follow with a wave barrier.

t-tye added a comment.Jun 7 2021, 8:27 PM

You probably need to wrap all prologue LDS stores into a block to execute it only from lane 0 and add a barrier after. @t-tye correct me if I am wrong.

But I remember that we had decided to avoid a barrier here, and instead just make sure that each thread within each wave executes the store instructions? In any case, let me clarify it with @t-tye and @b-sumner.

I do not remember, but probably we can omit it since it is a single store to read-only memory. Anyway, a confirmation from @t-tye would be nice.

For the record, the agreed way is to do a store from lane 0 of each wave and follow with a wave barrier.

I agree with @rampitec, although we did suggest measuring to confirm that the work-group barrier and a single wave's lane 0 is not faster than multiple waves's lane 0 and a wave barrier.

We also observed that using a wave barrier is UB in the language memory model, although well defined in the AMDGPU hardware memory model. However, the current AMD GPU sync-scope definition implies the language rules and so would be an issue if any future atomic optimizations exploited that.

hsmhsm updated this revision to Diff 350487.Jun 7 2021, 9:24 PM

Rebased.

b-sumner added a comment.EditedJun 8 2021, 7:24 AM

Two approaches for limiting the stores to lane 0 of each wave:

  1. Write 1 to exec mask, store, and write -1 to exec mask. This works since the exec mask at the start of the wave when this happens is -1
  2. Check for lane == 0 and branch. The lane can be computed by a) wave64: builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)) b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)
rampitec added inline comments.Jun 8 2021, 12:39 PM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
151

If it is empty it probably will not be returned by findVariablesToLower()?

152

Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels.

Such a variable will still be allocated in all the kernels so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit.

Sounds like a good TODO at the very least.

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42

We had problems with null before. Instruction selection will happily allocate it overlapping with the first non-null variable. That is how AMDGPULowerModuleLDSPass came to use a struct instead of null. Did you check that we are not introducing this again in the final ISA?

hsmhsm marked an inline comment as done.Jun 9 2021, 12:08 AM

Two approaches for limiting the stores to lane 0 of each wave:

  1. Write 1 to exec mask, store, and write -1 to exec mask. This works since the exec mask at the start of the wave when this happens is -1
  2. Check for lane == 0 and branch. The lane can be computed by a) wave64: builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)) b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)

OK, let's see which one is more feasible from the implementation point of view.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
151

That is true. But there is a catch. Assume the case below. Here "@lds.1" is used within non-kernel function scope, but it is also used as an initializer of the global @gptr.1, so we cannot pointer-replace it. In this case, collectNonKernelAccessorsOfLDS() returns an empty set.

@lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1

@gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8

define void @f0() {
  %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
  store i8 1, i8 addrspace(3)* %bc, align 1
  ret void
}

define amdgpu_kernel void @k0() {  
  call void @f0()
  ret void
}
152

It is on my TODO list. It requires some careful thought before handling this case. That is the reason I had not yet replied to Jon's comment.

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42

The null represents the base address of LDS memory. My understanding is that, earlier, instruction selection went wrong because of a bug where we were not actually allocating memory at all for llvm.amdgcn.module.lds but were accessing it at address 0, and instruction selection was allocating address 0 to some other LDS (since that address was not occupied), so there was an issue.

Nevertheless, let me look into the ISA and see whether we are fine or there is any issue because of null.

hsmhsm marked an inline comment as done.Jun 9 2021, 7:25 AM
hsmhsm updated this revision to Diff 350901.Jun 9 2021, 8:01 AM

Let only lane 0 of each wave do the LDS pointer initialization, followed by a wave barrier.

hsmhsm added a comment.Jun 9 2021, 8:05 AM

Two approaches for limiting the stores to lane 0 of each wave:

  1. Write 1 to exec mask, store, and write -1 to exec mask. This works since the exec mask at the start of the wave when this happens is -1
  2. Check for lane == 0 and branch. The lane can be computed by a) wave64: builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)) b) wave32: __builtin_amdgcn_mbcnt_lo(~0u, 0u)

OK, let's see, which one is more feasible from the implementation point of view.

Implemented approach (2). Here we actually do not need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)).
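To make this argument concrete, here is a small Python simulation of the mbcnt semantics as described in this thread and the ISA manual excerpt quoted later (an illustrative model, not the hardware definition):

```python
def mbcnt_lo(mask, base, lane):
    # Counts the set bits of `mask` among lanes [0, min(lane, 32)), plus `base`.
    low = mask & 0xFFFFFFFF
    return base + bin(low & ((1 << min(lane, 32)) - 1)).count("1")

def mbcnt_hi(mask, base, lane):
    # Counts the set bits of `mask` among lanes [32, lane), plus `base`.
    high = (mask >> 32) & 0xFFFFFFFF
    return base + bin(high & ((1 << max(lane - 32, 0)) - 1)).count("1")

FULL = (1 << 64) - 1  # ~0 exec mask: all 64 lanes active

# Only lane 0 sees mbcnt_lo(~0, 0) == 0; lanes 32..63 see 32, not 0,
# so the == 0 comparison still singles out lane 0 without mbcnt_hi.
lane0_only = [lane for lane in range(64) if mbcnt_lo(FULL, 0, lane) == 0]
print(lane0_only)  # [0]

# The nested hi(lo(...)) form computes the full lane position 0..63.
positions = [mbcnt_hi(FULL, mbcnt_lo(FULL, 0, lane), lane) for lane in range(64)]
print(positions == list(range(64)))  # True
```

Under this model, mbcnt_lo alone suffices to detect lane 0 even in wave64 mode, while the nested form is needed only to distinguish lanes above 31 from each other.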

hsmhsm marked an inline comment as done.Jun 9 2021, 8:20 AM
hsmhsm added inline comments.
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42

Further, I looked into the ISA and it looks good to me. For example, consider the input code below.

@lds.1 = internal addrspace(3) global i32 undef, align 4
@lds.2 = internal addrspace(3) global i64 undef, align 4

define internal void @func_uses_lds() {
entry:
  store i32 7, i32 addrspace(3)* @lds.1
  store i64 31, i64 addrspace(3)* @lds.2
  ret void
}

define protected amdgpu_kernel void @kernel_reaches_lds() {
entry:
  call void @func_uses_lds()
  ret void
}

output from pointer-replacement is:

@lds.1 = internal addrspace(3) global i32 undef, align 4
@lds.2 = internal addrspace(3) global i64 undef, align 4
@lds.1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
@lds.2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

define internal void @func_uses_lds() {
entry:
  %0 = load i16, i16 addrspace(3)* @lds.2.ptr, align 2
  %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
  %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
  %3 = load i16, i16 addrspace(3)* @lds.1.ptr, align 2
  %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
  %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
  store i32 7, i32 addrspace(3)* %5, align 4
  store i64 31, i64 addrspace(3)* %2, align 4
  ret void
}

define protected amdgpu_kernel void @kernel_reaches_lds() {
entry:
  %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %1 = icmp eq i32 %0, 0
  br i1 %1, label %2, label %3

2:                                                ; preds = %entry
  store i16 ptrtoint (i64 addrspace(3)* @lds.2 to i16), i16 addrspace(3)* @lds.2.ptr, align 2
  store i16 ptrtoint (i32 addrspace(3)* @lds.1 to i16), i16 addrspace(3)* @lds.1.ptr, align 2
  br label %3

3:                                                ; preds = %entry, %2
  call void @llvm.amdgcn.wave.barrier()
  call void @func_uses_lds()
  ret void
}

output from module (kernel) lds lowering is:

%llvm.amdgcn.module.lds.t = type { i16, i16 }
%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t = type { i64, i32 }

@llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 2
@llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
@llvm.amdgcn.kernel.kernel_reaches_lds.lds = internal addrspace(3) global %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t undef, align 8

define internal void @func_uses_lds() {
entry:
  %0 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
  %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
  %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
  %3 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
  %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
  %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
  store i32 7, i32 addrspace(3)* %5, align 4
  store i64 31, i64 addrspace(3)* %2, align 4
  ret void
}

define protected amdgpu_kernel void @kernel_reaches_lds() {
entry:
  call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
  %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %1 = icmp eq i32 %0, 0
  br i1 %1, label %2, label %5

2:                                                ; preds = %entry
  %3 = ptrtoint i64 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 0) to i16
  store i16 %3, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
  %4 = ptrtoint i32 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 1) to i16
  store i16 %4, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
  br label %5

5:                                                ; preds = %entry, %2
  call void @llvm.amdgcn.wave.barrier()
  call void @func_uses_lds()
  ret void
}

What it fundamentally means is:

  1. Address 0 holds the address of @lds.1; we load the address of @lds.1 from address 0, and then store the value 7 to @lds.1.
  2. Address 0 + 2 (offset 2) holds the address of @lds.2; we load the address of @lds.2 from address 0 + 2, and then store the value 31 to @lds.2. That is what is happening in the ISA below, produced for the above input IR.
func_uses_lds:                          ; @func_uses_lds
; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        v_mov_b32_e32 v1, 0
        ds_read_i16 v2, v1
        ds_read_i16 v3, v1 offset:2
        v_mov_b32_e32 v4, 7
        v_mov_b32_e32 v0, 31
        s_waitcnt lgkmcnt(1)
        ds_write_b32 v2, v4
        s_waitcnt lgkmcnt(1)
        ds_write_b64 v3, v[0:1]
        s_waitcnt lgkmcnt(0)
        s_setpc_b64 s[30:31]
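To make the indirection concrete, here is a toy Python model of the layout described above: 2-byte pointer slots at offsets 0 and 2 hold the locations of the real variables, and the function loads an offset before storing through it. The byte offsets of the kernel-struct members (8 and 16) are hypothetical, chosen only for illustration:

```python
import struct

# Toy byte-addressable LDS. Slots 0 and 2 mirror @lds.1.ptr / @lds.2.ptr;
# offsets 8 and 16 stand in for where @lds.2 / @lds.1 might be allocated.
lds = bytearray(32)
LDS1_PTR, LDS2_PTR, LDS2, LDS1 = 0, 2, 8, 16

# Kernel prologue (lane 0): publish the variables' offsets in the pointer slots.
struct.pack_into("<H", lds, LDS1_PTR, LDS1)
struct.pack_into("<H", lds, LDS2_PTR, LDS2)

# func_uses_lds: load each i16 offset, then store through it.
p1 = struct.unpack_from("<H", lds, LDS1_PTR)[0]
struct.pack_into("<I", lds, p1, 7)    # store i32 7 to @lds.1
p2 = struct.unpack_from("<H", lds, LDS2_PTR)[0]
struct.pack_into("<Q", lds, p2, 31)   # store i64 31 to @lds.2

print(struct.unpack_from("<I", lds, LDS1)[0])  # 7
print(struct.unpack_from("<Q", lds, LDS2)[0])  # 31
```

This mirrors the ds_read_i16 of the pointer slot followed by ds_write_b32/ds_write_b64 through the loaded offset in the ISA above.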

Implemented approach (2). Here we actually do not need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)).

As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.

@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (builtin_amdgcn_mbcnt_lo(~0u, 0u) + builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
151

That is true. But there is a catch. Assume the case below. Here "@lds.1" is used within non-kernel function scope, but it is also used as an initializer of the global @gptr.1, so we cannot pointer-replace it. In this case, collectNonKernelAccessorsOfLDS() returns an empty set.

@lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1

@gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8

define void @f0() {
  %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
  store i8 1, i8 addrspace(3)* %bc, align 1
  ret void
}

define amdgpu_kernel void @k0() {  
  call void @f0()
  ret void
}

Should not it be fixed by D103431?

152

You can leave a TODO comment here for now. Patch is already too big, so it will need a separate change anyway.

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42

So the idea is that llvm.amdgcn.module.lds will only contain converted pointers, and since it is forced to be allocated at address zero these pointers will fall into the allocated space?

We need to be super careful here, because if anything else is injected into llvm.amdgcn.module.lds by the module LDS pass it will all break.

In fact there seems to be no work left for module LDS lowering on the module after this, only on kernels. You could probably just create the module structure right here. Then module LDS lowering should skip running on the module (but not on kernels) if that variable already exists. It would also untangle the dependency between the passes. In addition you would resolve the issue with null, since you would just use that structure, and Matt's concern about ptrtoint.

rampitec added inline comments.Jun 9 2021, 11:32 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
121

Do you really need this new map? You can get a pointer to the block from KernelToLDSPointers by taking the parent of the first instruction.

rampitec added a comment.EditedJun 9 2021, 11:55 AM

Implemented approach (2). Here we actually do not need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)).

As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.

@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (builtin_amdgcn_mbcnt_lo(~0u, 0u) + builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.

Anyhow, manual says:

Example to compute each thread's position in 0..63:
v_mbcnt_lo_u32_b32 v0, -1, 0
v_mbcnt_hi_u32_b32 v0, -1, v0
// v0 now contains ThreadPosition

So it tells us to nest them. Plus it will fail on lane 32, as far as I understand.

rampitec added inline comments.Jun 9 2021, 12:28 PM
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16

This likely should not be i16. You could do it this way:

@lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i8 addrspace(3)* undef, align 2

Then use a bitcast to i8 addrspace(3)* instead of ptrtoint on the store.

But I still think that creating module lds right here is better.

hsmhsm added a comment.Jun 9 2021, 8:44 PM

Implemented approach (2). Here we actually do not need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need builtin_amdgcn_mbcnt_hi(~0u, builtin_amdgcn_mbcnt_lo(~0u, 0u)).

As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.

Actually, in wave64 mode, for any thread >= 32, mbcnt_lo returns 32, not 0. Practically speaking we can do without mbcnt_hi, but maybe it is good and safe to have mbcnt_hi for wave64; I will make changes accordingly.

@b-sumner why do you suggest to nest the hi and lo calls? I think it shall be (builtin_amdgcn_mbcnt_lo(~0u, 0u) + builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.

Anyhow, manual says:

Example to compute each thread's position in 0..63:
v_mbcnt_lo_u32_b32 v0, -1, 0
v_mbcnt_hi_u32_b32 v0, -1, v0
// v0 now contains ThreadPosition

So it tells to nest it. Plus will fail on lane 32 as far as I understand.

No, it does not fail on lane 32. For lane 32, v_mbcnt_lo returns 32, and you pass this return value from v_mbcnt_lo to v_mbcnt_hi. v_mbcnt_hi will not add any additional count (on top of 32), hence it returns 32 again, which is the lane position for lane 32.

hsmhsm updated this revision to Diff 351054.Jun 9 2021, 9:00 PM

Rebased.

hsmhsm marked an inline comment as done.Jun 9 2021, 9:21 PM
hsmhsm updated this revision to Diff 351086.Jun 10 2021, 1:22 AM

Insert @llvm.amdgcn.mbcnt.hi() for wave64 mode.

foad added a comment.Jun 10 2021, 1:25 AM

practically speaking we can do without mbcnt_hi, but may be it is good and safe to have mbcnt_hi for wave64

I am baffled by this. Why would you add an extra instruction when you have just explained that it is not required?

hsmhsm marked 4 inline comments as done.Jun 10 2021, 2:31 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
121

I think we need it. Consider the transformed code below. Here we need to track the newly introduced block st, since this is where we need to insert the store instructions. I am not sure how I could get it from KernelToLDSPointers.

define protected amdgpu_kernel void @k0() {
entry:
  %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %1 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %0)
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %st, label %ft
st:
  store i16 ptrtoint ([4 x i32] addrspace(3)* @lds to i16),  i16 addrspace(3)* @lds.ptr, align 2
  br label %ft
ft:
  call void @llvm.amdgcn.wave.barrier()
  call void @f0()
  ret void
}
151

The patch https://reviews.llvm.org/D103431 fixed cases where a few globally used LDS variables were not getting lowered. Now, here is the basic idea:

The LowerModuleLDS pass lowers LDS that is used in global scope, in non-kernel function scope, or in both. It directly packs the LDS within a struct, which we want to avoid since it wastes memory.

So, in this pointer-replacement patch, we try to replace uses of LDS by pointers so that the LowerModuleLDS pass ends up packing pointers instead of the LDS variables themselves.

However, we can only pointer-replace LDS that is used within non-kernel function scope, not LDS that is used in global scope. Now, in the above example, assume that we create a pointer to @lds.1 since it is used within non-kernel function scope, and replace its use within the function by the pointer. Then the LowerModuleLDS pass sees both the use of @lds.1 (in global scope) and the use of the pointer within the function. Hence it lowers both the LDS and its pointer, meaning it packs both @lds.1 and its pointer within the struct, which is wasteful.

Hence this pass does not touch LDS that is used both in global scope and in non-kernel function scope, and leaves that LDS to be directly lowered by the LowerModuleLDS pass, i.e. directly packed within the struct. I assume such cases are not frequent in practice.
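The selection rule described above (pointer-replace only LDS that is used solely in non-kernel function scope) can be sketched as a tiny predicate. This is a hypothetical Python model of the rule, not the pass's actual code:

```python
def should_pointer_replace(var):
    # `var` carries the usage flags discussed above (illustrative fields).
    # Replace only when all non-global uses are in non-kernel functions.
    return var["used_in_nonkernel_func"] and not var["used_in_global_scope"]

# @lds.1 from the example: also used as a global initializer -> skipped.
lds1 = {"used_in_nonkernel_func": True, "used_in_global_scope": True}
# An LDS variable used only inside device functions -> replaced.
lds2 = {"used_in_nonkernel_func": True, "used_in_global_scope": False}

print(should_pointer_replace(lds1))  # False -- left for direct lowering
print(should_pointer_replace(lds2))  # True
```

Skipped variables fall through to LowerModuleLDS unchanged, exactly as described in the paragraph above.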

152

I have kept it as a TODO and will come to it later in a separate patch.

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16

The main goal of using i16 as the pointer (with ptrtoint and GEP) is to save as much memory as possible. Using "i8 addrspace(3)*" means we would allocate the memory size of a pointer (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of the i16 type here.

Also, I do not think it is a good idea to create the module LDS right here. If that were the case, we could have put this code within the module LDS pass itself instead of having a separate pass (I did that initially; later we decided that the pointer-replacement code should go in a separate pass).

This pass is already too big; let's not change the design again and further complicate things, which would further delay the code submission and lead to other serious problems.

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll
42

As I mentioned in one of my earlier comments, we still need llvm.amdgcn.module.lds. We cannot remove it for now.

The main reason is that, for a given use of LDS within some non-kernel function, we need to make sure that the appropriate memory for the LDS is accessed based on which kernel called that function. We have not yet solved this issue; the currently available solution is llvm.amdgcn.module.lds.

But we should attempt to avoid wasting memory through llvm.amdgcn.module.lds, hence this pointer-replacement pass.

What this pass does is the following:

(1) Create an i16 pointer "lds.ptr" to "lds", where "lds" is used in some non-kernel function.
(2) Within the kernel:

lds.ptr = ptrtoint(lds)

(3) Within the non-kernel function where lds is used:

lds.addr = GEP(null, lds.ptr)   // where null is the base address of the LDS memory, and we use GEP here to avoid inttoptr
Replace the use of "lds" by "lds.addr" after appropriately bitcasting.

What it means to the LowerModuleLDS pass is the following:

(1) The use of "lds" is moved into the kernel (the pointer initialization), hence it is properly allocated within the kernel as part of a kernel-specific struct, and, more importantly, it is allocated only if there is a call from the kernel to the function. Otherwise there is no pointer initialization, hence no usage, and hence no allocation.

(2) It sees the use of "lds.ptr" within the non-kernel function, and hence it lowers this pointer by putting it within the module struct.

We still allocate memory for llvm.amdgcn.module.lds within every kernel, but now we are not wasting too much memory.

In any case, within every kernel, llvm.amdgcn.module.lds is allocated at address 0; all the other variables (belonging to the kernel) are allocated at non-zero addresses.

In this way, I do not see any problem here, unless I am missing a corner, but serious, case conflicting with address 0.

hsmhsm marked 4 inline comments as done.Jun 10 2021, 2:33 AM

practically speaking we can do without mbcnt_hi, but may be it is good and safe to have mbcnt_hi for wave64

I am baffled by this. Why would you add an extra instruction when you have just explained that it is not required?

Correct, but I myself could not confidently prove that it is always safe to use only mbcnt_lo even in wave64 mode, hence I had to insert mbcnt_hi for wave64 mode.

I believe that for the purposes of detecting lane 0 mbcnt_lo is sufficient.

Insert @llvm.amdgcn.mbcnt.hi() for wave64 mode.

Actually I think you were right. mbcnt_lo will always return 32 for any lane higher than 31.

rampitec added inline comments.Jun 10 2021, 11:31 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
121

Ok.

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16

The main goal of using i16 as the pointer type (with ptrtoint and GEP) is to save as much memory as possible. Using "i8 addrspace(3)*" means we would allocate the full pointer size (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of the i16 type here.

Should not the size of a local pointer be 2? Checked the DL: p3:32:32. Sigh!

Also, I do not think it is a good idea to create module LDS right here. If that were the case, we could have put this code within the module LDS pass itself instead of having a separate pass (I did it that way initially; later we decided that the pointer-replacement code should go in as a separate pass).

This pass is already too big; let's not change the design again and further complicate things, which would further delay the code submission and lead to other serious problems.

My concern is that you assume these i16 pointers will start at null, but that might not be the case if you rely on the module LDS pass to pack them into the struct along with other LDS and let it sort out the layout.

Can you add a test where you have these pointers and also another module LDS variable with higher alignment and size, so module LDS will pack them together? We could then inspect the output of the two passes and the ISA.

hsmhsm updated this revision to Diff 351785.Jun 13 2021, 11:32 PM

[1] Do not insert mbcnt_hi for wave64 mode; mbcnt_lo itself is enough to detect lane 0 even in wave64 mode.
[2] Add a new lit test as asked by Stas.

hsmhsm marked an inline comment as done.Jun 13 2021, 11:37 PM

I believe that for the purposes of detecting lane 0 mbcnt_lo is sufficient.

Patch is again updated accordingly.

Insert @llvm.amdgcn.mbcnt.hi() for wave64 mode.

Actually I think you were right. mbcnt_lo will always return -1 for any lane higher than 31.

Now I again updated the patch - mbcnt_lo itself is enough to detect lane 0 even in wave64 mode.
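The lane-0 guard being discussed can be sketched in IR as follows (this is the general shape, not the pass's verbatim output; for any lane other than lane 0, mbcnt_lo(~0, 0) returns a non-zero value even in wave64, so comparing against 0 is enough):

```llvm
define amdgpu_kernel void @kernel() {
entry:
  ; count active lanes below this one in the low 32 lanes of the mask;
  ; only lane 0 gets the result 0
  %id = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %is.lane0 = icmp eq i32 %id, 0
  br i1 %is.lane0, label %init, label %join

init:
  ; lane 0 initializes the i16 LDS pointer here (store elided)
  br label %join

join:
  ; all lanes wait until the pointer is written before using it
  call void @llvm.amdgcn.s.barrier()
  ret void
}

declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32)
declare void @llvm.amdgcn.s.barrier()
```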

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16

No, the concern you raise here does not arise at all. Please have a look at the new test case - replace-lds-by-ptr-lds-offsets.ll. I have given a detailed description within this test, and I hope it will resolve your doubt above.

hsmhsm marked 7 inline comments as done.Jun 14 2021, 4:08 AM
rampitec added inline comments.Jun 14 2021, 12:11 PM
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll
16

Thanks, that helps. I tracked the pointers and stores there; it seems to be consistent.

Matt raised a concern that IR passes may refuse to produce a 'noalias' answer in the presence of null, but that is another concern which we may be able to solve later. In particular, any operation on such a pointer is invariant and noalias in a given function - essentially it is just a single load - so we may just annotate it accordingly. Not required for this patch though.

rampitec accepted this revision.Jun 14 2021, 12:36 PM

LGTM. Please allow at least 24 hours for others to comment before submitting. In the meantime, PSDB and ePSDB runs will be useful.

This revision is now accepted and ready to land.Jun 14 2021, 12:36 PM
foad added inline comments.Jun 21 2021, 1:28 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
48

There is no need to find strongly connected components. This should use a df_iterator instead.

foad added inline comments.Jun 22 2021, 5:24 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
48

This may be the root cause of recent crashes. Reported to me offline, posting the stack trace here. @ronlieb may have a repro from different code.

llvm-project/llvm/include/llvm/ADT/Optional.h:96: T& llvm::optional_detail::OptionalStorage<T, <anonymous> >::getValue() & [with T = llvm::WeakTrackingVH; bool <anonymous> = false]: Assertion `hasVal' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.	Program arguments: $HOME/clang/bin/llc /tmp/zgemm-bb1d9d-gfx908-linked-b8cb9f.bc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -filetype=obj -o /tmp/zgemm-bb1d9d-gfx908-2e0957.o
1.	Running pass 'Replace within non-kernel function use of LDS with pointer' on module '/tmp/zgemm-bb1d9d-gfx908-linked-b8cb9f.bc'.
 #0 0x00007f1e8ed35ebf PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x00007f1e8ed33879 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f1e93a112d0 __restore_rt (/lib64/libpthread.so.0+0x132d0)
 #3 0x00007f1e8dec5520 raise (/lib64/libc.so.6+0x39520)
 #4 0x00007f1e8dec6b01 abort (/lib64/libc.so.6+0x3ab01)
 #5 0x00007f1e8debdb1a __assert_fail_base (/lib64/libc.so.6+0x31b1a)
 #6 0x00007f1e8debdb92 (/lib64/libc.so.6+0x31b92)
 #7 0x00007f1e8d753ad4 llvm::AMDGPU::CollectReachableCallees::collectReachableCallees(llvm::Function*) ($HOME/clang/lib/libLLVMAMDGPUUtils.so.13git+0x1fad4)
 #8 0x00007f1e8d753e27 llvm::AMDGPU::collectReachableCallees(llvm::Module&, llvm::DenseMap<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u>, llvm::DenseMapInfo<llvm::Function*>, llvm::detail::DenseMapPair<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u> > >&) ($HOME/clang/lib/libLLVMAMDGPUUtils.so.13git+0x1fe27)
 #9 0x00007f1e9cc2bf67 (anonymous namespace)::ReplaceLDSUseImpl::replaceLDSUse() AMDGPUReplaceLDSUseWithPointer.cpp:0:0
#10 0x00007f1e9cc2eec8 (anonymous namespace)::AMDGPUReplaceLDSUseWithPointer::runOnModule(llvm::Module&) AMDGPUReplaceLDSUseWithPointer.cpp:0:0
#11 0x00007f1e8f40f4e8 llvm::legacy::PassManagerImpl::run(llvm::Module&) ($HOME/clang/lib/libLLVMCore.so.13git+0x2174e8)
#12 0x0000000000416615 compileModule(char**, llvm::LLVMContext&) llc.cpp:0:0
#13 0x000000000040d596 main ($HOME/clang/bin/llc+0x40d596)
#14 0x00007f1e8deb034a __libc_start_main (/lib64/libc.so.6+0x2434a)

seeing this assert in trunk today in pass

llvm::AMDGPU::CollectReachableCallees::collectReachableCallees

rlieberm@r4:~/git/aomp12/aomp/test/smoke/helloworld$ /home/rlieberm/rocm/trunk_1.0bin/clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 helloworld.c -o helloworld
llc: /work/rlieberm/mono-repo/llvm-project/llvm/include/llvm/ADT/Optional.h:96: T& llvm::optional_detail::OptionalStorage<T, <anonymous> >::getValue() & [with T = llvm::WeakTrackingVH; bool <anonymous> = false]: Assertion `hasVal' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: /home/rlieberm/rocm/trunk_1.0bin/llc /tmp/helloworld-9e3621-gfx906-linked-3dda69.bc -O2 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -filetype=obj -o /tmp/helloworld-9e3621-gfx906-1499ac.o

1. Running pass 'Replace within non-kernel function use of LDS with pointer' on module '/tmp/helloworld-9e3621-gfx906-linked-3dda69.bc'.
 #0 0x000055e8771bf08f PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x000055e8771bc8dd SignalHandler(int) Signals.cpp:0:0
 #2 0x00007efe2440d980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
 #3 0x00007efe230befb7 raise /build/glibc-S7xCS9/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0
 #4 0x00007efe230c0921 abort /build/glibc-S7xCS9/glibc-2.27/stdlib/abort.c:81:0
 #5 0x00007efe230b048a __assert_fail_base /build/glibc-S7xCS9/glibc-2.27/assert/assert.c:89:0
 #6 0x00007efe230b0502 (/lib/x86_64-linux-gnu/libc.so.6+0x30502)
 #7 0x000055e8774c87c2 llvm::AMDGPU::CollectReachableCallees::collectReachableCallees(llvm::Function*) (/home/rlieberm/rocm/trunk_1.0bin/llc+0x20b27c2)
 #8 0x000055e8774c8b1f llvm::AMDGPU::collectReachableCallees(llvm::Module&, llvm::DenseMap<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u>, llvm::DenseMapInfo<llvm::Function*>, llvm::detail::DenseMapPair<llvm::Function*, llvm::SmallPtrSet<llvm::Function*, 8u> > >&) (/home/rlieberm/rocm/trunk_1.0bin/llc+0x20b2b1f)

instructions to build aomp from trunk and run test
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/tmp/trunk_1.0 \

-DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_PROJECTS="clang;lld" \
-DLLVM_TARGETS_TO_BUILD="X86;AMDGPU" \
-DLLVM_ENABLE_ASSERTIONS=ON                        \
-DLLVM_CCACHE_BUILD=$OPTION_CCACHE                 \
-DLLVM_ENABLE_RUNTIMES="openmp" \
-DOPENMP_ENABLE_LIBOMPTARGET_HSA=1 \
-DCLANG_DEFAULT_LINKER=lld                         \
../llvm -GNinja

ninja install

git clone http://github.com/rocm-developer-tools/aomp
cd aomp/test/smoke/helloworld
/tmp/trunk_1.0//bin/clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 helloworld.c -o helloworld

those clone instructions will yield a bunch of code, including some makefile machinery and the following C:

// helloworld.c
#include <stdio.h>
#include <omp.h>

int main(void) {
  int isHost = 1;

#pragma omp target map(tofrom: isHost)
  {
    isHost = omp_is_initial_device();
    printf("Hello world. %d\n", 100);
    for (int i =0; i<5; i++) {
      printf("Hello world. iteration %d\n", i);
    }
  }

  printf("Target region executed on the %s\n", isHost ? "host" : "device");

  return isHost;
}
JonChesterfield added a comment.EditedJun 25 2021, 4:51 PM

I am going to revert this change as the author is unlikely to be available at present (early morning Saturday), in accordance with the guidelines at https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy. It'll reapply easily once the assert is run to ground.

It breaks a lot of the smoke tests in aomp. It's also been hit externally. It reproduces on a minimal openmp example (presumably because our runtime does things with shared memory). I don't see how this change could have passed epsdb, so reverting it also helps with the merge from trunk into rocm.

edit: doesn't revert cleanly. Considering whether to leave trunk broken or not.

Git revert is unclean; went with D104962 instead. That disables the pass by default, changes the test headers for these tests to re-enable it, and deletes one test where that strategy failed.

I am seeing a run-time failure as well in addition to the compile-time failure reported earlier. Here's a test case, veccopy.c.

#include <stdio.h>
#include <stdlib.h>   // for EXIT_SUCCESS/EXIT_FAILURE, missing in the original
#include <omp.h>

int main()
{
  int N = 100000;

  int a[N];
  int b[N];

  int i;

  for (i = 0; i < N; i++) {
    a[i] = 0;
    b[i] = i;
  }

#pragma omp target teams distribute parallel for map(from: a[0:N]) map(to: b[0:N])
  for (int j = 0; j < N; j++)
    a[j] = b[j];

  int rc = 0;
  for (i = 0; i < N; i++)
    if (a[i] != b[i]) {
      rc++;
      printf("Wrong value: a[%d]=%d\n", i, a[i]);
    }

  if (!rc) {
    printf("Success\n");
    return EXIT_SUCCESS;
  } else {
    printf("Failure\n");
    return EXIT_FAILURE;
  }
}

Compile/run:
clang -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 veccopy.c -o veccopy
./veccopy
[GPU Memory Error] Addr: 0x7f60a4210000 Reason: Page not present or supervisor privilege.
Memory access fault by GPU node-1 (Agent handle: 0x168c5d0) on address 0x7f60a4210000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Adding -mllvm -amdgpu-enable-lds-replace-with-pointer=false gets rid of the problem.
Running with the env var LIBOMPTARGET_KERNEL_TRACE=1 shows that LDS usage went down from 16400 B to 32 B, but it triggered the above problem.

W.r.t. the compile-time failure, it might be because of patch D104704, which was added on top of this patch.

W.r.t. the run-time issues, I am surprised that our internal PSDB job did not catch them. We really need to look into it.

psdb (our classic testing) has zero OpenMP tests.
The ePSDB does have OpenMP testing.

A Jenkins request to run ePSDB on a Phab patch would be needed to get ePSDB testing.

After the patch has landed in trunk, the twice-daily merges from trunk to amd-stg-open will run both PSDB and ePSDB to validate the set of patches to be merged.
If the merge testing in PSDB/ePSDB fails to pass, the merge does not land in amd-stg-open.
I think our merge process is about to try and digest the LDS patch.

Jon, Dhruva, I added your two tests to our smoke-fails this morning. I see both failing as you described with a trunk build. I see both passing in this morning's amd-stg-open, both with the pass enabled and with the pass disabled.

W.r.t. the compile-time failure, it might be because of patch D104704, which was added on top of this patch.

W.r.t. the run-time issues, I am surprised that our internal PSDB job did not catch them. We really need to look into it.

Given the call stack, there is a very good chance it is D104704. Can we try to run that without D104704 and with LDS pointer replacement enabled?

foad added a comment.Jun 29 2021, 1:46 AM

Here's a .ll testcase reduced from the openmp helloworld thing:

; ModuleID = 'z.ll'
source_filename = "llvm-link"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn-amd-amdhsa"

@usedSlotIdx = internal local_unnamed_addr addrspace(3) global i32 undef, align 4

define weak amdgpu_kernel void @__omp_offloading_10302_1a22e53_main_l7() {
entry:
  %nvptx_num_threads = call i32 @__kmpc_amdgcn_gpu_num_threads()
  ret void
}

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #0

declare i32 @__kmpc_amdgcn_gpu_num_threads()

define internal void @__kmpc_kernel_init() {
entry:
  store i32 undef, i32 addrspace(3)* @usedSlotIdx, align 4, !tbaa !0
  ret void
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p5i8(i64 immarg, i8 addrspace(5)* nocapture) #1

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p5i8(i64 immarg, i8 addrspace(5)* nocapture) #1

attributes #0 = { nounwind readnone speculatable willreturn }
attributes #1 = { argmemonly nofree nosync nounwind willreturn }

!0 = !{!1, !1, i64 0}
!1 = !{!"int", !2, i64 0}
!2 = !{!"omnipotent char", !3, i64 0}
!3 = !{!"Simple C++ TBAA"}

Compile with: llc -march=amdgcn -mcpu=gfx906 -o /dev/null z.ll

It bisects to this patch D103225, not D104704.