This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Initialize LDS pointers after alloca, but before call.
Abandoned · Public

Authored by hsmhsm on Sep 10 2021, 4:50 AM.


Details

Reviewers

JonChesterfield
rampitec
arsenm
ronlieb
foad
t-tye
b-sumner

Summary

The LDS pointers need to be initialized within the entry basic block of
kernel(s) after all alloca, but before any call instruction. If this is
not possible, then we skip running this pass for now.

Ideally alloca can appear anywhere within the function, and the AMDGPU
backend should be able to handle it, but at the moment it cannot. Once
AMDGPU backend is able to robustly handle alloca inserted anywhere, then
this hack is no longer required.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hsmhsm created this revision. Sep 10 2021, 4:50 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others. Sep 10 2021, 4:50 AM

hsmhsm requested review of this revision. Sep 10 2021, 4:50 AM

Herald added a project: Restricted Project. Sep 10 2021, 4:50 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B123419: Diff 371881. Sep 10 2021, 5:22 AM

I don't understand the precondition. Why after every alloca but before every call?

It seems like you want to introduce N stores to addrspace(3) globals, where each store needs to precede any loads from that global. I can see 'call' as a rough approximation for 'might load', but why is alloca involved at all?

edit: the commit message should state why this precondition is required, so that people later trying to work out why we're transforming variables like this can determine whether the constraint still holds. E.g.

- why are we splitting the basic block instead of inserting a relaxed atomic store
- if an alloca outside of the entry BB is broken and this transform moves calls out of the entry block, and those calls are then inlined introducing alloca, then this transform is broken, please document whatever makes that OK

It seems likely that allowing alloca outside of the entry bb would be a trivial change (just hoist them to entry), simplify this work, and unbreak other things

In D109594#2997091, @JonChesterfield wrote:

I don't understand the precondition. Why after every alloca but before every call?

It seems like you want to introduce N stores to addrspace(3) globals, where each store needs to precede any loads from that global. I can see 'call' as a rough approximation for 'might load', but why is alloca involved at all?

It's a QOI thing. You want to try hard to leave allocas in the entry block if possible, because LLVM convention is that allocas in the entry block are static (allocated very efficiently on entry to the function) and allocas elsewhere are dynamic (allocated less efficiently, has knock-on effects like forcing you to reserve a frame pointer register, possibly not even supported by the AMDGPU backend?). See also D108971 for some prior discussion.
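The static versus dynamic distinction described in this comment can be sketched with a toy model. This is only an illustration: `ToyAlloca`, `isStaticToyAlloca` and `countDynamic` are hypothetical names, not LLVM API, and the model assumes that "static" means "in the entry block and of constant size".

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for an alloca; this is NOT llvm::AllocaInst.
struct ToyAlloca {
  int blockIndex;       // 0 means the entry block (assumption of this model)
  bool hasConstantSize; // false models a runtime-sized alloca
};

// Assumed convention: an alloca is "static" only when it sits in the entry
// block and its size is a compile-time constant.
bool isStaticToyAlloca(const ToyAlloca &A) {
  assert(A.blockIndex >= 0 && "block index must be non-negative");
  return A.blockIndex == 0 && A.hasConstantSize;
}

// Counts the allocas that would need dynamic stack handling in this model.
int countDynamic(const std::vector<ToyAlloca> &Allocas) {
  int N = 0;
  for (const ToyAlloca &A : Allocas)
    if (!isStaticToyAlloca(A))
      ++N;
  return N;
}
```

In this model, an alloca that a transform pushes out of the entry block (for example by splitting the block above it) changes from static to dynamic even though its size is unchanged, which is the cost being discussed.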

In D109594#2997110, @foad wrote:

It's a QOI thing. You want to try hard to leave allocas in the entry block if possible, because LLVM convention is that allocas in the entry block are static...

That sounds right. However, this transform, by moving calls out of the entry block, will itself have that effect if those calls are inlined.

Alternatives are to emit the stores (probably as relaxed atomic) in the entry block, such that every lane executes it but we don't split the CFG, or to add a fairly late pass that hoists alloca into the entry bb.

I'd be inclined to do both. Hoisting 'dynamic' alloca into entry will fix some miscompilation (I haven't looked recently, but ~ 6 months ago alloca outside of entry was an error in the backend) and/or make things faster. Emitting the store from all lanes instead of branching means, well, less branching, but also we don't rearrange the entry block into a CFG.

If an atomic store of a uniform value is better expressed as masking off all lanes but one, I suspect we're better off doing that transform once exec is available for manipulation. Somewhere in MIR.

In D109594#2997132, @JonChesterfield wrote:

In D109594#2997110, @foad wrote:

It's a QOI thing. You want to try hard to leave allocas in the entry block if possible, because LLVM convention is that allocas in the entry block are static...

That sounds right. However, this transform, by moving calls out of the entry block, will itself have that effect if those calls are inlined.

Alternatives are to emit the stores (probably as relaxed atomic) in the entry block, such that every lane executes it but we don't split the CFG, or to add a fairly late pass that hoists alloca into the entry bb.

I'd be inclined to do both. Hoisting 'dynamic' alloca into entry will fix some miscompilation (I haven't looked recently, but ~ 6 months ago alloca outside of entry was an error in the backend) and/or make things faster. Emitting the store from all lanes instead of branching means, well, less branching, but also we don't rearrange the entry block into a CFG.

If an atomic store of a uniform value is better expressed as masking off all lanes but one, I suspect we're better off doing that transform once exec is available for manipulation. Somewhere in MIR.

Often, (probably because of too many things on the one's plate) we tend to forget what we discussed and finalized in the past except one who actually works on it.

After lengthy discussion of possible ways of initializing LDS pointers at the entry block of kernels (including relaxed atomic one) we finalized about 0th lane from each wave to do initialization (which is nothing but the stores that you indicated above). So I did not take the route of relaxed atomic approach.

Yes, the stores to the LDS pointers should happen within the entry basic block of the kernel, before the pointers are loaded back within non-kernel functions (that is, before any call within the kernel).

Now, we need to do these stores after the allocas; otherwise the allocas will be moved out of the entry basic block into the newly split block, which will have adverse side effects.

By the time this pass runs, the inlining pass has already run. Still, to be safe, I check for the possibility of an alloca living somewhere after a call (in the entry block) and avoid running this pass in that case, which is a safe bet from the correctness point of view.

As Jay mentioned, though in theory an alloca can live anywhere within the function, it is better and safer to cluster them at the beginning of the entry block, and that is what usually happens. Please refer to (1) https://lists.llvm.org/pipermail/llvm-dev/2015-September/090168.html (2) https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas
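The lane-0 selection mentioned in this comment corresponds to the `mbcnt.lo(-1, 0) == 0` test in the patch. The sketch here is a software model of that test under an assumed reading of the intrinsic (count the mask bits below the current lane, then add the base); `toyMbcntLo` and `isInitLane` are hypothetical helpers, not LLVM or hardware API.

```cpp
#include <cassert>
#include <cstdint>

// Software model of a "masked bit count" over the low 32 lanes: counts the
// bits of Mask that sit below LaneId, then adds Base. The semantics are an
// assumption made for this illustration.
uint32_t toyMbcntLo(uint32_t Mask, uint32_t Base, unsigned LaneId) {
  assert(LaneId < 64 && "model covers waves of up to 64 lanes");
  // Lanes 32 and above see every one of the low 32 mask bits.
  uint32_t Below = LaneId >= 32 ? 0xFFFFFFFFu : ((1u << LaneId) - 1u);
  uint32_t Bits = Mask & Below;
  uint32_t Count = 0;
  for (; Bits != 0; Bits &= Bits - 1) // clear the lowest set bit each step
    ++Count;
  return Count + Base;
}

// True only for the lane on which the guarded initialization would run.
bool isInitLane(unsigned LaneId) {
  return toyMbcntLo(0xFFFFFFFFu, 0, LaneId) == 0;
}
```

Under this model only lane 0 passes the comparison, so the guarded store runs once per wave, at the price of a conditional branch that splits the entry block.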

arsenm added inline comments. Sep 13 2021, 10:58 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
443	I don't like this arbitrary condition. Why not just collect all allocas and re-insert them at the start of the entry block?

hsmhsm marked an inline comment as done. Sep 13 2021, 6:01 PM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
443	I do not like it either. But I do not know how neat or how quick an implementation of moving all allocas to the entry block would be. While researching, I found in some LLVM discussion that there was an effort (not sure if it was a pass) to do that generically, but it was later removed for some (genuine) reason. Considering the necessity of keeping allocas in the entry block from the LLVM point of view, I think someone would have done it already if it were feasible. So I do not want to deviate again from my main task of the LDS work and keep it on hold for an unknown time, as it has already been delayed for so long for one reason or another. That said, it is on my plate to investigate alloca handling in general (for LLVM) and the constraints on it in the AMDGPU backend, and to fix it as neatly as possible. But it will not be a one- or two-day task, and I cannot hold up LDS until this alloca task completes. Finally, as said, it is a temporary hack, and I will remove it once I fix the alloca handling.

hsmhsm marked an inline comment as done. Sep 13 2021, 7:13 PM

Since this isn’t required for correctness, this should just skip the allocas clustered at the beginning. There’s no need to scan beyond or consider calls
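A minimal sketch of the suggestion in this comment, on a toy instruction list rather than real IR: the insertion point is simply the first instruction after the run of allocas at the top of the entry block, with no scan for later allocas or calls. `ToyOp` and `skipLeadingAllocas` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction kinds; a stand-in for real IR instructions.
enum class ToyOp { Alloca, Call, Other, Ret };

// Returns the index of the first instruction that is not part of the alloca
// run at the top of the block. Later allocas are deliberately ignored.
std::size_t skipLeadingAllocas(const std::vector<ToyOp> &Entry) {
  assert(!Entry.empty() && Entry.back() == ToyOp::Ret &&
         "toy block must end in a terminator");
  std::size_t I = 0;
  while (Entry[I] == ToyOp::Alloca) // stops at the terminator at the latest
    ++I;
  return I;
}
```

For a block that starts with a call followed by an alloca, this returns index 0, so the pointer initialization would still be inserted and the trailing alloca would move into the split-off block.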

In D109594#2998933, @arsenm wrote:

Since this isn’t required for correctness, this should just skip the allocas clustered at the beginning. There’s no need to scan beyond or consider calls

That is what I did in the patch https://reviews.llvm.org/D108971, and then after enabling this LDS pass at https://reviews.llvm.org/D109062 we got some OpenMP build-bot test failures, and I had to revert the patches. So I added code to scan for allocas beyond calls to avoid surprising failures. We really need to take up this alloca work as a separate task, and I do not think there is any quick fix for it. Until then, for LDS we need to live with this hack.

In D109594#2997752, @hsmhsm wrote:

Often, (probably because of too many things on the one's plate) we tend to forget what we discussed and finalized in the past except one who actually works on it.

After lengthy discussion of possible ways of initializing LDS pointers at the entry block of kernels (including relaxed atomic one) we finalized about 0th lane from each wave to do initialization (which is nothing but the stores that you indicated above). So I did not take the route of relaxed atomic approach.

This doesn't sound right. I remember a preference for dropping to lane 0 on power efficiency grounds. I also remember a proposal to do that by modifying the exec mask in IR. I don't recall any discussion of the costs of splitting the entry block, or a discussion of the consequences of that if calls later in the entry block are subsequently inlined. The previous round of apply/fail/revert, and how you're trying to work around it here, suggests that is indeed a hazard.

Instead of adding, in your words, "temporary hacks", let's fix the alloca lowering so it cannot break us here or elsewhere. By iterating over the function and moving all alloca into the entry block, or by doing the call frame setup thing in the back end. Just start from the error message from llc and work backwards.

I'd also prefer we make the branch vs power tradeoff in MIR instead of IR as we generally try to avoid splitting basic blocks (since doing so usually blocks optimisations), but care less about that than about not miscompiling openmp.

In D109594#2999182, @JonChesterfield wrote:

In D109594#2997752, @hsmhsm wrote:

Often, (probably because of too many things on the one's plate) we tend to forget what we discussed and finalized in the past except one who actually works on it.

After lengthy discussion of possible ways of initializing LDS pointers at the entry block of kernels (including relaxed atomic one) we finalized about 0th lane from each wave to do initialization (which is nothing but the stores that you indicated above). So I did not take the route of relaxed atomic approach.

This doesn't sound right. I remember a preference for dropping to lane 0 on power efficiency grounds. I also remember a proposal to do that by modifying the exec mask in IR. I don't recall any discussion of the costs of splitting the entry block, or a discussion of the consequences of that if calls later in the entry block are subsequently inlined. The previous round of apply/fail/revert, and how you're trying to work around it here, suggests that is indeed a hazard.

Instead of adding, in your words, "temporary hacks", let's fix the alloca lowering so it cannot break us here or elsewhere. By iterating over the function and moving all alloca into the entry block, or by doing the call frame setup thing in the back end. Just start from the error message from llc and work backwards.

I'd also prefer we make the branch vs power tradeoff in MIR instead of IR as we generally try to avoid splitting basic blocks (since doing so usually blocks optimisations), but care less about that than about not miscompiling openmp.

- The proposal to do LDS pointer initialization by modifying the exec mask in (LLVM) IR is completely ruled out because it is hacky. At the IR level there is no neat way of modifying and maintaining the exec mask (please check with Tony; he is clearly opposed to it).
- I did not know a priori that splitting the entry basic block (because we want only lane 0 to do the initialization) would result in surprising failures like this. I realized it only after the implementation and the test failures. So discussing it a priori was not possible.
- Fixing alloca lowering will be a completely separate task, and I do not even know the complexity of that work at the moment. I also have my own doubt: if moving all allocas to the beginning of the entry basic block is as simple as you are suggesting here, why has no one from the LLVM community attempted it to date, considering the importance of keeping allocas at the beginning of the entry basic block for better optimization?
- Moving to work on alloca means this LDS pass needs to wait until then, and I am not sure we really have that luxury of time considering the importance of this pass. So I came up with this temporary patch until we clean up alloca handling, which is a valid attempt in my opinion. That said, if all of us, including the higher technical management team, are okay with this patch waiting for an arbitrary time, I am fine with that too.
- This is a late LLVM IR pass; I do not see how it is linked to MIR.

Modifying exec from IR is right out. I was referring back to the internal discussion on how to implement this where that was briefly considered.

Given we now know more than when the original decision was taken, let us revisit that decision and not split the block. MIR is a point where this and all other uniform stores could be optimised for power consumption by masking exec, without changing IR optimisations.

Moving all alloca to entry was proposed on the mailing list some years ago and implemented in at least one out of tree target. I suspect it is not already done in tree because codegen is more efficient on most CPU targets without it. I doubt it is especially complicated to implement - the entry block dominates the other blocks in the function.
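The hoisting idea in this comment can be sketched on a toy function representation. This is a rough illustration only: `ToyInst`, `ToyBlock` and `hoistAllocas` are hypothetical, the model moves only constant-size allocas, and it ignores concerns a real IR pass would have to consider (for example allocas whose storage is expected to be distinct per loop iteration, or stack save/restore regions).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction; a stand-in for real IR, not LLVM API.
struct ToyInst {
  enum Kind { Alloca, Call, Other, Ret } kind;
  bool constantSize; // only meaningful for Alloca
};
using ToyBlock = std::vector<ToyInst>;

// Moves every constant-size alloca to the top of Blocks[0], keeping their
// relative order. Returns how many allocas were placed at the top.
std::size_t hoistAllocas(std::vector<ToyBlock> &Blocks) {
  assert(!Blocks.empty() && "function needs an entry block");
  ToyBlock Hoisted;
  for (ToyBlock &BB : Blocks) {
    ToyBlock Rest;
    for (const ToyInst &I : BB) {
      if (I.kind == ToyInst::Alloca && I.constantSize)
        Hoisted.push_back(I); // collected for the entry block
      else
        Rest.push_back(I); // stays where it was
    }
    BB = Rest;
  }
  Blocks[0].insert(Blocks[0].begin(), Hoisted.begin(), Hoisted.end());
  return Hoisted.size();
}
```

Because the entry block dominates every other block, each use of a hoisted alloca is still dominated by its definition in this model; runtime-sized allocas are left in place.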

I'm not convinced the transform being enabled here is necessary, and have previously outlined a variety of alternatives which you remain unwilling to consider. My interest here is solely in avoiding your "temporary hacks" breaking openmp.

In D109594#2999421, @JonChesterfield wrote:

Modifying exec from IR is right out. I was referring back to the internal discussion on how to implement this where that was briefly considered.

Given we now know more than when the original decision was taken, let us revisit that decision and not split the block. MIR is a point where this and all other uniform stores could be optimised for power consumption by masking exec, without changing IR optimisations.

If you think so, then let's have an internal discussion involving all the key folks and take a decision. I am fine with whatever decision is taken.

Moving all alloca to entry was proposed on the mailing list some years ago and implemented in at least one out of tree target. I suspect it is not already done in tree because codegen is more efficient on most CPU targets without it. I doubt it is especially complicated to implement - the entry block dominates the other blocks in the function.

That is only your suspicion, and I have my own. We will not have a clear answer until we ask the community itself. Let me try that too. We do not know how the out-of-tree target implemented it; it could be a hack with some serious constraints that just happens to work for them.

I'm not convinced the transform being enabled here is necessary, and have previously outlined a variety of alternatives which you remain unwilling to consider. My interest here is solely in avoiding your "temporary hacks" breaking openmp.

I am not sure which of the *variety* of alternatives discussed here are quickly and neatly implementable. Can you outline them here again?

hsmhsm added reviewers: t-tye, b-sumner. Sep 14 2021, 5:02 AM

arsenm added inline comments. Sep 14 2021, 5:50 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
195	All this patch should do is skip the insertion point past the allocas clustered at the start of the entry block. If there are further allocas in the program which may be broken, you should not be concerning yourself with them. I don't want to add a workaround for non-entry allocas here

Abandoning this patch, since it is no longer relevant as per internal email discussion. Submitted a new patch to enable the LDS pointer replacement pass at https://reviews.llvm.org/D109870

hsmhsm mentioned this in D109870: [AMDGPU] Enable the pass "amdgpu-replace-lds-use-with-pointer". Sep 21 2021, 2:19 AM

Revision Contents

Path (Size)

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp (48 lines)
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-alloca-after-call.ll (35 lines)
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-split-entry-bb-after-alloca.ll (61 lines)

Diff 371881

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp

(104 lines hidden)
 #include <vector>

 #define DEBUG_TYPE "amdgpu-replace-lds-use-with-pointer"

 using namespace llvm;

 namespace {

+static bool hasAllocaAfterCall(Function *K) {
+  auto &EBB = K->getEntryBlock();
+  auto *TI = EBB.getTerminator();
+  BasicBlock::iterator Iter(&(*(EBB.getFirstInsertionPt())));
+
+  while (!isa<CallBase>(*Iter) && (&*Iter != TI))
+    ++Iter;
+  if (&*Iter == TI)
+    return false;
+
+  while (!isa<AllocaInst>(*Iter) && (&*Iter != TI))
+    ++Iter;
+  if (&*Iter == TI)
+    return false;
+
+  return true;
+}
+
 class ReplaceLDSUseImpl {
   Module &M;
   LLVMContext &Ctx;
   const DataLayout &DL;
   Constant *LDSMemBaseAddr;

   DenseMap<GlobalVariable *, GlobalVariable *> LDSToPointer;
   DenseMap<GlobalVariable *, SmallPtrSet<Function *, 8>> LDSToNonKernels;
(65 lines hidden) class ReplaceLDSUseImpl {
   // the LDS pointer initialization, and return newly created basic block.
   BasicBlock *activateLaneZero(Function *K) {
     // If the entry basic block of kernel K is already splitted, then return
     // newly created basic block.
     auto BasicBlockEntry = KernelToInitBB.insert(std::make_pair(K, nullptr));
     if (!BasicBlockEntry.second)
       return BasicBlockEntry.first->second;

-    // Split entry basic block of kernel K.
-    auto *EI = &(*(K->getEntryBlock().getFirstInsertionPt()));
-    IRBuilder<> Builder(EI);
+    // Split entry basic block of kernel K just after alloca.
+    //
+    // Find the split point just after alloca.
+    auto &EBB = K->getEntryBlock();
+    auto *EI = &(*(EBB.getFirstInsertionPt()));
+    BasicBlock::reverse_iterator RIT(EBB.getTerminator());
+    while (!isa<AllocaInst>(*RIT) && (&*RIT != EI))
+      ++RIT;
+    if (isa<AllocaInst>(*RIT))
+      --RIT;
+
+    // Split entry basic block.
+    IRBuilder<> Builder(&*RIT);

     Value *Mbcnt =
         Builder.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_lo, {},
                                 {Builder.getInt32(-1), Builder.getInt32(0)});
     Value *Cond = Builder.CreateICmpEQ(Mbcnt, Builder.getInt32(0));
     Instruction *WB = cast<Instruction>(
         Builder.CreateIntrinsic(Intrinsic::amdgcn_wave_barrier, {}, {}));

(192 lines hidden) bool ReplaceLDSUseImpl::replaceLDSUse() {
   if (KernelToCallees.empty()) {
     // Either module does not have any kernel definitions, or none of the kernel
     // has a call to non-kernel functions, or we could not resolve any of the
     // call sites to proper non-kernel functions, because of the situations like
     // inline asm calls. Nothing to replace.
     return false;
   }

+  // The LDS pointers need to be initialized within the entry basic block of
+  // kernel(s) after all alloca, but before any call instruction. If this is not
+  // possible, then we skip running this pass.
+  //
+  // FIXME: Ideally alloca can appear anywhere within the function, and the
+  // AMDGPU backend should be able to handle it, but at the moment it cannot.
+  // Once AMDGPU backend is able to robustly handle alloca inserted anywhere,
+  // then this hack is no longer required.
+  for (auto KI = KernelToCallees.begin(), KE = KernelToCallees.end(); KI != KE;
+       ++KI) {
+    if (hasAllocaAfterCall(KI->first))
+      return false;
+  }
+
   // For every LDS from collected LDS globals set, replace its non-kernel
   // function scope use by pointer.
   bool Changed = false;
   for (auto *GV : LDSGlobals)
     Changed |= replaceLDSUse(GV);

   return Changed;
 }
(48 lines hidden)

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-alloca-after-call.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There is one lds global defined here, and this lds is used within a single non-kernel
				; function, as an operand of nested constant expression, and this non-kernel function is
				; reachable from kernel. But, call to non-kernel function happens after alloca. Hence,
				; pointer replacement is not triggered.
				;
				; Original LDS should exist.
				; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
				@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4

				; Pointer should not be created.
				; CHECK-NOT: @used_only_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				define void @f0(i32 %x) {
				; CHECK-LABEL: entry:
				; CHECK: store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_func, i32 0, i32 0) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_func, i32 0, i32 0) to i32*) to i64)) to i32*), align 4
				; CHECK: ret void
				entry:
				store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

				define amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: call void @f0(i32 0)
				; CHECK: %a = alloca i64, align 8, addrspace(5)
				; CHECK: ret void
				entry:
				call void @f0(i32 0)
				%a = alloca i64, align 8, addrspace(5)
				ret void
				}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-split-entry-bb-after-alloca.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There is one lds global defined here, and this lds is used within a single non-kernel
				; function, as an operand of nested constant expression, and this non-kernel function is
				; reachable from kernel. Hence nested constant expression should be converted into a
				; series of instructions and pointer replacement should take place.
				;
				; Further the entry basic block of the kernel @k0 contains alloca instruction. Hence the
				; entry basic splitting for pointer initialization should happen after alloca.
				;

				; Original LDS should exist.
				; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
				@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4

				; Pointers should be created.
				; CHECK: @used_only_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define void @f0(i32 %x) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @used_only_within_func.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i64
				; CHECK: %6 = add i64 %5, %5
				; CHECK: %7 = inttoptr i64 %6 to i32*
				; CHECK: store i32 %x, i32* %7, align 4
				; CHECK: ret void
				entry:
				store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

				; Pointer initialization code should be added
				define amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = alloca i64, align 8, addrspace(5)
				; CHECK: %1 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %2 = icmp eq i32 %1, 0
				; CHECK: br i1 %2, label %3, label %4
				;
				; CHECK-LABEL: 3:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @used_only_within_func to i16), i16 addrspace(3)* @used_only_within_func.ptr, align 2
				; CHECK: br label %4

				; CHECK-LABEL: 4:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: %5 = addrspacecast i64 addrspace(5)* %0 to i64*
				; CHECK: call void @f0(i32 0)
				; CHECK: ret void
				entry:
				%0 = alloca i64, align 8, addrspace(5)
				%1 = addrspacecast i64 addrspace(5)* %0 to i64*
				call void @f0(i32 0)
				ret void
				}