This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Handle BitCast of GEP in promoting alloca to vector
Abandoned, Public

Authored by cfang on Apr 3 2018, 1:50 PM.

Details

Reviewers

arsenm
msearles
kzhuravl
rampitec

Summary

This patch handles the case where the pointer operand is a BITCAST of a GEP that changes the data type. Before this patch, the check stopped
at the BITCAST (pointer), so the load/store was not handled at all.

We also remove an invalid LIT test: the case it covers is not handled yet, and its check always passes with or without the promotion, because
the index is an immediate and the load/store can easily be optimized away by other optimizations.
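
As a rough illustration of why the rewrite is sound, the sketch below models the `[3 x i32]` alloca from the new tests as a plain C++ array. It is not LLVM code: the type aliases and function names are hypothetical, and it assumes 32-bit IEEE 754 floats. It compares a float load through a bitcast of an element pointer with an extract from the bitcast vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model: a [3 x i32] private array standing in for the alloca.
using I32x3 = std::array<std::uint32_t, 3>;
using F32x3 = std::array<float, 3>;

// Before promotion: load float through bitcast(gep scratch, 0, idx).
// Reads the bytes of one i32 element and reinterprets them as a float.
float loadThroughBitcastOfGep(const I32x3 &scratch, std::size_t idx) {
  float result;
  std::memcpy(&result, &scratch[idx], sizeof(result));
  return result;
}

// After promotion: bitcast <3 x i32> to <3 x float>, then extractelement.
float extractFromBitcastVector(const I32x3 &vec, std::size_t idx) {
  F32x3 asFloat;
  static_assert(sizeof(asFloat) == sizeof(vec), "element sizes must match");
  std::memcpy(asFloat.data(), vec.data(), sizeof(asFloat));
  return asFloat[idx];
}

// Compare bit patterns rather than float values so NaN payloads also match.
bool sameBits(float a, float b) {
  return std::memcmp(&a, &b, sizeof(float)) == 0;
}
```

Both functions return the same bits for any in-range index, which is the property the `write_bitcast_gep_read` test relies on when it expects an `extractelement` from a bitcast vector.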

Diff Detail

Event Timeline

cfang created this revision. Apr 3 2018, 1:50 PM

Herald added subscribers: t-tye, tpr, dstuttard and 4 others. Apr 3 2018, 1:50 PM

ping

Add reviewers and ping.

Thanks;

arsenm added inline comments. Apr 23 2018, 11:43 AM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
330	I think there's another bug here (which should be fixed in a separate patch). This probably also needs to check if it's atomic
340–341	This shouldn't need to special case GEP users. You really want something like stripPointerCasts, but not one that looks through addrspacecast. Could use a test with 0 index GEPs which should show the same problem.
348	Typo othereise
354	This looks broken. addrspacecast should forbid the vector promotion. It should only be allowed for the LDS promotion. Separate patch?
357	Same as with the load
433–436	Use dyn_cast instead of separate isa and cast
452	Same as the load case
test/CodeGen/AMDGPU/vector-alloca.ll
154	needs some more tests with multiple uses of the bitcast source, with accesses through different types as well as non-bitcast users
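
The suggestion on lines 340–341 can be pictured with a toy model. This is only a sketch of the idea of stripping address-preserving casts while stopping at addrspacecast; `Kind`, `Node`, and `stripSameAddressCasts` are hypothetical names, not LLVM types or APIs.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for an IR value; not an LLVM type.
enum class Kind { Alloca, GEP, ZeroIndexGEP, BitCast, AddrSpaceCast };

struct Node {
  Kind kind;
  const Node *operand; // pointer operand, or nullptr for an alloca
};

// Walk up through bitcasts and all-zero-index GEPs, which leave the address
// unchanged. Stop at anything else, including addrspacecast.
const Node *stripSameAddressCasts(const Node *n) {
  while (n && (n->kind == Kind::BitCast || n->kind == Kind::ZeroIndexGEP))
    n = n->operand;
  return n;
}
```

With a helper of this shape, a load through a bitcast and a load through a zero-index GEP would resolve to the same underlying GEP without a separate case for each.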

cfang added inline comments. Apr 26 2018, 2:08 PM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
340–341	I could not understand this comment. Here we are handling the case of load(bitcast(gep)). We check the bitcast and collect the load. We will convert it to a vector load later, which depends on the GEP for the index. So we need to have the GEP and collect the load/store.
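
The dependency described here, where the rewritten access needs the GEP to find its vector index, can be sketched as a lookup that first peels one bitcast. This is a toy model, not the patch's code: `Ptr` and `vectorIndexFor` are hypothetical, and the index is a plain `int` here, whereas the patch stores it as a `Value`.

```cpp
#include <cassert>
#include <map>

// Hypothetical pointer value: either a GEP (isBitCast == false) or a
// bitcast whose source is another pointer value.
struct Ptr {
  bool isBitCast;
  const Ptr *source;
};

// Return the vector index recorded for the GEP behind `p`, looking through
// at most one bitcast. Returns -1 when no index was recorded.
int vectorIndexFor(const Ptr *p, const std::map<const Ptr *, int> &gepIndex) {
  if (p->isBitCast)
    p = p->source;
  auto it = gepIndex.find(p);
  return it == gepIndex.end() ? -1 : it->second;
}
```

A direct GEP access and an access through a bitcast of that GEP both resolve to the same recorded index.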

cfang marked 8 inline comments as done. May 9 2018, 2:50 PM

cfang added inline comments.

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
330	Done in a separate patch under review.
354	Done in a separate patch under review.

Update based on Matt's comments.

arsenm added a reviewer: rampitec. May 29 2020, 9:13 AM

Herald added subscribers: kerbowa, jvesely. May 29 2020, 9:13 AM

Is this still needed?

I have run these new tests with ToT opt and there were no allocas remaining. I assume this change is not needed anymore.

This is no longer needed.

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp	73 lines
test/CodeGen/AMDGPU/vector-alloca.ll	105 lines

Diff 146011

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

Context not available.
	if (GEP->getNumOperands() != 3)	if (GEP->getNumOperands() != 3)
	return nullptr;	return nullptr;

	ConstantInt *I0 = dyn_cast<ConstantInt>(GEP->getOperand(1));	ConstantInt *I0 = dyn_cast<ConstantInt>(GEP->getOperand(1));
	if (!I0 || !I0->isZero())	if (!I0 || !I0->isZero())
	return nullptr;	return nullptr;

	return GEP->getOperand(2);	return GEP->getOperand(2);
	}	}

	// Not an instruction handled below to turn into a vector.	// NOTE: We mainly check whether a load or a store is vectorizable here.
		// A special case here is BITCAST of a GEP, in which case we check
		// whether all users of the BITCAST is vectorizable.
	//	//
	// TODO: Check isTriviallyVectorizable for calls and handle other	// TODO: Check isTriviallyVectorizable for calls and handle other
	// instructions.	// instructions.
	static bool canVectorizeInst(Instruction *Inst, User *User) {	static bool canVectorizeInst(Instruction *Inst, User *Used,
		std::vector<Value*> &WorkList) {
	switch (Inst->getOpcode()) {	switch (Inst->getOpcode()) {
	case Instruction::Load: {	case Instruction::Load: {
	LoadInst *LI = cast<LoadInst>(Inst);	LoadInst *LI = cast<LoadInst>(Inst);
	// Currently only handle the case where the Pointer Operand is a GEP so check for that case.	if (LI->isVolatile())
	return isa<GetElementPtrInst>(LI->getPointerOperand()) && !LI->isVolatile();	return false;
		// Currently only handle the case where the Pointer Operand is a GEP
		// or a BITCAST.
		if (LI->getPointerOperand() != Used ||
		(!isa<GetElementPtrInst>(Used) && !isa<BitCastInst>(Used)))
		return false;
		WorkList.push_back(Inst);
		return true;
		}
		case Instruction::BitCast: {
		if (isa<GetElementPtrInst>(Used)) {
		for (User *BCUser : Inst->users()) {
		if (!canVectorizeInst(cast<Instruction>(BCUser), Inst, WorkList))
		return false;
		}
		return true;
		}
		// Fallthrough otherwise.
		// TODO: we do not actually have logic to handle general bitcast and
		// addrspacecast. We may have to be conservative here to avoid
		// unexpected results.
	}	}
	case Instruction::BitCast:
	case Instruction::AddrSpaceCast:	case Instruction::AddrSpaceCast:
	return true;	return true;
	case Instruction::Store: {	case Instruction::Store: {
	// Must be the stored pointer operand, not a stored value, plus
	// since it should be canonical form, the User should be a GEP.
	StoreInst *SI = cast<StoreInst>(Inst);	StoreInst *SI = cast<StoreInst>(Inst);
	return (SI->getPointerOperand() == User) && isa<GetElementPtrInst>(User) && !SI->isVolatile();	if (SI->isVolatile())
		return false;
		// Currently only handle the case where the Pointer Operand is a GEP
		// or a BITCAST.
		if (SI->getPointerOperand() != Used ||
		(!isa<GetElementPtrInst>(Used) && !isa<BitCastInst>(Used)))
		return false;
		WorkList.push_back(Inst);
		return true;
	}	}
	default:	default:
	return false;	return false;
	}	}
	}	}

	static bool tryPromoteAllocaToVector(AllocaInst *Alloca, AMDGPUAS AS) {	static bool tryPromoteAllocaToVector(AllocaInst *Alloca, AMDGPUAS AS) {

	if (DisablePromoteAllocaToVector) {	if (DisablePromoteAllocaToVector) {
	DEBUG(dbgs() << " Promotion alloca to vector is disabled\n");	DEBUG(dbgs() << " Promotion alloca to vector is disabled\n");
Context not available.
	!VectorType::isValidElementType(AllocaTy->getElementType())) {	!VectorType::isValidElementType(AllocaTy->getElementType())) {
	DEBUG(dbgs() << " Cannot convert type to vector\n");	DEBUG(dbgs() << " Cannot convert type to vector\n");
	return false;	return false;
	}	}

	std::map<GetElementPtrInst*, Value*> GEPVectorIdx;	std::map<GetElementPtrInst*, Value*> GEPVectorIdx;
	std::vector<Value*> WorkList;	std::vector<Value*> WorkList;
	for (User *AllocaUser : Alloca->users()) {	for (User *AllocaUser : Alloca->users()) {
	GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(AllocaUser);	GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(AllocaUser);
	if (!GEP) {	if (!GEP) {
	if (!canVectorizeInst(cast<Instruction>(AllocaUser), Alloca))	if (!canVectorizeInst(cast<Instruction>(AllocaUser), Alloca, WorkList))
	return false;	return false;

	WorkList.push_back(AllocaUser);
	continue;	continue;
	}	}

	Value *Index = GEPToVectorIndex(GEP);	Value *Index = GEPToVectorIndex(GEP);

	// If we can't compute a vector index from this GEP, then we can't	// If we can't compute a vector index from this GEP, then we can't
	// promote this alloca to vector.	// promote this alloca to vector.
	if (!Index) {	if (!Index) {
	DEBUG(dbgs() << " Cannot compute vector index for GEP " << *GEP << '\n');	DEBUG(dbgs() << " Cannot compute vector index for GEP " << *GEP << '\n');
	return false;	return false;
	}	}

	GEPVectorIdx[GEP] = Index;	GEPVectorIdx[GEP] = Index;
	for (User *GEPUser : AllocaUser->users()) {	for (User *GEPUser : AllocaUser->users()) {
	if (!canVectorizeInst(cast<Instruction>(GEPUser), AllocaUser))	if (!canVectorizeInst(cast<Instruction>(GEPUser), AllocaUser, WorkList))
	return false;	return false;

	WorkList.push_back(GEPUser);
	}	}
	}	}

	VectorType *VectorTy = arrayTypeToVecType(AllocaTy);	VectorType *VectorT = arrayTypeToVecType(AllocaTy);

	DEBUG(dbgs() << " Converting alloca to vector "	DEBUG(dbgs() << " Converting alloca to vector "
	<< *AllocaTy << " -> " << *VectorTy << '\n');	<< *AllocaTy << " -> " << *VectorT << '\n');

	for (Value *V : WorkList) {	for (Value *V : WorkList) {
	Instruction *Inst = cast<Instruction>(V);	Instruction *Inst = cast<Instruction>(V);
	IRBuilder<> Builder(Inst);	IRBuilder<> Builder(Inst);
	switch (Inst->getOpcode()) {	switch (Inst->getOpcode()) {
	case Instruction::Load: {	case Instruction::Load: {
	Type *VecPtrTy = VectorTy->getPointerTo(AS.PRIVATE_ADDRESS);
	Value *Ptr = cast<LoadInst>(Inst)->getPointerOperand();	Value *Ptr = cast<LoadInst>(Inst)->getPointerOperand();
		VectorType *VectorTy = VectorT;
		if (BitCastInst *BC = dyn_cast<BitCastInst>(Ptr)) {
		VectorTy = VectorType::get(Ptr->getType()->getPointerElementType(),
		AllocaTy->getNumElements());
		Ptr = BC->getOperand(0);
		}
		Type *VecPtrTy = VectorTy->getPointerTo(AS.PRIVATE_ADDRESS);
	Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);	Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);

	Value *BitCast = Builder.CreateBitCast(Alloca, VecPtrTy);	Value *BitCast = Builder.CreateBitCast(Alloca, VecPtrTy);
	Value *VecValue = Builder.CreateLoad(BitCast);	Value *VecValue = Builder.CreateLoad(BitCast);
	Value *ExtractElement = Builder.CreateExtractElement(VecValue, Index);	Value *ExtractElement = Builder.CreateExtractElement(VecValue, Index);
	Inst->replaceAllUsesWith(ExtractElement);	Inst->replaceAllUsesWith(ExtractElement);
	Inst->eraseFromParent();	Inst->eraseFromParent();
	break;	break;
	}	}
	case Instruction::Store: {	case Instruction::Store: {
	Type *VecPtrTy = VectorTy->getPointerTo(AS.PRIVATE_ADDRESS);

	StoreInst *SI = cast<StoreInst>(Inst);	StoreInst *SI = cast<StoreInst>(Inst);
	Value *Ptr = SI->getPointerOperand();	Value *Ptr = SI->getPointerOperand();
		VectorType *VectorTy = VectorT;
		if (BitCastInst *BC = dyn_cast<BitCastInst>(Ptr)) {
		VectorTy = VectorType::get(Ptr->getType()->getPointerElementType(),
		AllocaTy->getNumElements());
		Ptr = BC->getOperand(0);
		}
		Type *VecPtrTy = VectorTy->getPointerTo(AS.PRIVATE_ADDRESS);
	Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);	Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);
	Value *BitCast = Builder.CreateBitCast(Alloca, VecPtrTy);	Value *BitCast = Builder.CreateBitCast(Alloca, VecPtrTy);
	Value *VecValue = Builder.CreateLoad(BitCast);	Value *VecValue = Builder.CreateLoad(BitCast);
	Value *NewVecValue = Builder.CreateInsertElement(VecValue,	Value *NewVecValue = Builder.CreateInsertElement(VecValue,
	SI->getValueOperand(),	SI->getValueOperand(),
	Index);	Index);
	Builder.CreateStore(NewVecValue, BitCast);	Builder.CreateStore(NewVecValue, BitCast);
	Inst->eraseFromParent();	Inst->eraseFromParent();
	break;	break;
	}	}
Context not available.

test/CodeGen/AMDGPU/vector-alloca.ll

Context not available.
	store i32 0, i32 addrspace(5)* %z	store i32 0, i32 addrspace(5)* %z
	store i32 0, i32 addrspace(5)* %w	store i32 0, i32 addrspace(5)* %w
	%tmp1 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 %w_index	%tmp1 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 %w_index
	store i32 1, i32 addrspace(5)* %tmp1	store i32 1, i32 addrspace(5)* %tmp1
	%tmp2 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 %r_index	%tmp2 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 %r_index
	%tmp3 = load i32, i32 addrspace(5)* %tmp2	%tmp3 = load i32, i32 addrspace(5)* %tmp2
	store i32 %tmp3, i32 addrspace(1)* %out	store i32 %tmp3, i32 addrspace(1)* %out
	ret void	ret void
	}	}

	; This test should be optimize to:
	; store i32 0, i32 addrspace(1)* %out

	; OPT-LABEL: @bitcast_gep(
	; OPT-LABEL: store i32 0, i32 addrspace(1)* %out, align 4

	; FUNC-LABEL: {{^}}bitcast_gep:
	; EG: STORE_RAW
	define amdgpu_kernel void @bitcast_gep(i32 addrspace(1)* %out, i32 %w_index, i32 %r_index) {
	entry:
	%tmp = alloca [4 x i32], addrspace(5)
	%x = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 0
	%y = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 1
	%z = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 2
	%w = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 3
	store i32 0, i32 addrspace(5)* %x
	store i32 0, i32 addrspace(5)* %y
	store i32 0, i32 addrspace(5)* %z
	store i32 0, i32 addrspace(5)* %w
	%tmp1 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 1
	%tmp2 = bitcast i32 addrspace(5)* %tmp1 to [4 x i32] addrspace(5)*
	%tmp3 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp2, i32 0, i32 0
	%tmp4 = load i32, i32 addrspace(5)* %tmp3
	store i32 %tmp4, i32 addrspace(1)* %out
	ret void
	}

	; OPT-LABEL: @vector_read_bitcast_gep(	; OPT-LABEL: @vector_read_bitcast_gep(
	; OPT: %0 = extractelement <4 x i32> <i32 1065353216, i32 1, i32 2, i32 3>, i32 %index	; OPT: %0 = extractelement <4 x i32> <i32 1065353216, i32 1, i32 2, i32 3>, i32 %index
	; OPT: store i32 %0, i32 addrspace(1)* %out, align 4	; OPT: store i32 %0, i32 addrspace(1)* %out, align 4
	define amdgpu_kernel void @vector_read_bitcast_gep(i32 addrspace(1)* %out, i32 %index) {	define amdgpu_kernel void @vector_read_bitcast_gep(i32 addrspace(1)* %out, i32 %index) {
	entry:	entry:
	%tmp = alloca [4 x i32], addrspace(5)	%tmp = alloca [4 x i32], addrspace(5)
	%x = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 0	%x = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 0
	%y = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 1	%y = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 1
	%z = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 2	%z = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 2
	%w = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 3	%w = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 3
Context not available.
	%w = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 3	%w = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 3
	store i32 0, i32 addrspace(5)* %x	store i32 0, i32 addrspace(5)* %x
	store i32 1, i32 addrspace(5)* %y	store i32 1, i32 addrspace(5)* %y
	store i32 2, i32 addrspace(5)* %z	store i32 2, i32 addrspace(5)* %z
	store i32 3, i32 addrspace(5)* %w	store i32 3, i32 addrspace(5)* %w
	%tmp1 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 %index	%tmp1 = getelementptr [4 x i32], [4 x i32] addrspace(5)* %tmp, i32 0, i32 %index
	%tmp2 = load i32, i32 addrspace(5)* %tmp1	%tmp2 = load i32, i32 addrspace(5)* %tmp1
	store i32 %tmp2, i32 addrspace(1)* %out	store i32 %tmp2, i32 addrspace(1)* %out
	ret void	ret void
	}	}

		; OPT-LABEL: @write_bitcast_gep_read(
		; OPT: %0 = insertelement <3 x i32> zeroinitializer, i32 12, i32 %w_index
		; OPT: %1 = bitcast <3 x i32> %0 to <3 x float>
		; OPT: %2 = extractelement <3 x float> %1, i32 %r_index
		; OPT: store float %2, float addrspace(1)* %out, align 4
		define amdgpu_kernel void @write_bitcast_gep_read(float addrspace(1)* %out, i32 %w_index, i32 %r_index) {
		entry:
		%scratch = alloca [3 x i32], addrspace(5)
		%x = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 0
		%y = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 1
		%z = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 2
		store i32 0, i32 addrspace(5)* %x
		store i32 0, i32 addrspace(5)* %y
		store i32 0, i32 addrspace(5)* %z

		%gep_write = getelementptr inbounds [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 %w_index
		store i32 12, i32 addrspace(5)* %gep_write, align 4

		%gep_read = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 %r_index
		%bc_read = bitcast i32 addrspace(5)* %gep_read to float addrspace(5)*
		%result = load float, float addrspace(5)* %bc_read
		store float %result, float addrspace(1)* %out

		ret void
		}

		; OPT-LABEL: @bitcast_gep_write_read(
		; OPT: %0 = insertelement <3 x float> zeroinitializer, float 1.200000e+01, i32 %w_index
		; OPT: %1 = bitcast <3 x float> %0 to <3 x i32>
		; OPT: %2 = extractelement <3 x i32> %1, i32 %r_index
		; OPT: store i32 %2, i32 addrspace(1)* %out, align 4
		define amdgpu_kernel void @bitcast_gep_write_read(i32 addrspace(1)* %out, i32 %w_index, i32 %r_index) {
		entry:
		%scratch = alloca [3 x i32], addrspace(5)
		%x = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 0
		%y = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 1
		%z = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 2
		store i32 0, i32 addrspace(5)* %x
		store i32 0, i32 addrspace(5)* %y
		store i32 0, i32 addrspace(5)* %z

		%gep_write = getelementptr inbounds [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 %w_index
		%bc_write = bitcast i32 addrspace(5)* %gep_write to float addrspace(5)*
		store float 12.0, float addrspace(5)* %bc_write, align 4

		%gep_read = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 %r_index
		%result = load i32, i32 addrspace(5)* %gep_read
		store i32 %result, i32 addrspace(1)* %out

		ret void
		}

		; OPT-LABEL: @bitcast_gep_write_bitcast_gep_read(
		; OPT: %0 = insertelement <3 x float> zeroinitializer, float 1.200000e+01, i32 %w_index
		; OPT: %1 = extractelement <3 x float> %0, i32 %r_index
		; OPT: store float %1, float addrspace(1)* %out, align 4
		define amdgpu_kernel void @bitcast_gep_write_bitcast_gep_read(float addrspace(1)* %out, i32 %w_index, i32 %r_index) {
		entry:
		%scratch = alloca [3 x i32], addrspace(5)
		%x = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 0
		%y = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 1
		%z = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 2
		store i32 0, i32 addrspace(5)* %x
		store i32 0, i32 addrspace(5)* %y
		store i32 0, i32 addrspace(5)* %z

		%gep_write = getelementptr inbounds [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 %w_index
		%bc_write = bitcast i32 addrspace(5)* %gep_write to float addrspace(5)*
		store float 12.0, float addrspace(5)* %bc_write, align 4

		%gep_read = getelementptr [3 x i32], [3 x i32] addrspace(5)* %scratch, i32 0, i32 %r_index
		%bc_read = bitcast i32 addrspace(5)* %gep_read to float addrspace(5)*
		%result = load float, float addrspace(5)* %bc_read
		store float %result, float addrspace(1)* %out

		ret void
		}
Context not available.