This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Support non-entry block static sized allocas
Closed, Public

Authored by arsenm on May 27 2020, 8:41 AM.

Details

Reviewers

rampitec
scott.linder
cdevadas
saiislam
jdoerfert

Summary

OpenMP emits these for some reason, so handle them. Assume these use
4096 bytes by default, with a flag to override this. Also change the
related stack assumption for calls to have a flag.
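
The accounting described in the summary can be sketched in isolation. This is a minimal, compile-time-checked illustration rather than the patch itself: `FrameSummary` and `estimatePrivateSegmentSize` are hypothetical names, and the 16-byte static portion is only inferred by subtracting the assumed amount from the ScratchSize values the new test expects for the align4 kernel (4112 and 1040).

```cpp
#include <cstdint>

// Hypothetical stand-in for the two frame properties the estimate depends on.
struct FrameSummary {
  uint64_t StaticStackSize; // bytes known at compile time
  bool HasVarSizedObjects;  // true for dynamic or non-entry block allocas
};

// Mirrors the default of -amdgpu-assume-dynamic-stack-object-size.
constexpr uint32_t DefaultAssumedDynamicSize = 4096;

// Add a fixed guess on top of the known stack size whenever the frame
// contains objects whose placement is not resolved statically.
constexpr uint64_t
estimatePrivateSegmentSize(FrameSummary F,
                           uint32_t AssumedDynamicSize = DefaultAssumedDynamicSize) {
  return F.StaticStackSize + (F.HasVarSizedObjects ? AssumedDynamicSize : 0);
}

// Values taken from the align4 kernel in the new test; the 16 is inferred.
static_assert(estimatePrivateSegmentSize(FrameSummary{16, true}) == 4112,
              "default assumption");
static_assert(estimatePrivateSegmentSize(FrameSummary{16, true}, 1024) == 1040,
              "overridden assumption");
static_assert(estimatePrivateSegmentSize(FrameSummary{16, false}) == 16,
              "no dynamic objects, no extra scratch");
```

Passing 1024 as the second argument corresponds to running llc with -amdgpu-assume-dynamic-stack-object-size=1024, as the second RUN line of the test does.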

Event Timeline

arsenm created this revision. May 27 2020, 8:41 AM

Herald added a reviewer: jdoerfert. May 27 2020, 8:41 AM

Herald added a project: Restricted Project.

Herald added subscribers: sstefan1, kerbowa, hiraditya and 8 others.

arsenm added a parent revision: D80638: AMDGPU: Set StackPointerRegisterToSaveRestore. May 27 2020, 8:42 AM

Do you happen to have the input for which OpenMP emitted them?

Can you add a test showing a kernel along with the resulting ScratchSize, please?

Fix stack growth direction and scale amount by wavefront size
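
To make the arithmetic behind this update concrete, here is a minimal sketch of the stack pointer bump that the custom lowering builds as DAG nodes. All function names are hypothetical, plain integers stand in for SDValues, and the stack alignment of 16 is an assumption made for the example. The 64-byte size and the wave size of 64 correspond to the `[16 x i32]` allocas and the gfx900 target used in the test.

```cpp
#include <cstdint>

enum class StackGrowth { Up, Down };

// The scratch stack pointer is shared by the whole wave, so a per-lane size
// is shifted left by log2(wavefront size) to get the increment.
constexpr uint32_t scaleByWave(uint32_t SizePerLane, unsigned WaveSizeLog2) {
  return SizePerLane << WaveSizeLog2;
}

// Add when the stack grows up, subtract when it grows down.
constexpr uint32_t moveSP(uint32_t SP, uint32_t Scaled, StackGrowth Dir) {
  return Dir == StackGrowth::Up ? SP + Scaled : SP - Scaled;
}

// Mask down only when the requested alignment exceeds the stack alignment.
constexpr uint32_t alignIfNeeded(uint32_t V, uint32_t Align,
                                 uint32_t StackAlign) {
  return Align > StackAlign ? (V & (0u - Align)) : V;
}

constexpr uint32_t newStackPointer(uint32_t SP, uint32_t SizePerLane,
                                   unsigned WaveSizeLog2, uint32_t Align,
                                   uint32_t StackAlign, StackGrowth Dir) {
  return alignIfNeeded(moveSP(SP, scaleByWave(SizePerLane, WaveSizeLog2), Dir),
                       Align, StackAlign);
}

// 64 bytes per lane on a 64-wide wave gives the 0x1000 seen in the test.
static_assert(scaleByWave(64, 6) == 0x1000, "wave-scaled increment");
// align 4 does not exceed the assumed stack alignment, so no masking.
static_assert(newStackPointer(0x450, 64, 6, 4, 16, StackGrowth::Up) == 0x1450,
              "no realignment");
// align 64 clears the low six bits of the bumped value.
static_assert(newStackPointer(0x450, 64, 6, 64, 16, StackGrowth::Up) == 0x1440,
              "realigned");
static_assert(newStackPointer(0x2000, 64, 6, 4, 16, StackGrowth::Down) == 0x1000,
              "downward growth subtracts");
```

The upward case matches the `s_add_i32 s6, s32, 0x1000` instruction in the expected test output, and the masked case matches the `s_andn2_b32 s6, s6, 63` that follows it in the align64 variants.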

In D80639#2057705, @rampitec wrote:

Can you add a test showing a kernel along with the resulting ScratchSize, please?

I wasn't sure what to report for the size, so this just misses it entirely. The old code object has is_dynamic_callstack = 1, which I'm not sure actually did anything. I guess we could just pick a big number here like is already done for the external call case? I guess I could pick a smaller, large number?

In D80639#2057740, @arsenm wrote:

In D80639#2057705, @rampitec wrote:

Can you add a test showing a kernel along with the resulting ScratchSize, please?

I wasn't sure what to report for the size, so this just misses it entirely. The old code object has is_dynamic_callstack = 1, which I'm not sure actually did anything. I guess we could just pick a big number here like is already done for the external call case? I guess I could pick a smaller, large number?

Probably yes. We need to allocate it somehow. A large number does not seem unreasonable until we have something better.

Assume 4096 bytes for dynamic sized objects

Added execution test here: https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/296

LGTM

This revision is now accepted and ready to land. May 27 2020, 12:50 PM

5e007fe9980cc44e9c4a14c9baf3bdfb012d2c18

Johannes: here is a reduced source test case. Let me know what else you might need.

#include <stdio.h>

int main (void)
{

int ng =12;
int nxyz = 5000;
#pragma omp target teams distribute 
for (int gid = 0; gid < nxyz; gid++) {
  #pragma omp parallel for
  for (unsigned int g = 0; g < ng; g++) {
      int a = 0;
  }  
}
return 0;

}

In D80639#2070155, @ronlieb wrote:
Johannes: here is a reduced source test case. Let me know what else you might need.

#include <stdio.h>

int main (void)
{
int ng =12;
int nxyz = 5000;
#pragma omp target teams distribute 
for (int gid = 0; gid < nxyz; gid++) {
  #pragma omp parallel for
  for (unsigned int g = 0; g < ng; g++) {
      int a = 0;
  }  
}
return 0;
}

@ronlieb Thanks. This is a "bug" in OpenMPOpt that we need to address there. I'll put it on my (mental) to-do list.

Is this still happening? I might have fixed the IRBuilder.

In D80639#2200184, @jdoerfert wrote:

Is this still happening? I might have fixed the IRBuilder.

I think so. I also just fixed the actual codegen in ec8c172d01eb14eba890f36205da0613dda7f742

Revision Contents

Path                                            Size
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp     27 lines
llvm/lib/Target/AMDGPU/SIISelLowering.h         3 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp       63 lines
llvm/test/CodeGen/AMDGPU/non-entry-alloca.ll    271 lines

Diff 266620

llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

[... 43 lines not shown ...]
 #include "llvm/Support/TargetParser.h"
 #include "llvm/Support/TargetRegistry.h"
 #include "llvm/Target/TargetLoweringObjectFile.h"

 using namespace llvm;
 using namespace llvm::AMDGPU;
 using namespace llvm::AMDGPU::HSAMD;

+// We need to tell the runtime some amount ahead of time if we don't know the
+// true stack size. Assume a smaller number if this is only due to dynamic /
+// non-entry block allocas.
+static cl::opt<uint32_t> AssumedStackSizeForExternalCall(
+  "amdgpu-assume-external-call-stack-size",
+  cl::desc("Assumed stack use of any external call (in bytes)"),
+  cl::Hidden,
+  cl::init(16384));
+
+static cl::opt<uint32_t> AssumedStackSizeForDynamicSizeObjects(
+  "amdgpu-assume-dynamic-stack-object-size",
+  cl::desc("Assumed extra stack use if there are any "
+           "variable sized objects (in bytes)"),
+  cl::Hidden,
+  cl::init(4096));
+
 // This should get the default rounding mode from the kernel. We just set the
 // default here, but this could change if the OpenCL rounding mode pragmas are
 // used.
 //
 // The denormal mode here should match what is reported by the OpenCL runtime
 // for the CL_FP_DENORM bit from CL_DEVICE_{HALF|SINGLE|DOUBLE}_FP_CONFIG, but
 // can also be override to flush with the -cl-denorms-are-zero compiler flag.
 //
[... 572 lines not shown ...] AMDGPUAsmPrinter::SIFunctionResourceInfo AMDGPUAsmPrinter::analyzeResourceUsage(
   // really needed.
   if (Info.UsesFlatScratch && !MFI->hasFlatScratchInit() &&
       (!hasAnyNonFlatUseOfReg(MRI, *TII, AMDGPU::FLAT_SCR) &&
        !hasAnyNonFlatUseOfReg(MRI, *TII, AMDGPU::FLAT_SCR_LO) &&
        !hasAnyNonFlatUseOfReg(MRI, *TII, AMDGPU::FLAT_SCR_HI))) {
     Info.UsesFlatScratch = false;
   }

-  Info.HasDynamicallySizedStack = FrameInfo.hasVarSizedObjects();
   Info.PrivateSegmentSize = FrameInfo.getStackSize();
+
+  // Assume a big number if there are any unknown sized objects.
+  Info.HasDynamicallySizedStack = FrameInfo.hasVarSizedObjects();
+  if (Info.HasDynamicallySizedStack)
+    Info.PrivateSegmentSize += AssumedStackSizeForDynamicSizeObjects;

   if (MFI->isStackRealigned())
     Info.PrivateSegmentSize += FrameInfo.getMaxAlign().value();

   Info.UsesVCC = MRI.isPhysRegUsed(AMDGPU::VCC_LO) ||
                  MRI.isPhysRegUsed(AMDGPU::VCC_HI);

   // If there are no calls, MachineRegisterInfo can tell us the used register
   // count easily.
[... 252 lines not shown ...] for (const MachineInstr &MI : MBB) {

           // 48 SGPRs - vcc, - flat_scr, -xnack
           int MaxSGPRGuess =
             47 - IsaInfo::getNumExtraSGPRs(&ST, true, ST.hasFlatAddressSpace());
           MaxSGPR = std::max(MaxSGPR, MaxSGPRGuess);
           MaxVGPR = std::max(MaxVGPR, 23);
           MaxAGPR = std::max(MaxAGPR, 23);

-          CalleeFrameSize = std::max(CalleeFrameSize, UINT64_C(16384));
+          CalleeFrameSize = std::max(CalleeFrameSize,
+            static_cast<uint64_t>(AssumedStackSizeForExternalCall));

           Info.UsesVCC = true;
           Info.UsesFlatScratch = ST.hasFlatAddressSpace();
           Info.HasDynamicallySizedStack = true;
         } else {
           // We force CodeGen to run in SCC order, so the callee's register
           // usage etc. should be the cumulative usage of all callees.

           auto I = CallGraphResourceInfo.find(Callee);
[... 438 lines not shown ...]

llvm/lib/Target/AMDGPU/SIISelLowering.h

[... 331 lines not shown ...] bool isEligibleForTailCallOptimization(
     SDValue Callee, CallingConv::ID CalleeCC, bool isVarArg,
     const SmallVectorImpl<ISD::OutputArg> &Outs,
     const SmallVectorImpl<SDValue> &OutVals,
     const SmallVectorImpl<ISD::InputArg> &Ins, SelectionDAG &DAG) const;

   SDValue LowerCall(CallLoweringInfo &CLI,
                     SmallVectorImpl<SDValue> &InVals) const override;

+  SDValue lowerDYNAMIC_STACKALLOCImpl(SDValue Op, SelectionDAG &DAG) const;
+  SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;
+
   Register getRegisterByName(const char* RegName, LLT VT,
                              const MachineFunction &MF) const override;

   MachineBasicBlock *splitKillBlock(MachineInstr &MI,
                                     MachineBasicBlock *BB) const;

   void bundleInstWithWaitcnt(MachineInstr &MI) const;
   MachineBasicBlock *emitGWSMemViolTestLoop(MachineInstr &MI,
[... 109 lines not shown ...]

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

[... 3,083 lines not shown ...] SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,

   // Handle result values, copying them out of physregs into vregs that we
   // return.
   return LowerCallResult(Chain, InFlag, CallConv, IsVarArg, Ins, DL, DAG,
                          InVals, IsThisReturn,
                          IsThisReturn ? OutVals[0] : SDValue());
 }

+// This is identical to the default implementation in ExpandDYNAMIC_STACKALLOC,
+// except for applying the wave size scale to the increment amount.
+SDValue SITargetLowering::lowerDYNAMIC_STACKALLOCImpl(
+    SDValue Op, SelectionDAG &DAG) const {
+  const MachineFunction &MF = DAG.getMachineFunction();
+  const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
+
+  SDLoc dl(Op);
+  EVT VT = Op.getValueType();
+  SDValue Tmp1 = Op;
+  SDValue Tmp2 = Op.getValue(1);
+  SDValue Tmp3 = Op.getOperand(2);
+  SDValue Chain = Tmp1.getOperand(0);
+
+  Register SPReg = Info->getStackPtrOffsetReg();
+
+  // Chain the dynamic stack allocation so that it doesn't modify the stack
+  // pointer when other instructions are using the stack.
+  Chain = DAG.getCALLSEQ_START(Chain, 0, 0, dl);
+
+  SDValue Size = Tmp2.getOperand(1);
+  SDValue SP = DAG.getCopyFromReg(Chain, dl, SPReg, VT);
+  Chain = SP.getValue(1);
+  unsigned Align = cast<ConstantSDNode>(Tmp3)->getZExtValue();
+  const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
+  const TargetFrameLowering *TFL = ST.getFrameLowering();
+  unsigned Opc =
+    TFL->getStackGrowthDirection() == TargetFrameLowering::StackGrowsUp ?
+    ISD::ADD : ISD::SUB;
+
+  SDValue ScaledSize = DAG.getNode(
+    ISD::SHL, dl, VT, Size,
+    DAG.getConstant(ST.getWavefrontSizeLog2(), dl, MVT::i32));
+
+  unsigned StackAlign = TFL->getStackAlignment();
+  Tmp1 = DAG.getNode(Opc, dl, VT, SP, ScaledSize); // Value
+  if (Align > StackAlign)
+    Tmp1 = DAG.getNode(ISD::AND, dl, VT, Tmp1,
+                       DAG.getConstant(-(uint64_t)Align, dl, VT));
+  Chain = DAG.getCopyToReg(Chain, dl, SPReg, Tmp1); // Output chain
+  Tmp2 = DAG.getCALLSEQ_END(
+    Chain, DAG.getIntPtrConstant(0, dl, true),
+    DAG.getIntPtrConstant(0, dl, true), SDValue(), dl);
+
+  return DAG.getMergeValues({Tmp1, Tmp2}, dl);
+}
+
+SDValue SITargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op,
+                                                  SelectionDAG &DAG) const {
+  // We only handle constant sizes here to allow non-entry block, static sized
+  // allocas. A truly dynamic value is more difficult to support because we
+  // don't know if the size value is uniform or not. If the size isn't uniform,
+  // we would need to do a wave reduction to get the maximum size to know how
+  // much to increment the uniform stack pointer.
+  SDValue Size = Op.getOperand(1);
+  if (isa<ConstantSDNode>(Size))
+    return lowerDYNAMIC_STACKALLOCImpl(Op, DAG); // Use "generic" expansion.
+
+  return AMDGPUTargetLowering::LowerDYNAMIC_STACKALLOC(Op, DAG);
+}
+
 Register SITargetLowering::getRegisterByName(const char* RegName, LLT VT,
                                              const MachineFunction &MF) const {
   Register Reg = StringSwitch<Register>(RegName)
     .Case("m0", AMDGPU::M0)
     .Case("exec", AMDGPU::EXEC)
     .Case("exec_lo", AMDGPU::EXEC_LO)
     .Case("exec_hi", AMDGPU::EXEC_HI)
     .Case("flat_scratch", AMDGPU::FLAT_SCR)
[... 1,200 lines not shown ...] SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::SMAX:
   case ISD::UMIN:
   case ISD::UMAX:
   case ISD::FADD:
   case ISD::FMUL:
   case ISD::FMINNUM_IEEE:
   case ISD::FMAXNUM_IEEE:
     return splitBinaryVectorOp(Op, DAG);
+  case ISD::DYNAMIC_STACKALLOC:
+    return LowerDYNAMIC_STACKALLOC(Op, DAG);
   }
   return SDValue();
 }

 static SDValue adjustLoadValueTypeImpl(SDValue Result, EVT LoadVT,
                                        const SDLoc &DL,
                                        SelectionDAG &DAG, bool Unpacked) {
   if (!LoadVT.isVector())
[... 7,001 lines not shown ...]

llvm/test/CodeGen/AMDGPU/non-entry-alloca.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DEFAULTSIZE %s
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -amdgpu-assume-dynamic-stack-object-size=1024 < %s | FileCheck -check-prefixes=GCN,ASSUME1024 %s

				; FIXME: Generated test checks do not check metadata at the end of the
				; function, so this also includes manually added checks.

				; Test that we can select a statically sized alloca outside of the
				; entry block.

				; FIXME: FunctionLoweringInfo unhelpfully doesn't preserve an
				; alignment less than the stack alignment.
				define amdgpu_kernel void @kernel_non_entry_block_static_alloca_uniformly_reached_align4(i32 addrspace(1)* %out, i32 %arg.cond0, i32 %arg.cond1, i32 %in) {
				; GCN-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align4:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_add_u32 flat_scratch_lo, s6, s9
				; GCN-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
				; GCN-NEXT: s_add_u32 s0, s0, s9
				; GCN-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x8
				; GCN-NEXT: s_addc_u32 s1, s1, 0
				; GCN-NEXT: s_mov_b32 s33, 0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_cmp_lg_u32 s8, 0
				; GCN-NEXT: s_cbranch_scc1 BB0_3
				; GCN-NEXT: ; %bb.1: ; %bb.0
				; GCN-NEXT: s_cmp_lg_u32 s9, 0
				; GCN-NEXT: s_cbranch_scc1 BB0_3
				; GCN-NEXT: ; %bb.2: ; %bb.1
				; GCN-NEXT: s_add_i32 s6, s32, 0x1000
				; GCN-NEXT: s_lshl_b32 s7, s10, 2
				; GCN-NEXT: s_mov_b32 s32, s6
				; GCN-NEXT: v_mov_b32_e32 v2, s6
				; GCN-NEXT: v_mov_b32_e32 v1, 0
				; GCN-NEXT: s_add_i32 s6, s6, s7
				; GCN-NEXT: v_mov_b32_e32 v3, 1
				; GCN-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v3, v2, s[0:3], 0 offen offset:4
				; GCN-NEXT: v_mov_b32_e32 v1, s6
				; GCN-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_add_u32_e32 v2, v1, v0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: v_mov_b32_e32 v0, s4
				; GCN-NEXT: v_mov_b32_e32 v1, s5
				; GCN-NEXT: global_store_dword v[0:1], v2, off
				; GCN-NEXT: BB0_3: ; %bb.2
				; GCN-NEXT: v_mov_b32_e32 v0, 0
				; GCN-NEXT: global_store_dword v[0:1], v0, off
				; GCN-NEXT: s_endpgm

				entry:
				%cond0 = icmp eq i32 %arg.cond0, 0
				br i1 %cond0, label %bb.0, label %bb.2

				bb.0:
				%alloca = alloca [16 x i32], align 4, addrspace(5)
				%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
				%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1
				%cond1 = icmp eq i32 %arg.cond1, 0
				br i1 %cond1, label %bb.1, label %bb.2

				bb.1:
				; Use the alloca outside of the defining block.
				store i32 0, i32 addrspace(5)* %gep0
				store i32 1, i32 addrspace(5)* %gep1
				%gep2 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 %in
				%load = load i32, i32 addrspace(5)* %gep2
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%add = add i32 %load, %tid
				store i32 %add, i32 addrspace(1)* %out
				br label %bb.2

				bb.2:
				store volatile i32 0, i32 addrspace(1)* undef
				ret void
				}
				; DEFAULTSIZE: .amdhsa_private_segment_fixed_size 4112
				; DEFAULTSIZE: ; ScratchSize: 4112

				; ASSUME1024: .amdhsa_private_segment_fixed_size 1040
				; ASSUME1024: ; ScratchSize: 1040

				define amdgpu_kernel void @kernel_non_entry_block_static_alloca_uniformly_reached_align64(i32 addrspace(1)* %out, i32 %arg.cond, i32 %in) {
				; GCN-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align64:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_add_u32 flat_scratch_lo, s6, s9
				; GCN-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
				; GCN-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x8
				; GCN-NEXT: s_add_u32 s0, s0, s9
				; GCN-NEXT: s_addc_u32 s1, s1, 0
				; GCN-NEXT: s_mov_b32 s33, 0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: s_cmp_lg_u32 s6, 0
				; GCN-NEXT: s_cbranch_scc1 BB1_2
				; GCN-NEXT: ; %bb.1: ; %bb.0
				; GCN-NEXT: s_add_i32 s6, s32, 0x1000
				; GCN-NEXT: s_andn2_b32 s6, s6, 63
				; GCN-NEXT: s_lshl_b32 s7, s7, 2
				; GCN-NEXT: s_mov_b32 s32, s6
				; GCN-NEXT: v_mov_b32_e32 v2, s6
				; GCN-NEXT: v_mov_b32_e32 v1, 0
				; GCN-NEXT: s_add_i32 s6, s6, s7
				; GCN-NEXT: v_mov_b32_e32 v3, 1
				; GCN-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v3, v2, s[0:3], 0 offen offset:4
				; GCN-NEXT: v_mov_b32_e32 v1, s6
				; GCN-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen
				; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_add_u32_e32 v2, v1, v0
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: v_mov_b32_e32 v0, s4
				; GCN-NEXT: v_mov_b32_e32 v1, s5
				; GCN-NEXT: global_store_dword v[0:1], v2, off
				; GCN-NEXT: BB1_2: ; %bb.1
				; GCN-NEXT: v_mov_b32_e32 v0, 0
				; GCN-NEXT: global_store_dword v[0:1], v0, off
				; GCN-NEXT: s_endpgm
				entry:
				%cond = icmp eq i32 %arg.cond, 0
				br i1 %cond, label %bb.0, label %bb.1

				bb.0:
				%alloca = alloca [16 x i32], align 64, addrspace(5)
				%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
				%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1
				store i32 0, i32 addrspace(5)* %gep0
				store i32 1, i32 addrspace(5)* %gep1
				%gep2 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 %in
				%load = load i32, i32 addrspace(5)* %gep2
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%add = add i32 %load, %tid
				store i32 %add, i32 addrspace(1)* %out
				br label %bb.1

				bb.1:
				store volatile i32 0, i32 addrspace(1)* undef
				ret void
				}

				; DEFAULTSIZE: .amdhsa_private_segment_fixed_size 4160
				; DEFAULTSIZE: ; ScratchSize: 4160

				; ASSUME1024: .amdhsa_private_segment_fixed_size 1088
				; ASSUME1024: ; ScratchSize: 1088


				define void @func_non_entry_block_static_alloca_align4(i32 addrspace(1)* %out, i32 %arg.cond0, i32 %arg.cond1, i32 %in) {
				; GCN-LABEL: func_non_entry_block_static_alloca_align4:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_mov_b32 s7, s33
				; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
				; GCN-NEXT: s_mov_b32 s33, s32
				; GCN-NEXT: s_add_u32 s32, s32, 0x400
				; GCN-NEXT: s_and_saveexec_b64 s[4:5], vcc
				; GCN-NEXT: s_cbranch_execz BB2_3
				; GCN-NEXT: ; %bb.1: ; %bb.0
				; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v3
				; GCN-NEXT: s_and_b64 exec, exec, vcc
				; GCN-NEXT: s_cbranch_execz BB2_3
				; GCN-NEXT: ; %bb.2: ; %bb.1
				; GCN-NEXT: s_add_i32 s6, s32, 0x1000
				; GCN-NEXT: v_mov_b32_e32 v2, 0
				; GCN-NEXT: v_mov_b32_e32 v3, s6
				; GCN-NEXT: v_mov_b32_e32 v6, 1
				; GCN-NEXT: buffer_store_dword v2, v3, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v6, v3, s[0:3], 0 offen offset:4
				; GCN-NEXT: v_lshl_add_u32 v2, v4, 2, s6
				; GCN-NEXT: buffer_load_dword v2, v2, s[0:3], 0 offen
				; GCN-NEXT: v_and_b32_e32 v3, 0x3ff, v5
				; GCN-NEXT: s_mov_b32 s32, s6
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_add_u32_e32 v2, v2, v3
				; GCN-NEXT: global_store_dword v[0:1], v2, off
				; GCN-NEXT: BB2_3: ; %bb.2
				; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
				; GCN-NEXT: v_mov_b32_e32 v0, 0
				; GCN-NEXT: global_store_dword v[0:1], v0, off
				; GCN-NEXT: s_sub_u32 s32, s32, 0x400
				; GCN-NEXT: s_mov_b32 s33, s7
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]

				entry:
				%cond0 = icmp eq i32 %arg.cond0, 0
				br i1 %cond0, label %bb.0, label %bb.2

				bb.0:
				%alloca = alloca [16 x i32], align 4, addrspace(5)
				%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
				%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1
				%cond1 = icmp eq i32 %arg.cond1, 0
				br i1 %cond1, label %bb.1, label %bb.2

				bb.1:
				; Use the alloca outside of the defining block.
				store i32 0, i32 addrspace(5)* %gep0
				store i32 1, i32 addrspace(5)* %gep1
				%gep2 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 %in
				%load = load i32, i32 addrspace(5)* %gep2
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%add = add i32 %load, %tid
				store i32 %add, i32 addrspace(1)* %out
				br label %bb.2

				bb.2:
				store volatile i32 0, i32 addrspace(1)* undef
				ret void
				}

				define void @func_non_entry_block_static_alloca_align64(i32 addrspace(1)* %out, i32 %arg.cond, i32 %in) {
				; GCN-LABEL: func_non_entry_block_static_alloca_align64:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GCN-NEXT: s_add_u32 s4, s32, 0xfc0
				; GCN-NEXT: s_mov_b32 s7, s33
				; GCN-NEXT: s_and_b32 s33, s4, 0xfffff000
				; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
				; GCN-NEXT: s_add_u32 s32, s32, 0x2000
				; GCN-NEXT: s_and_saveexec_b64 s[4:5], vcc
				; GCN-NEXT: s_cbranch_execz BB3_2
				; GCN-NEXT: ; %bb.1: ; %bb.0
				; GCN-NEXT: s_add_i32 s6, s32, 0x1000
				; GCN-NEXT: s_andn2_b32 s6, s6, 63
				; GCN-NEXT: v_mov_b32_e32 v2, 0
				; GCN-NEXT: v_mov_b32_e32 v5, s6
				; GCN-NEXT: v_mov_b32_e32 v6, 1
				; GCN-NEXT: buffer_store_dword v2, v5, s[0:3], 0 offen
				; GCN-NEXT: buffer_store_dword v6, v5, s[0:3], 0 offen offset:4
				; GCN-NEXT: v_lshl_add_u32 v2, v3, 2, s6
				; GCN-NEXT: buffer_load_dword v2, v2, s[0:3], 0 offen
				; GCN-NEXT: v_and_b32_e32 v3, 0x3ff, v4
				; GCN-NEXT: s_mov_b32 s32, s6
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_add_u32_e32 v2, v2, v3
				; GCN-NEXT: global_store_dword v[0:1], v2, off
				; GCN-NEXT: BB3_2: ; %bb.1
				; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
				; GCN-NEXT: v_mov_b32_e32 v0, 0
				; GCN-NEXT: global_store_dword v[0:1], v0, off
				; GCN-NEXT: s_sub_u32 s32, s32, 0x2000
				; GCN-NEXT: s_mov_b32 s33, s7
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: s_setpc_b64 s[30:31]
				entry:
				%cond = icmp eq i32 %arg.cond, 0
				br i1 %cond, label %bb.0, label %bb.1

				bb.0:
				%alloca = alloca [16 x i32], align 64, addrspace(5)
				%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
				%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1
				store i32 0, i32 addrspace(5)* %gep0
				store i32 1, i32 addrspace(5)* %gep1
				%gep2 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 %in
				%load = load i32, i32 addrspace(5)* %gep2
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%add = add i32 %load, %tid
				store i32 %add, i32 addrspace(1)* %out
				br label %bb.1

				bb.1:
				store volatile i32 0, i32 addrspace(1)* undef
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x() #0

				attributes #0 = { nounwind readnone speculatable }