This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add optional bounds checking for scratch accesses
Abandoned · Public

Authored by critson on Apr 16 2019, 6:12 AM.

Details

Reviewers

arsenm
nhaehnle

Summary

Implement a pass, enabled by -amdgpu-scratch-bounds-checking,
which adds bounds checks to scratch accesses.
When this pass is enabled, out-of-bounds writes have no effect
and out-of-bounds reads return zero.
This is useful for GFX9, where hardware no longer performs
bounds checking on scratch accesses and hence page faults
result from out-of-bounds accesses generated by shaders.

Change-Id: Id2ee4b1f32e70b6bde2541db755727b6a407721b
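
The semantics described in the summary can be illustrated with a small standalone model. This is only a sketch and not code from the patch: `CheckedScratch` and its members are hypothetical names, offsets are assumed to be dword-aligned byte offsets, and, like the comparison the pass emits, the check looks only at the starting offset of the access.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one thread's scratch region; not part of the patch.
struct CheckedScratch {
  std::vector<uint32_t> Words;

  explicit CheckedScratch(std::size_t SizeInBytes)
      : Words(SizeInBytes / 4, 0) {}

  std::size_t sizeInBytes() const { return Words.size() * 4; }

  // Out-of-bounds reads return zero.
  uint32_t load(std::size_t ByteOffset) const {
    assert(ByteOffset % 4 == 0 && "model only handles dword-aligned offsets");
    return ByteOffset < sizeInBytes() ? Words[ByteOffset / 4] : 0;
  }

  // Out-of-bounds writes have no effect.
  void store(std::size_t ByteOffset, uint32_t Value) {
    assert(ByteOffset % 4 == 0 && "model only handles dword-aligned offsets");
    if (ByteOffset < sizeInBytes())
      Words[ByteOffset / 4] = Value;
  }
};
```

With a 16-byte region, a store to offset 64 is dropped and a later load from offset 64 yields zero, while in-bounds accesses behave normally.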

Diff Detail

Repository

rL LLVM

Build Status

Buildable 32617
Build 32616: arc lint + arc unit

Event Timeline

critson created this revision. Apr 16 2019, 6:12 AM

Herald added a project: Restricted Project. Apr 16 2019, 6:12 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 6 others.

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined. I don't want to start trying to define it at an arbitrary point in the backend. Crashing on the invalid access is a much easier problem to debug. This seems to be a partial replacement for asan? If this is intended as some user visible semantic fix, that requires more thought and probably needs to be a new address space.

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
211–229	You can use TII::getAddNoCarry
244–245	I'm working on only allowing instructions that are terminators to modify exec, but this is introducing a new exec write in the middle of a block
329–330	This isn't going to catch scratch accesses without an MMO, you need to check the opcodes
332–333	I dislike building a worklist of all instructions in the function to handle. This shouldn't be hard to handle as you go through the function
342–343	This isn't going to be strong enough to guarantee that there will never be an out of bounds access. This approach also isn't going to work in a callable function, in the presence of calls, or variable sized stack objects (which I'm planning on implementing)

If the goal is to have a semantically always dereferencable stack pointer, I think we need to create a new addrspace. It would then be a no-op addrspacecast from an alloca, which the frontend desiring safe stack access would be responsible for inserting. We would then need to track the current global stack size in the ABI somewhere, and selection would need to insert this kind of bounds check code based on that

critson added inline comments. May 7 2019, 7:55 AM

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
342–343	On that basis do you have any opinions on the appropriate method for establishing stack size going forward?

In D60772#1470073, @arsenm wrote:

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined.

Graphics APIs have "robust (buffer) access" settings in which out of bounds accesses must be guaranteed not to crash the program.

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
342–343	You could emit the stack size as a TargetGlobalAddress referring to a special symbol, and ask PAL to fixup the relocations for that symbol at load time. Details would have to be figured out, but that works as a rough approach.

Remove worklist usage.
Move from global to subtarget option.
Add second pass to resolve scratch size later in backend.
Add missing lit tests.

Harbormaster completed remote builds in B32617: Diff 201929. May 29 2019, 8:33 AM

critson marked 6 inline comments as done. May 29 2019, 8:38 AM

critson added inline comments.

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
244–245	This code inserts new basic blocks such that only terminators modify exec, unless I missed something?
329–330	I don't think we can do anything sensible with instructions without MMO. They could be accessing any address space, so it would not be appropriate to apply scratch bounds checks to them.
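
To make the exec discussion concrete, the vector-address load path in the diff (V_CMP_LT_U32, S_AND_SAVEEXEC_B64, the access, S_MOV_B64 back to exec, then V_CNDMASK_B32) can be modelled lane by lane. The following is a rough simulation under stated assumptions, not code from the patch: the names `LaneMask`, `LaneValues`, `compareLessThan` and `checkedLoad` are hypothetical, a 64-lane wave is assumed, and addresses are byte offsets into a dword-indexed scratch array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned WaveSize = 64;
using LaneMask = uint64_t;
using LaneValues = std::array<uint32_t, WaveSize>;

static bool laneActive(LaneMask Mask, unsigned Lane) {
  return (Mask >> Lane) & 1;
}

// Models V_CMP_LT_U32: one result bit per lane, set where Addr < Size.
LaneMask compareLessThan(const LaneValues &Addr, uint32_t Size) {
  LaneMask Mask = 0;
  for (unsigned L = 0; L < WaveSize; ++L)
    if (Addr[L] < Size)
      Mask |= LaneMask(1) << L;
  return Mask;
}

// Models the bounds-checked load; Exec is restored before returning.
LaneValues checkedLoad(const std::vector<uint32_t> &Scratch,
                       const LaneValues &Addr, LaneMask &Exec) {
  assert(Scratch.size() <= 0x3fffffffu && "model uses 32-bit byte sizes");
  const uint32_t Size = static_cast<uint32_t>(Scratch.size() * 4);

  const LaneMask Cond = compareLessThan(Addr, Size);
  const LaneMask SavedExec = Exec; // S_AND_SAVEEXEC_B64 saves exec ...
  Exec &= Cond;                    // ... and masks off out-of-bounds lanes.

  LaneValues Loaded{};
  for (unsigned L = 0; L < WaveSize; ++L)
    if (laneActive(Exec, L))
      Loaded[L] = Scratch[Addr[L] / 4]; // only in-bounds lanes touch memory

  Exec = SavedExec; // S_MOV_B64 exec, <saved value>

  // V_CNDMASK_B32: active lanes keep the loaded value if in bounds, else zero.
  LaneValues Result{};
  for (unsigned L = 0; L < WaveSize; ++L)
    if (laneActive(Exec, L))
      Result[L] = laneActive(Cond, L) ? Loaded[L] : 0;
  return Result;
}
```

The model deliberately ignores where the exec writes sit relative to block terminators, which is the point under discussion in the inline comments above; it only shows the data flow the inserted instructions are meant to implement.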

Ping.
I was wondering if I can get a second reading on this?
I believe I've addressed everything except the introduction of a new address space, as this seems somewhat heavyweight to me.
If the address space is considered absolutely necessary then I'll work on that next.

In D60772#1501198, @nhaehnle wrote:

In D60772#1470073, @arsenm wrote:

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined.

Graphics APIs have "robust (buffer) access" settings in which out of bounds accesses must be guaranteed not to crash the program.

And this is allowed to be set for stack objects?

lib/Target/AMDGPU/SIFixScratchSize.cpp
71	No c string functions

arsenm added inline comments. Jun 6 2019, 7:29 PM

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
899	I would expect this kind of handling to be done as part of selection, not a pretty late pass
lib/Target/AMDGPU/SIFixScratchSize.cpp
62	This whole pass doesn't work when there's dynamic stack usage?
65	Can early exit if there is no stack usage
68	Can't assume the instruction use
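
Two of the SIFixScratchSize.cpp comments ("No c string functions" and "Can't assume the instruction use") can be illustrated with a standalone model of the symbol-resolution loop. This is a sketch only: `Operand`, `Instr` and `resolveScratchSize` are hypothetical stand-ins for the LLVM machine IR types, and `std::string` is used to keep the example self-contained, whereas in-tree code would more likely compare through `StringRef`. It does not address the dynamic stack usage question.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a machine operand: an immediate or a symbol.
struct Operand {
  bool IsSymbol;
  std::string Symbol; // meaningful only when IsSymbol is true
  int64_t Imm;        // meaningful only when IsSymbol is false

  static Operand symbol(std::string Name) {
    return Operand{true, std::move(Name), 0};
  }
  static Operand imm(int64_t Value) {
    return Operand{false, std::string(), Value};
  }
};

// Hypothetical stand-in for a machine instruction.
struct Instr {
  std::vector<Operand> Operands;
};

static const char *const ScratchSizeSymbol = "___SCRATCH_SIZE";

// Replaces every reference to the scratch size symbol with the final size.
// All operands of all instructions are scanned, so the result does not
// depend on which opcode carries the symbol.
bool resolveScratchSize(std::vector<Instr> &Code, int64_t StackSize) {
  assert(StackSize >= 0 && "stack size cannot be negative");
  bool Changed = false;
  for (Instr &I : Code) {
    for (Operand &Op : I.Operands) {
      // std::string comparison by content, instead of strcmp.
      if (Op.IsSymbol && Op.Symbol == ScratchSizeSymbol) {
        Op = Operand::imm(StackSize);
        Changed = true;
      }
    }
  }
  return Changed;
}
```

Running the function a second time on the same code returns false, since no symbol references remain.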

In D60772#1501198, @nhaehnle wrote:

In D60772#1470073, @arsenm wrote:

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined.

Graphics APIs have "robust (buffer) access" settings in which out of bounds accesses must be guaranteed not to crash the program.

There must be some restrictions on this? I don't think any kind of codegen pass like this will be strong enough to guarantee this. An IR optimization could have introduced a trap or something based on a detected invalid access. This probably won't happen today since things are overly conservative with non-0 address spaces, but this is something I want to fix

This has been superseded by front-end work in graphics compiler.

Herald added a subscriber: kerbowa. Oct 11 2020, 10:34 PM

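Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPU.h	9 lines
lib/Target/AMDGPU/AMDGPU.td	7 lines
lib/Target/AMDGPU/AMDGPUSubtarget.h	5 lines
lib/Target/AMDGPU/AMDGPUSubtarget.cpp	1 line
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	2 lines
lib/Target/AMDGPU/CMakeLists.txt	2 lines
lib/Target/AMDGPU/SIFixScratchSize.cpp	82 lines
lib/Target/AMDGPU/SIInsertScratchBounds.cpp	354 lines
test/CodeGen/AMDGPU/scratch-bounds.ll	226 lines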

Diff 201929

lib/Target/AMDGPU/AMDGPU.h

[...]
 FunctionPass *createSIAddIMGInitPass();
 FunctionPass *createSIShrinkInstructionsPass();
 FunctionPass *createSILoadStoreOptimizerPass();
 FunctionPass *createSIWholeQuadModePass();
 FunctionPass *createSIFixControlFlowLiveIntervalsPass();
 FunctionPass *createSIOptimizeExecMaskingPreRAPass();
 FunctionPass *createSIFixSGPRCopiesPass();
 FunctionPass *createSIMemoryLegalizerPass();
+FunctionPass *createSIInsertScratchBoundsPass();
+FunctionPass *createSIFixScratchSizePass();
 FunctionPass *createSIInsertWaitcntsPass();
 FunctionPass *createSIPreAllocateWWMRegsPass();
 FunctionPass *createSIFormMemoryClausesPass();
 FunctionPass *createAMDGPUSimplifyLibCallsPass(const TargetOptions &);
 FunctionPass *createAMDGPUUseNativeCallsPass();
 FunctionPass *createAMDGPUCodeGenPreparePass();
 FunctionPass *createAMDGPUMachineCFGStructurizerPass();
 FunctionPass *createAMDGPURewriteOutArgumentsPass();
[...]
 extern char &SILoadStoreOptimizerID;

 void initializeSIWholeQuadModePass(PassRegistry &);
 extern char &SIWholeQuadModeID;

 void initializeSILowerControlFlowPass(PassRegistry &);
 extern char &SILowerControlFlowID;

+void initializeSIInsertScratchBoundsPass(PassRegistry &);
+extern char &SIInsertScratchBoundsID;
+
+void initializeSIFixScratchSizePass(PassRegistry &);
+extern char &SIFixScratchSizeID;
+extern const char *const SIScratchSizeSymbol;
+
 void initializeSIInsertSkipsPass(PassRegistry &);
 extern char &SIInsertSkipsPassID;

 void initializeSIOptimizeExecMaskingPass(PassRegistry &);
 extern char &SIOptimizeExecMaskingID;

 void initializeSIPreAllocateWWMRegsPass(PassRegistry &);
 extern char &SIPreAllocateWWMRegsID;
[...]

lib/Target/AMDGPU/AMDGPU.td

[...]
 // for the base pointer.
 def FeatureEnableUnsafeDSOffsetFolding : SubtargetFeature <
   "unsafe-ds-offset-folding",
   "EnableUnsafeDSOffsetFolding",
   "true",
   "Force using DS instruction immediate offsets on SI"
 >;

+def FeatureEnableScratchBoundsChecks : SubtargetFeature<
+  "enable-scratch-bounds-checks",
+  "EnableScratchBoundsChecks",
+  "true",
+  "Enable insertion of bounds checks on scratch accesses"
+>;
+
 def FeatureEnableSIScheduler : SubtargetFeature<"si-scheduler",
   "EnableSIScheduler",
   "true",
   "Enable SI Machine Scheduler"
 >;

 def FeatureEnableDS128 : SubtargetFeature<"enable-ds128",
   "EnableDS128",
[...]

lib/Target/AMDGPU/AMDGPUSubtarget.h

[...] (context: protected:)
   bool EnableXNACK;
   bool DoesNotSupportXNACK;
   bool EnableCuMode;
   bool TrapHandler;

   // Used as options.
   bool EnableLoadStoreOpt;
   bool EnableUnsafeDSOffsetFolding;
+  bool EnableScratchBoundsChecks;
   bool EnableSIScheduler;
   bool EnableDS128;
   bool EnablePRTStrictNull;
   bool DumpCode;

   // Subtarget statically properties set by tablegen
   bool FP64;
   bool FMA;
[...] (context: public:)
   }

   bool hasNSAEncoding() const {
     return HasNSAEncoding;
   }

   bool hasMadF16() const;

+  bool enableScratchBoundsChecks() const {
+    return EnableScratchBoundsChecks;
+  }
+
   bool enableSIScheduler() const {
     return EnableSIScheduler;
   }

   bool loadStoreOptEnabled() const {
     return EnableLoadStoreOpt;
   }

[...]

lib/Target/AMDGPU/AMDGPUSubtarget.cpp

[...] (context: GCNSubtarget::GCNSubtarget(const Triple &TT, StringRef GPU, StringRef FS,)
     HasApertureRegs(false),
     EnableXNACK(false),
     DoesNotSupportXNACK(false),
     EnableCuMode(false),
     TrapHandler(false),

     EnableLoadStoreOpt(false),
     EnableUnsafeDSOffsetFolding(false),
+    EnableScratchBoundsChecks(false),
     EnableSIScheduler(false),
     EnableDS128(false),
     EnablePRTStrictNull(false),
     DumpCode(false),

     FP64(false),
     GCN3Encoding(false),
     CIInsts(false),
[...]

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[...] (context: bool GCNPassConfig::addGlobalInstructionSelect() {)
   addPass(new InstructionSelect());
   return false;
 }

 void GCNPassConfig::addPreRegAlloc() {
   if (LateCFGStructurize) {
     addPass(createAMDGPUMachineCFGStructurizerPass());
   }
+  addPass(createSIInsertScratchBoundsPass());
[Inline comment] arsenm (Not Done): I would expect this kind of handling to be done as part of selection, not a pretty late pass
   addPass(createSIWholeQuadModePass());
 }

 void GCNPassConfig::addFastRegAlloc() {
   // FIXME: We have to disable the verifier here because of PHIElimination +
   // TwoAddressInstructions disabling it.

   // This must be run immediately after phi elimination and before
[...]
 void GCNPassConfig::addPreSched2() {
 }

 void GCNPassConfig::addPreEmitPass() {
   addPass(createSIMemoryLegalizerPass());
   addPass(createSIInsertWaitcntsPass());
   addPass(createSIShrinkInstructionsPass());
   addPass(createSIModeRegisterPass());
+  addPass(createSIFixScratchSizePass());

   // The hazard recognizer that runs as part of the post-ra scheduler does not
   // guarantee to be able handle all hazards correctly. This is because if there
   // are multiple scheduling regions in a basic block, the regions are scheduled
   // bottom up, so when we begin to schedule a region we don't know what
   // instructions were emitted directly before it.
   //
   // Here we add a stand-alone hazard recognizer pass which can handle all
[...]

lib/Target/AMDGPU/CMakeLists.txt

[...] (context: add_llvm_target(AMDGPUCodeGen)
   R600MachineFunctionInfo.cpp
   R600MachineScheduler.cpp
   R600OpenCLImageTypeLoweringPass.cpp
   R600OptimizeVectorRegisters.cpp
   R600Packetizer.cpp
   R600RegisterInfo.cpp
   SIAddIMGInit.cpp
   SIAnnotateControlFlow.cpp
+  SIFixScratchSize.cpp
   SIFixSGPRCopies.cpp
   SIFixupVectorISel.cpp
   SIFixVGPRCopies.cpp
   SIPreAllocateWWMRegs.cpp
   SIFoldOperands.cpp
   SIFormMemoryClauses.cpp
   SIFrameLowering.cpp
+  SIInsertScratchBounds.cpp
   SIInsertSkips.cpp
   SIInsertWaitcnts.cpp
   SIInstrInfo.cpp
   SIISelLowering.cpp
   SILoadStoreOptimizer.cpp
   SILowerControlFlow.cpp
   SILowerI1Copies.cpp
   SIMachineFunctionInfo.cpp
[...]

lib/Target/AMDGPU/SIFixScratchSize.cpp

This file was added.

				//===- SIFixScratchSize.cpp - resolve scratch size symbols -===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// This pass replaces references to the scratch size symbol with the
				/// actual scratch size. This pass should be run late, i.e. when the scratch
				/// size for a given machine function is known.
				///
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUSubtarget.h"
				#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
				#include "SIInstrInfo.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"

				#include <set>

				using namespace llvm;

				#define DEBUG_TYPE "si-fix-scratch-size"

				namespace {

				class SIFixScratchSize : public MachineFunctionPass {
				public:
				static char ID;

				SIFixScratchSize() : MachineFunctionPass(ID) {}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.setPreservesAll();
				MachineFunctionPass::getAnalysisUsage(AU);
				}

				bool runOnMachineFunction(MachineFunction &MF) override;
				};

				} // end anonymous namespace

				INITIALIZE_PASS(SIFixScratchSize, DEBUG_TYPE,
				"SI Resolve Scratch Size Symbols",
				false, false)

				char SIFixScratchSize::ID = 0;

				char &llvm::SIFixScratchSizeID = SIFixScratchSize::ID;

				const char *const llvm::SIScratchSizeSymbol = "___SCRATCH_SIZE";

				FunctionPass *llvm::createSIFixScratchSizePass() {
				return new SIFixScratchSize;
				}

				bool SIFixScratchSize::runOnMachineFunction(MachineFunction &MF) {
				const MachineFrameInfo &FrameInfo = MF.getFrameInfo();
				const uint64_t StackSize = FrameInfo.getStackSize();
				[Inline comment] arsenm (Not Done): This whole pass doesn't work when there's dynamic stack usage?

				bool Changed = false;

				[Inline comment] arsenm (Not Done): Can early exit if there is no stack usage
				for (MachineBasicBlock &MBB : MF) {
				for (MachineInstr &MI : MBB) {
				if (MI.getOpcode() == AMDGPU::S_MOV_B32) {
				[Inline comment] arsenm (Not Done): Can't assume the instruction use
				MachineOperand& Src = MI.getOperand(1);
				if (Src.isSymbol()) {
				if (strcmp(Src.getSymbolName(), SIScratchSizeSymbol) == 0) {
				[Inline comment] arsenm (Not Done): No c string functions
				LLVM_DEBUG(dbgs() << "Fixing: " << MI << "\n");
				Src.ChangeToImmediate(StackSize);
				Changed = true;
				}
				}
				}
				}
				}

				return Changed;
				}

lib/Target/AMDGPU/SIInsertScratchBounds.cpp

This file was added.

				//===- SIInsertScratchBounds.cpp - insert scratch bounds checks -===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// This pass inserts bounds checks on scratch accesses.
				/// Out-of-bounds reads return zero, and out-of-bounds writes have no effect.
				/// This is intended to be used on GFX9 where bounds checking is no longer
				/// performed by hardware and hence page faults can result from out-of-bounds
				/// accesses by shaders.
				///
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUSubtarget.h"
				#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
				#include "SIInstrInfo.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"

				#include <set>

				using namespace llvm;

				#define DEBUG_TYPE "si-insert-scratch-bounds"

				namespace {

				class SIInsertScratchBounds : public MachineFunctionPass {
				private:
				const GCNSubtarget *ST;
				const SIInstrInfo *TII;
				MachineRegisterInfo *MRI;
				const SIRegisterInfo *RI;

				public:
				static char ID;

				SIInsertScratchBounds() : MachineFunctionPass(ID) {}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				MachineFunctionPass::getAnalysisUsage(AU);
				}

				bool insertBoundsCheck(MachineFunction &MF, MachineInstr *MI,
				const int64_t SizeEstimate,
				const unsigned SizeReg,
				MachineBasicBlock **NextBB);

				bool runOnMachineFunction(MachineFunction &MF) override;
				};

				static void zeroReg(MachineBasicBlock &MBB, MachineRegisterInfo *MRI,
				const SIRegisterInfo *RI, const SIInstrInfo *TII,
				MachineBasicBlock::iterator &I, const DebugLoc &DL,
				unsigned Reg) {

				auto EndDstRC = MRI->getRegClass(Reg);
				uint32_t RegSize = RI->getRegSizeInBits(*EndDstRC) / 32;

				assert(RI->isVGPR(*MRI, Reg) && "can only zero VGPRs");

				if (RegSize == 1)
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_MOV_B32_e32), Reg).addImm(0);
				else {
				SmallVector<unsigned, 8> TRegs;
				for (unsigned i = 0; i < RegSize; ++i) {
				unsigned TReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_MOV_B32_e32), TReg).addImm(0);
				TRegs.push_back(TReg);
				}
				MachineInstrBuilder MIB =
				BuildMI(MBB, I, DL, TII->get(AMDGPU::REG_SEQUENCE), Reg);
				for (unsigned i = 0; i < RegSize; ++i) {
				MIB.addReg(TRegs[i]);
				MIB.addImm(RI->getSubRegFromChannel(i));
				}
				}
				}

				static void cndmask0Reg(MachineBasicBlock &MBB, MachineRegisterInfo *MRI,
				const SIRegisterInfo *RI, const SIInstrInfo *TII,
				MachineBasicBlock::iterator &I, const DebugLoc &DL,
				unsigned SrcReg, unsigned MaskReg, bool KillMask,
				unsigned DstReg) {

				auto EndDstRC = MRI->getRegClass(DstReg);
				uint32_t RegSize = RI->getRegSizeInBits(*EndDstRC) / 32;

				assert(RI->isVGPR(*MRI, DstReg) && "can only cndmask VGPRs");

				if (RegSize == 1)
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_CNDMASK_B32_e64), DstReg)
				.addImm(0)
				.addImm(0)
				.addImm(0)
				.addReg(SrcReg)
				.addReg(MaskReg, getKillRegState(KillMask));
				else {
				SmallVector<unsigned, 8> TRegs;
				for (unsigned i = 0; i < RegSize; ++i) {
				unsigned TReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_CNDMASK_B32_e64), TReg)
				.addImm(0)
				.addImm(0)
				.addImm(0)
				.addReg(SrcReg, 0, AMDGPU::sub0 + i)
				.addReg(MaskReg, getKillRegState(KillMask && (i == (RegSize - 1))));
				TRegs.push_back(TReg);
				}
				MachineInstrBuilder MIB =
				BuildMI(MBB, I, DL, TII->get(AMDGPU::REG_SEQUENCE), DstReg);
				for (unsigned i = 0; i < RegSize; ++i) {
				MIB.addReg(TRegs[i]);
				MIB.addImm(RI->getSubRegFromChannel(i));
				}
				}
				}

				} // end anonymous namespace

				INITIALIZE_PASS(SIInsertScratchBounds, DEBUG_TYPE,
				"SI Insert Scratch Bounds Checks",
				false, false)

				char SIInsertScratchBounds::ID = 0;

				char &llvm::SIInsertScratchBoundsID = SIInsertScratchBounds::ID;

				FunctionPass *llvm::createSIInsertScratchBoundsPass() {
				return new SIInsertScratchBounds;
				}

				bool SIInsertScratchBounds::insertBoundsCheck(MachineFunction &MF,
				MachineInstr *MI,
				const int64_t SizeEstimate,
				const unsigned SizeReg,
				MachineBasicBlock **NextBB) {
				const bool IsLoad = MI->mayLoad();
				DebugLoc DL = MI->getDebugLoc();

				const MachineOperand *Offset =
				TII->getNamedOperand(*MI, AMDGPU::OpName::offset);
				const MachineOperand *VAddr =
				TII->getNamedOperand(*MI, AMDGPU::OpName::vaddr);
				const MachineOperand *Addr =
				VAddr ? VAddr : TII->getNamedOperand(*MI, AMDGPU::OpName::saddr);

				if (!Addr || !Addr->isReg()) {
				// Constant offset -> determine bounds check statically
				if (Offset->getImm() < SizeEstimate) {
				// Statically in bounds
				return false;
				}
				// Else: estimate may be revised upward so we cannot statically delete
				}

				// Setup new block structure
				MachineBasicBlock *PreAccessBB = MI->getParent();
				MachineBasicBlock *ScratchAccessBB = MF.CreateMachineBasicBlock();
				MachineBasicBlock *PostAccessBB = MF.CreateMachineBasicBlock();
				*NextBB = PostAccessBB;

				MachineFunction::iterator MBBI(*PreAccessBB);
				++MBBI;

				MF.insert(MBBI, ScratchAccessBB);
				MF.insert(MBBI, PostAccessBB);

				ScratchAccessBB->addSuccessor(PostAccessBB);

				// Move instructions following scratch access to new basic block
				MachineBasicBlock::iterator SuccI(*MI);
				++SuccI;
				PostAccessBB->transferSuccessorsAndUpdatePHIs(PreAccessBB);
				PostAccessBB->splice(
				PostAccessBB->begin(), PreAccessBB, SuccI, PreAccessBB->end()
				);

				PreAccessBB->addSuccessor(ScratchAccessBB);

				// Move scratch access to its own basic block
				MI->removeFromParent();
				ScratchAccessBB->insertAfter(ScratchAccessBB->begin(), MI);

				MachineBasicBlock::iterator PreI = PreAccessBB->end();
				MachineBasicBlock::iterator PostI = PostAccessBB->begin();
				MachineBasicBlock::iterator ScratchI = ScratchAccessBB->end();
				unsigned AddrReg;
				bool KillAddr = false;

				if (Offset && (Offset->getImm() > 0)) {
				AddrReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
				KillAddr = true;

				if (Addr && Addr->isReg()) {
				TII->getAddNoCarry(*PreAccessBB, PreI, DL, AddrReg)
				.addImm(Offset->getImm())
				.addReg(Addr->getReg())
				.addImm(0); // clamp bit
				} else {
				BuildMI(*PreAccessBB, PreI, DL,
				TII->get(AMDGPU::V_MOV_B32_e32), AddrReg)
				.addImm(Offset->getImm());
				}
				} else {
				assert(Addr);
				AddrReg = Addr->getReg();
				}

				if (RI->isVGPR(*MRI, AddrReg)) {
				const unsigned CondReg
				= MRI->createVirtualRegister(&AMDGPU::SReg_64_XEXECRegClass);
				const unsigned ExecReg
				= MRI->createVirtualRegister(&AMDGPU::SReg_64_XEXECRegClass);

				BuildMI(*PreAccessBB, PreI, DL,
				TII->get(AMDGPU::V_CMP_LT_U32_e64), CondReg)
				.addReg(AddrReg, getKillRegState(KillAddr))
				.addReg(SizeReg);
				BuildMI(*PreAccessBB, PreI, DL,
				TII->get(AMDGPU::S_AND_SAVEEXEC_B64), ExecReg)
				.addReg(CondReg, getKillRegState(!IsLoad));
				BuildMI(*ScratchAccessBB, ScratchI, DL,
				TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
				.addReg(ExecReg, RegState::Kill);
				[Inline comment] arsenm (Done): You can use TII::getAddNoCarry

				if (IsLoad) {
				MachineOperand &Dst = MI->getOperand(0);
				const unsigned DstReg = Dst.getReg();
				const TargetRegisterClass *DstRC = MRI->getRegClass(DstReg);
				const unsigned LoadDstReg = MRI->createVirtualRegister(DstRC);

				Dst.setReg(LoadDstReg);

				cndmask0Reg(*PostAccessBB, MRI, RI, TII, PostI, DL,
				LoadDstReg, CondReg, true, DstReg);
				}
				} else {
				if (MI->mayLoad()) {
				// Load -> scalar comparison, then load, else load zero
				MachineBasicBlock *OutOfBoundsBB = MF.CreateMachineBasicBlock();
				[Inline comment] arsenm (Done): I'm working on only allowing instructions that are terminators to modify exec, but this is introducing a new exec write in the middle of a block
				[Inline comment] critson (Author, Not Done): This code inserts new basic blocks such that only terminators modify exec, unless I missed something?
				MachineBasicBlock::iterator OOBI = OutOfBoundsBB->end();

				MBBI--;
				MF.insert(MBBI, OutOfBoundsBB);
				OutOfBoundsBB->addSuccessor(PostAccessBB);
				PreAccessBB->addSuccessor(OutOfBoundsBB);

				// TODO: mark SCC as clobbered?
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CMP_LT_U32))
				.addReg(AddrReg, getKillRegState(KillAddr))
				.addReg(SizeReg);
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CBRANCH_SCC0))
				.addMBB(OutOfBoundsBB);

				BuildMI(*ScratchAccessBB, ScratchI, DL, TII->get(AMDGPU::S_BRANCH))
				.addMBB(PostAccessBB);

				MachineOperand &Dst = MI->getOperand(0);
				const unsigned DstReg = Dst.getReg();

				const TargetRegisterClass *DstRC = MRI->getRegClass(DstReg);
				const unsigned LoadDstReg = MRI->createVirtualRegister(DstRC);
				const unsigned ZeroDstReg = MRI->createVirtualRegister(DstRC);

				zeroReg(*OutOfBoundsBB, MRI, RI, TII, OOBI, DL, ZeroDstReg);

				BuildMI(*PostAccessBB, PostI, DL, TII->get(TargetOpcode::PHI), DstReg)
				.addReg(LoadDstReg)
				.addMBB(ScratchAccessBB)
				.addReg(ZeroDstReg)
				.addMBB(OutOfBoundsBB);

				Dst.setReg(LoadDstReg);
				} else {
				// Store -> scalar comparison and skip store
				// TODO: mark SCC as clobbered?
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CMP_LT_U32))
				.addReg(AddrReg, getKillRegState(KillAddr))
				.addReg(SizeReg);
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CBRANCH_SCC0))
				.addMBB(PostAccessBB);
				PreAccessBB->addSuccessor(PostAccessBB);
				}
				}

				return true;
				}

				bool SIInsertScratchBounds::runOnMachineFunction(MachineFunction &MF) {
				ST = &MF.getSubtarget<GCNSubtarget>();
				TII = ST->getInstrInfo();
				MRI = &MF.getRegInfo();
				RI = ST->getRegisterInfo();

				if (!ST->enableScratchBoundsChecks())
				return false;

				const MachineFrameInfo &FrameInfo = MF.getFrameInfo();
				const int64_t ScratchSizeEstimate =
				(int64_t) FrameInfo.estimateStackSize(MF);

				bool Changed = false;
				unsigned SizeReg = 0; // defer assigning a register until required

				MachineFunction::iterator NextBB;
				for (MachineFunction::iterator BI = MF.begin();
				BI != MF.end(); BI = NextBB) {
				NextBB = std::next(BI);
				MachineBasicBlock &MBB = *BI;
				MachineBasicBlock *NewNextBB = nullptr;

				for (MachineInstr &MI : MBB) {
				if (MI.mayLoad() || MI.mayStore()) {
				for (const auto &MMO : MI.memoperands()) {
				const unsigned AddrSpace = MMO->getPointerInfo().getAddrSpace();
				if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
				// uses scratch; needs to be processed
				if (!SizeReg)
				SizeReg = MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);

				Changed |= insertBoundsCheck(
				MF, &MI, ScratchSizeEstimate, SizeReg,
				&NewNextBB
				);
				break;
				[Inline comment] arsenm (Not Done): This isn't going to catch scratch accesses without an MMO, you need to check the opcodes
				[Inline comment] critson (Author, Not Done): I don't think we can do anything sensible with instructions without MMO. They could be accessing any address space, so it would not be appropriate to apply scratch bounds checks to them.
				}
				}
				}
				arsenm: I dislike building a worklist of all instructions in the function to handle. This shouldn't be hard to handle as you go through the function
				if (NewNextBB) {
				// Restart at the newly created next BB
				NextBB = MachineFunction::iterator(*NewNextBB);
				break;
				}
				}
				}

				// If scratch size is required then add to prelude
				if (Changed) {
				arsenm: This isn't going to be strong enough to guarantee that there will never be an out of bounds access. This approach also isn't going to work in a callable function, in the presence of calls, or variable sized stack objects (which I'm planning on implementing)
				critson: On that basis do you have any opinions on the appropriate method for establishing stack size going forward?
				nhaehnle: You could emit the stack size as a TargetGlobalAddress referring to a special symbol, and ask PAL to fixup the relocations for that symbol at load time. Details would have to be figured out, but that works as a rough approach.
				MachineBasicBlock *PreludeBB = &MF.front();
				MachineBasicBlock::iterator PreludeI = PreludeBB->begin();
				DebugLoc UnknownDL;

				BuildMI(*PreludeBB, PreludeI, UnknownDL,
				TII->get(AMDGPU::S_MOV_B32), SizeReg)
				.addExternalSymbol(SIScratchSizeSymbol);
				}

				return Changed;
				}
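
The load and store paths above are meant to give the behaviour stated in the summary: an out-of-bounds read returns zero and an out-of-bounds write has no effect. The following is a minimal host-side sketch of that behaviour for a single lane, not code from the patch. `ScratchModel`, its members and `scratchModelExample` are hypothetical names introduced only for illustration, and the backing store is padded by three bytes because the check compares only the start offset of the access against the bound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of one lane's scratch memory; not part of the patch.
class ScratchModel {
public:
  // `bounds` plays the role of the value the pass materializes in SizeReg.
  // Three bytes of padding keep a dword access at offset bounds-1 in range.
  explicit ScratchModel(uint32_t bounds)
      : bounds_(bounds), bytes_(static_cast<std::size_t>(bounds) + 3, 0) {}

  // Same predicate as the unsigned compare: the access proceeds when
  // bounds > offset.
  bool inBounds(uint32_t offset) const { return bounds_ > offset; }

  // Out-of-bounds reads return zero (the select against 0 on the load path).
  uint32_t loadDword(uint32_t offset) const {
    if (!inBounds(offset))
      return 0;
    uint32_t value;
    std::memcpy(&value, &bytes_[offset], sizeof(value));
    return value;
  }

  // Out-of-bounds writes are skipped entirely.
  void storeDword(uint32_t offset, uint32_t value) {
    if (!inBounds(offset))
      return;
    std::memcpy(&bytes_[offset], &value, sizeof(value));
  }

private:
  uint32_t bounds_;
  std::vector<uint8_t> bytes_;
};

// Usage: an in-range store is visible, an out-of-range one is dropped.
inline void scratchModelExample() {
  ScratchModel m(0x8004);
  m.storeDword(16, 42);
  assert(m.loadDword(16) == 42);
  m.storeDword(0x9000, 7);
  assert(m.loadDword(0x9000) == 0);
}
```

The model ignores everything the reviewers raise about calls, callable functions and accesses without an MMO; it only restates the intended per-access semantics.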

test/CodeGen/AMDGPU/scratch-bounds.ll

This file was added.

				; RUN: llc -verify-machineinstrs -march=amdgcn -mcpu=gfx900 -mattr=+max-private-element-size-16,+enable-scratch-bounds-checks < %s | FileCheck -enable-var-scope -check-prefix=GCN %s

				; GCN-LABEL: {{^}}bounds_check_load_i32:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8004
				; GCN: v_cmp_gt_u32_e64 [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDS]], [[OFFSET:v[0-9]+]]
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK]]
				; GCN: buffer_load_dword [[LOADVALUE:v[0-9]+]], [[OFFSET]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]
				; GCN: v_cndmask_b32_e64 v{{[0-9]+}}, 0, [[LOADVALUE]], [[BOUNDSMASK]]

				define amdgpu_kernel void @bounds_check_load_i32(i32 addrspace(1)* %out, i32 %offset) {
				entry:
				%scratch = alloca [8192 x i32], addrspace(5)

				%ptr = getelementptr [8192 x i32], [8192 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				%value = load i32, i32 addrspace(5)* %ptr
				store i32 %value, i32 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_store_i32:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8004
				; GCN: v_cmp_gt_u32_e64 [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDS]], [[OFFSET:v[0-9]+]]
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK]]
				; GCN: buffer_store_dword v{{[0-9]+}}, [[OFFSET]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]

				define amdgpu_kernel void @bounds_check_store_i32(i32 addrspace(1)* %out, i32 %value, i32 %offset) {
				entry:
				%scratch = alloca [8192 x i32], addrspace(5)

				%ptr = getelementptr [8192 x i32], [8192 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				store i32 %value, i32 addrspace(5)* %ptr
				store i32 %value, i32 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_load_i64:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8008
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]]
				; GCN: buffer_load_dwordx2 v{{\[}}[[LOADLO:[0-9]+]]:[[LOADHI:[0-9]+]]{{\]}}, [[OFFSET:v[0-9]+]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]
				; GCN-DAG: v_cndmask_b32_e64 v{{[0-9]+}}, 0, v[[LOADLO]], [[BOUNDSMASK]]
				; GCN-DAG: v_cndmask_b32_e64 v{{[0-9]+}}, 0, v[[LOADHI]], [[BOUNDSMASK]]

				define amdgpu_kernel void @bounds_check_load_i64(i64 addrspace(1)* %out, i32 %offset) {
				entry:
				%scratch = alloca [4096 x i64], addrspace(5)

				%ptr = getelementptr [4096 x i64], [4096 x i64] addrspace(5)* %scratch, i32 0, i32 %offset
				%value = load i64, i64 addrspace(5)* %ptr
				store i64 %value, i64 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_store_i64:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8008
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]]
				; GCN: buffer_store_dwordx2 v[{{[0-9]+}}:{{[0-9]+}}], [[OFFSET:v[0-9]+]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]

				define amdgpu_kernel void @bounds_check_store_i64(i64 addrspace(1)* %out, i64 %value, i32 %offset) {
				entry:
				%scratch = alloca [4096 x i64], addrspace(5)

				%ptr = getelementptr [4096 x i64], [4096 x i64] addrspace(5)* %scratch, i32 0, i32 %offset
				store i64 %value, i64 addrspace(5)* %ptr
				store i64 %value, i64 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_load_i128:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8008
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]]
				; GCN: buffer_load_dwordx4 v{{\[}}[[LOADLO:[0-9]+]]:[[LOADHI:[0-9]+]]{{\]}}, [[OFFSET:v[0-9]+]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]
				; GCN-DAG: v_cndmask_b32_e64 v{{[0-9]+}}, 0, v[[LOADLO]], [[BOUNDSMASK]]
				; GCN-DAG: v_cndmask_b32_e64 v{{[0-9]+}}, 0, v[[LOADHI]], [[BOUNDSMASK]]

				define amdgpu_kernel void @bounds_check_load_i128(i128 addrspace(1)* %out, i32 %offset) {
				entry:
				%scratch = alloca [2048 x i128], addrspace(5)

				%ptr = getelementptr [2048 x i128], [2048 x i128] addrspace(5)* %scratch, i32 0, i32 %offset
				%value = load i128, i128 addrspace(5)* %ptr
				store i128 %value, i128 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_store_i128:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8008
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]]
				; GCN: buffer_store_dwordx4 v[{{[0-9]+}}:{{[0-9]+}}], [[OFFSET:v[0-9]+]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]

				define amdgpu_kernel void @bounds_check_store_i128(i128 addrspace(1)* %out, i128 %value, i32 %offset) {
				entry:
				%scratch = alloca [2048 x i128], addrspace(5)

				%ptr = getelementptr [2048 x i128], [2048 x i128] addrspace(5)* %scratch, i32 0, i32 %offset
				store i128 %value, i128 addrspace(5)* %ptr
				store i128 %value, i128 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_static_valid_store_i32:
				; GCN-NOT: s_and_saveexec_b64
				; GCN: buffer_store_dword v{{[0-9]+}}, off, s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offset:20

				define amdgpu_kernel void @bounds_check_static_valid_store_i32(i32 addrspace(1)* %out, i32 %value, i32 %offset) {
				entry:
				%scratch = alloca [256 x i32], addrspace(5)

				%ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 4
				store i32 %value, i32 addrspace(5)* %ptr

				%load_ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				%val = load i32, i32 addrspace(5)* %load_ptr
				store i32 %val, i32 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_static_oob_store_i32:
				; GCN: buffer_store_dword v{{[0-9]+}}, off, s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offset:2052

				define amdgpu_kernel void @bounds_check_static_oob_store_i32(i32 addrspace(1)* %out, i32 %value, i32 %offset) {
				entry:
				%scratch = alloca [256 x i32], addrspace(5)

				%ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 512
				store i32 %value, i32 addrspace(5)* %ptr

				%load_ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				%val = load i32, i32 addrspace(5)* %load_ptr
				store i32 %val, i32 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_static_valid_load_i32:
				; GCN: buffer_store_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN-NOT: s_and_saveexec_b64
				; GCN: buffer_load_dword v{{[0-9]+}}, off, s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offset:20
				; GCN-NOT: v_cndmask_b32_e64

				define amdgpu_kernel void @bounds_check_static_valid_load_i32(i32 addrspace(1)* %out, i32 %value, i32 %offset) {
				entry:
				%scratch = alloca [256 x i32], addrspace(5)

				%store_ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				store i32 %value, i32 addrspace(5)* %store_ptr

				%ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 4
				%val = load i32, i32 addrspace(5)* %ptr

				store i32 %val, i32 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_static_oob_load_i32:
				; GCN: buffer_store_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen{{$}}
				; GCN: buffer_load_dword v{{[0-9]+}}, off, s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offset:2052

				define amdgpu_kernel void @bounds_check_static_oob_load_i32(i32 addrspace(1)* %out, i32 %value, i32 %offset) {
				entry:
				%scratch = alloca [256 x i32], addrspace(5)

				%store_ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				store i32 %value, i32 addrspace(5)* %store_ptr

				%ptr = getelementptr [256 x i32], [256 x i32] addrspace(5)* %scratch, i32 0, i32 512
				%val = load i32, i32 addrspace(5)* %ptr

				store i32 %val, i32 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_load_offset_i32:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8004
				; GCN: v_add_u32_e32 [[CMPOFFSET:v[0-9]+]], 16, [[OFFSET:v[0-9]+]]
				; GCN: v_cmp_gt_u32_e64 [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDS]], [[CMPOFFSET]]
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK]]
				; GCN: buffer_load_dword [[LOADVALUE:v[0-9]+]], [[OFFSET]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen offset:16{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]
				; GCN: v_cndmask_b32_e64 v{{[0-9]+}}, 0, [[LOADVALUE]], [[BOUNDSMASK]]

				define amdgpu_kernel void @bounds_check_load_offset_i32(i32 addrspace(1)* %out, i32 %offset) {
				entry:
				%scratch = alloca [8192 x i32], addrspace(5)

				%ptr.0 = getelementptr [8192 x i32], [8192 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				%ptr.1 = getelementptr i32, i32 addrspace(5)* %ptr.0, i32 4
				%value = load i32, i32 addrspace(5)* %ptr.1
				store i32 %value, i32 addrspace(1)* %out

				ret void
				}

				; GCN-LABEL: {{^}}bounds_check_store_offset_i32:
				; GCN: s_mov_b32 [[BOUNDS:s[0-9]+]], 0x8004
				; GCN: v_add_u32_e32 [[CMPOFFSET:v[0-9]+]], 16, [[OFFSET:v[0-9]+]]
				; GCN: v_cmp_gt_u32_e64 [[BOUNDSMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDS]], [[CMPOFFSET]]
				; GCN: s_and_saveexec_b64 [[EXECMASK:s\[[0-9]+:[0-9]+\]]], [[BOUNDSMASK]]
				; GCN: buffer_store_dword v{{[0-9]+}}, [[OFFSET]], s[{{[0-9]+}}:{{[0-9]+}}], s{{[0-9]+}} offen offset:16{{$}}
				; GCN: s_mov_b64 exec, [[EXECMASK]]

				define amdgpu_kernel void @bounds_check_store_offset_i32(i32 addrspace(1)* %out, i32 %value, i32 %offset) {
				entry:
				%scratch = alloca [8192 x i32], addrspace(5)

				%ptr.0 = getelementptr [8192 x i32], [8192 x i32] addrspace(5)* %scratch, i32 0, i32 %offset
				%ptr.1 = getelementptr i32, i32 addrspace(5)* %ptr.0, i32 4
				store i32 %value, i32 addrspace(5)* %ptr.1
				store i32 %value, i32 addrspace(1)* %out

				ret void
				}
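
The bounds constants and static offsets in the CHECK lines above appear consistent with a simple rule: the bound is the start offset of the array in the frame plus the array size in bytes, and a constant index maps to start offset plus index times element size. The sketch below only checks that arithmetic. The start offsets (4 for the i32 arrays, 8 for the i64 and i128 arrays) are inferred from the expected values in the test, not stated anywhere in the patch, and `expectedBounds` and `staticOffset` are hypothetical helper names.

```cpp
#include <cstdint>

// Hypothetical helper: frame size estimate = array start offset + array bytes.
constexpr uint32_t expectedBounds(uint32_t startOffset, uint32_t count,
                                  uint32_t elemBytes) {
  return startOffset + count * elemBytes;
}

// Hypothetical helper: byte offset of a constant index into the array.
constexpr uint32_t staticOffset(uint32_t startOffset, uint32_t index,
                                uint32_t elemBytes) {
  return startOffset + index * elemBytes;
}

// Bounds materialized by s_mov_b32 in the dynamic-offset tests.
static_assert(expectedBounds(4, 8192, 4) == 0x8004, "[8192 x i32]");
static_assert(expectedBounds(8, 4096, 8) == 0x8008, "[4096 x i64]");
static_assert(expectedBounds(8, 2048, 16) == 0x8008, "[2048 x i128]");

// Immediate offsets in the static tests on [256 x i32].
static_assert(staticOffset(4, 4, 4) == 20, "index 4, in bounds");
static_assert(staticOffset(4, 512, 4) == 2052, "index 512, out of bounds");
static_assert(staticOffset(4, 512, 4) >= expectedBounds(4, 256, 4),
              "2052 lies beyond the inferred 1028-byte frame");
```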