This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add optional bounds checking for scratch accesses
Abandoned · Public

Authored by critson on Apr 16 2019, 6:12 AM.

Details

Reviewers

arsenm
nhaehnle

Summary

Implement a pass, enabled by -amdgpu-scratch-bounds-checking,
which adds bounds checks to scratch accesses.
When this pass is enabled, out-of-bounds writes have no effect
and out-of-bounds reads return zero.
This is useful for GFX9, where hardware no longer performs
bounds checking on scratch accesses, so out-of-bounds accesses
generated by shaders result in page faults.

Change-Id: Id2ee4b1f32e70b6bde2541db755727b6a407721b
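
The semantics described in the summary can be modelled on the host. The following is only a minimal sketch: `CheckedScratch` and its members are hypothetical names invented for illustration, indices are in 32-bit words rather than bytes, and none of this is code from the patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of checked scratch memory (not patch code).
struct CheckedScratch {
  std::vector<uint32_t> Words;

  explicit CheckedScratch(std::size_t SizeInWords) : Words(SizeInWords, 0) {}

  // Out-of-bounds reads return zero.
  uint32_t load(std::size_t Index) const {
    return Index < Words.size() ? Words[Index] : 0;
  }

  // Out-of-bounds writes have no effect.
  void store(std::size_t Index, uint32_t Value) {
    if (Index < Words.size())
      Words[Index] = Value;
  }
};
```

In this model a store to an index past the end is dropped and a later load from that index yields zero, which is the behaviour the pass aims for on hardware that would otherwise page fault.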

Diff Detail

Event Timeline

critson created this revision. Apr 16 2019, 6:12 AM

Herald added a project: Restricted Project. Apr 16 2019, 6:12 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 6 others.

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined. I don't want to start trying to define it at an arbitrary point in the backend. Crashing on the invalid access is a much easier problem to debug. This seems to be a partial replacement for asan? If this is intended as some user visible semantic fix, that requires more thought and probably needs to be a new address space.

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
210–228	You can use TII::getAddNoCarry
243–244	I'm working on only allowing instructions that are terminators to modify exec, but this is introducing a new exec write in the middle of a block
328–329	This isn't going to catch scratch accesses without an MMO, you need to check the opcodes
331–332	I dislike building a worklist of all instructions in the function to handle. This shouldn't be hard to handle as you go through the function
341–342	This isn't going to be strong enough to guarantee that there will never be an out of bounds access. This approach also isn't going to work in a callable function, in the presence of calls, or variable sized stack objects (which I'm planning on implementing)
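
The worklist comment above (on lines 331–332) asks for accesses to be handled during the walk itself. A minimal sketch of that idea, assuming a flat list of instructions: `Instr` and `insertChecksInPlace` are hypothetical stand-ins, not LLVM types, and the real pass would also have to split basic blocks, which this sketch leaves out.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for a machine instruction; not an LLVM type.
struct Instr {
  std::string Name;
  bool AccessesScratch;
};

// Walk the block once and insert a check in front of each scratch access,
// without first collecting the accesses into a separate worklist.
// Returns the number of checks inserted.
int insertChecksInPlace(std::list<Instr> &Block) {
  int Inserted = 0;
  for (auto It = Block.begin(); It != Block.end(); ++It) {
    if (!It->AccessesScratch)
      continue;
    // std::list::insert places the new element before It and keeps It valid,
    // so the loop simply continues with the instruction after the access.
    Block.insert(It, Instr{"bounds_check", false});
    ++Inserted;
  }
  return Inserted;
}
```

The sketch relies on a container whose iterators stay valid across insertion; that property is what makes a separate worklist unnecessary here.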

If the goal is to have a semantically always dereferenceable stack pointer, I think we need to create a new addrspace. It would then be a no-op addrspacecast from an alloca, which the frontend desiring safe stack access would be responsible for inserting. We would then need to track the current global stack size in the ABI somewhere, and selection would need to insert this kind of bounds check code based on that.

critson added inline comments. May 7 2019, 7:55 AM

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
341–342	On that basis do you have any opinions on the appropriate method for establishing stack size going forward?

In D60772#1470073, @arsenm wrote:

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined.

Graphics APIs have "robust (buffer) access" settings in which out of bounds accesses must be guaranteed not to crash the program.

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
341–342	You could emit the stack size as a TargetGlobalAddress referring to a special symbol, and ask PAL to fixup the relocations for that symbol at load time. Details would have to be figured out, but that works as a rough approach.
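
The load-time fixup suggested above could look roughly like the following. This is a sketch under loud assumptions: `SizeReloc` and `applyScratchSizeRelocs` are hypothetical, the "code" is just a vector of 32-bit words with placeholder values, and nothing here reflects the actual PAL or ELF relocation formats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical relocation record: which 32-bit word of the code image
// must receive the final scratch size. Not a real PAL/ELF structure.
struct SizeReloc {
  std::size_t WordIndex;
};

// What a loader might do once the real scratch size is known:
// overwrite every placeholder word with that size.
void applyScratchSizeRelocs(std::vector<uint32_t> &Code,
                            const std::vector<SizeReloc> &Relocs,
                            uint32_t ScratchSize) {
  for (const SizeReloc &R : Relocs) {
    assert(R.WordIndex < Code.size() && "relocation outside code image");
    Code[R.WordIndex] = ScratchSize;
  }
}
```

The point of the design is that the compiler no longer needs to know the final size; it only records where the size is used and leaves the value to whoever loads the shader.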

Remove worklist usage.
Move from global to subtarget option.
Add second pass to resolve scratch size later in backend.
Add missing lit tests.

Harbormaster completed remote builds in B32617: Diff 201929. May 29 2019, 8:33 AM

critson marked 6 inline comments as done. May 29 2019, 8:38 AM

critson added inline comments.

lib/Target/AMDGPU/SIInsertScratchBounds.cpp
243–244	This code inserts new basic blocks such that only terminators modify exec, unless I missed something?
328–329	I don't think we can do anything sensible with instructions without MMO. They could be accessing any address space, so it would not be appropriate to apply scratch bounds checks to them.
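
For readers unfamiliar with exec masking, the per-lane effect of the VGPR-address path in the patch (V_CMP_LT_U32, S_AND_SAVEEXEC_B64, the access, S_MOV_B64 back to EXEC, then V_CNDMASK_B32) can be modelled on the host. This is an illustrative sketch only: `Wave` and `checkedLoad` are hypothetical names, addresses are word indices rather than byte offsets, and inactive lanes are simply reported as zero.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr unsigned WaveSize = 64;

// Hypothetical scalar model of one wave; not an LLVM or hardware type.
struct Wave {
  uint64_t Exec;                        // one bit per active lane
  std::array<uint32_t, WaveSize> Addr;  // per-lane scratch word index
};

std::array<uint32_t, WaveSize>
checkedLoad(Wave &W, const std::vector<uint32_t> &Scratch) {
  const uint64_t Size = Scratch.size();

  // Compare step: set a bit for each active lane whose address is in bounds.
  uint64_t Cond = 0;
  for (unsigned L = 0; L < WaveSize; ++L) {
    const bool Active = (W.Exec >> L) & 1;
    if (Active && W.Addr[L] < Size)
      Cond |= uint64_t(1) << L;
  }

  // Save EXEC, then restrict it to the in-bounds lanes.
  const uint64_t SavedExec = W.Exec;
  W.Exec &= Cond;

  // The access itself only runs for lanes still enabled in EXEC.
  std::array<uint32_t, WaveSize> Loaded{};
  for (unsigned L = 0; L < WaveSize; ++L)
    if ((W.Exec >> L) & 1)
      Loaded[L] = Scratch[W.Addr[L]];

  // Restore EXEC after the access.
  W.Exec = SavedExec;

  // Select step: keep the loaded value for in-bounds lanes, zero otherwise.
  std::array<uint32_t, WaveSize> Result{};
  for (unsigned L = 0; L < WaveSize; ++L)
    Result[L] = ((Cond >> L) & 1) ? Loaded[L] : 0;
  return Result;
}
```

Whether the EXEC writes count as mid-block or terminator writes is exactly the point under discussion in this thread; the model only shows the data flow, not the block structure.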

Ping.
I was wondering if I can get a second reading on this?
I believe I've addressed everything except the introduction of a new address space, as this seems somewhat heavyweight to me.
If the address space is considered absolutely necessary then I'll work on that next.

In D60772#1501198, @nhaehnle wrote:

In D60772#1470073, @arsenm wrote:

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined.

Graphics APIs have "robust (buffer) access" settings in which out of bounds accesses must be guaranteed not to crash the program.

And this is allowed to be set for stack objects?

lib/Target/AMDGPU/SIFixScratchSize.cpp
71 ↗	(On Diff #201929)	No c string functions

arsenm added inline comments. Jun 6 2019, 7:29 PM

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
896	I would expect this kind of handling to be done as part of selection, not a pretty late pass
lib/Target/AMDGPU/SIFixScratchSize.cpp
62 ↗	(On Diff #201929)	This whole pass doesn't work when there's dynamic stack usage?
65 ↗	(On Diff #201929)	Can early exit if there is no stack usage
68 ↗	(On Diff #201929)	Can't assume the instruction use

In D60772#1501198, @nhaehnle wrote:

In D60772#1470073, @arsenm wrote:

I don't understand the problem being solved here. Who/what is this intended to benefit? An out of bounds access is going to be undefined.

Graphics APIs have "robust (buffer) access" settings in which out of bounds accesses must be guaranteed not to crash the program.

There must be some restrictions on this? I don't think any kind of codegen pass like this will be strong enough to guarantee this. An IR optimization could have introduced a trap or something based on a detected invalid access. This probably won't happen today since things are overly conservative with non-0 address spaces, but this is something I want to fix.

This has been superseded by front-end work in the graphics compiler.

Herald added a subscriber: kerbowa. Oct 11 2020, 10:34 PM

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPU.h	4 lines
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	10 lines
lib/Target/AMDGPU/CMakeLists.txt	1 line
lib/Target/AMDGPU/SIInsertScratchBounds.cpp	371 lines

Diff 195360

lib/Target/AMDGPU/AMDGPU.h

[...]
  FunctionPass *createSIAddIMGInitPass();
  FunctionPass *createSIShrinkInstructionsPass();
  FunctionPass *createSILoadStoreOptimizerPass();
  FunctionPass *createSIWholeQuadModePass();
  FunctionPass *createSIFixControlFlowLiveIntervalsPass();
  FunctionPass *createSIOptimizeExecMaskingPreRAPass();
  FunctionPass *createSIFixSGPRCopiesPass();
  FunctionPass *createSIMemoryLegalizerPass();
+ FunctionPass *createSIInsertScratchBoundsPass();
  FunctionPass *createSIInsertWaitcntsPass();
  FunctionPass *createSIPreAllocateWWMRegsPass();
  FunctionPass *createSIFormMemoryClausesPass();
  FunctionPass *createAMDGPUSimplifyLibCallsPass(const TargetOptions &);
  FunctionPass *createAMDGPUUseNativeCallsPass();
  FunctionPass *createAMDGPUCodeGenPreparePass();
  FunctionPass *createAMDGPUMachineCFGStructurizerPass();
  FunctionPass *createAMDGPURewriteOutArgumentsPass();
[...]
  extern char &SILoadStoreOptimizerID;

  void initializeSIWholeQuadModePass(PassRegistry &);
  extern char &SIWholeQuadModeID;

  void initializeSILowerControlFlowPass(PassRegistry &);
  extern char &SILowerControlFlowID;

+ void initializeSIInsertScratchBoundsPass(PassRegistry &);
+ extern char &SIInsertScratchBoundsID;

  void initializeSIInsertSkipsPass(PassRegistry &);
  extern char &SIInsertSkipsPassID;

  void initializeSIOptimizeExecMaskingPass(PassRegistry &);
  extern char &SIOptimizeExecMaskingID;

  void initializeSIPreAllocateWWMRegsPass(PassRegistry &);
  extern char &SIPreAllocateWWMRegsID;
[...]

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[...] EnableDCEInRA("amdgpu-dce-in-ra",
    cl::desc("Enable machine DCE inside regalloc"));

  static cl::opt<bool> EnableScalarIRPasses(
    "amdgpu-scalar-ir-passes",
    cl::desc("Enable scalar IR passes"),
    cl::init(true),
    cl::Hidden);

+ // Enable scratch bounds checking
+ static cl::opt<bool> EnableScratchBoundsChecking(
+   "amdgpu-scratch-bounds-checking",
+   cl::desc("Enable scratch bounds checking"),
+   cl::init(false),
+   cl::Hidden);

  extern "C" void LLVMInitializeAMDGPUTarget() {
    // Register the target
    RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());
    RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());

    PassRegistry *PR = PassRegistry::getPassRegistry();
    initializeR600ClauseMergePassPass(*PR);
    initializeR600ControlFlowFinalizerPass(*PR);
[...] bool GCNPassConfig::addGlobalInstructionSelect() {
    addPass(new InstructionSelect());
    return false;
  }

  void GCNPassConfig::addPreRegAlloc() {
    if (LateCFGStructurize) {
      addPass(createAMDGPUMachineCFGStructurizerPass());
    }
+   if (EnableScratchBoundsChecking) {
+     addPass(createSIInsertScratchBoundsPass());
+   }
    addPass(createSIWholeQuadModePass());
  }

  void GCNPassConfig::addFastRegAlloc() {
    // FIXME: We have to disable the verifier here because of PHIElimination +
    // TwoAddressInstructions disabling it.

    // This must be run immediately after phi elimination and before
[...]

lib/Target/AMDGPU/CMakeLists.txt

[...] add_llvm_target(AMDGPUCodeGen
    SIAnnotateControlFlow.cpp
    SIFixSGPRCopies.cpp
    SIFixupVectorISel.cpp
    SIFixVGPRCopies.cpp
    SIPreAllocateWWMRegs.cpp
    SIFoldOperands.cpp
    SIFormMemoryClauses.cpp
    SIFrameLowering.cpp
+   SIInsertScratchBounds.cpp
    SIInsertSkips.cpp
    SIInsertWaitcnts.cpp
    SIInstrInfo.cpp
    SIISelLowering.cpp
    SILoadStoreOptimizer.cpp
    SILowerControlFlow.cpp
    SILowerI1Copies.cpp
    SIMachineFunctionInfo.cpp
[...]

lib/Target/AMDGPU/SIInsertScratchBounds.cpp

This file was added.

				//===- SIInsertScratchBounds.cpp - insert scratch bounds checks -===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// This pass inserts bounds checks on scratch accesses.
				/// Out-of-bounds reads return zero, and out-of-bounds writes have no effect.
				/// This is intended to be used on GFX9 where bounds checking is no longer
				/// performed by hardware and hence page faults can result from out-of-bounds
				/// accesses by shaders.
				///
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUSubtarget.h"
				#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
				#include "SIInstrInfo.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"

				#include <set>

				using namespace llvm;

				#define DEBUG_TYPE "si-insert-scratch-bounds"

				namespace {

				class SIInsertScratchBounds : public MachineFunctionPass {
				private:
				const GCNSubtarget *ST;
				const SIInstrInfo *TII;
				MachineRegisterInfo *MRI;
				const SIRegisterInfo *RI;
				std::vector<MachineInstr*> Worklist;

				public:
				static char ID;

				SIInsertScratchBounds() : MachineFunctionPass(ID) {}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				MachineFunctionPass::getAnalysisUsage(AU);
				}

				bool insertBoundsCheck(MachineFunction &MF, MachineInstr *MI,
				const int64_t ScratchSize,
				const unsigned SizeReg,
				bool &SizeUsed);

				bool runOnMachineFunction(MachineFunction &MF) override;
				};

				static void zeroReg(MachineBasicBlock &MBB, MachineRegisterInfo *MRI,
				const SIRegisterInfo *RI, const SIInstrInfo *TII,
				MachineBasicBlock::iterator &I, const DebugLoc &DL,
				unsigned Reg) {

				auto EndDstRC = MRI->getRegClass(Reg);
				uint32_t RegSize = RI->getRegSizeInBits(*EndDstRC) / 32;

				assert(RI->isVGPR(*MRI, Reg) && "can only zero VGPRs");

				if (RegSize == 1)
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_MOV_B32_e32), Reg).addImm(0);
				else {
				SmallVector<unsigned, 8> TRegs;
				for (unsigned i = 0; i < RegSize; ++i) {
				unsigned TReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_MOV_B32_e32), TReg).addImm(0);
				TRegs.push_back(TReg);
				}
				MachineInstrBuilder MIB =
				BuildMI(MBB, I, DL, TII->get(AMDGPU::REG_SEQUENCE), Reg);
				for (unsigned i = 0; i < RegSize; ++i) {
				MIB.addReg(TRegs[i]);
				MIB.addImm(RI->getSubRegFromChannel(i));
				}
				}
				}

				static void cndmask0Reg(MachineBasicBlock &MBB, MachineRegisterInfo *MRI,
				const SIRegisterInfo *RI, const SIInstrInfo *TII,
				MachineBasicBlock::iterator &I, const DebugLoc &DL,
				unsigned SrcReg, unsigned MaskReg, bool KillMask,
				unsigned DstReg) {

				auto EndDstRC = MRI->getRegClass(DstReg);
				uint32_t RegSize = RI->getRegSizeInBits(*EndDstRC) / 32;

				assert(RI->isVGPR(*MRI, DstReg) && "can only cndmask VGPRs");

				if (RegSize == 1)
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_CNDMASK_B32_e64), DstReg)
				.addImm(0)
				.addImm(0)
				.addImm(0)
				.addReg(SrcReg)
				.addReg(MaskReg, getKillRegState(KillMask));
				else {
				SmallVector<unsigned, 8> TRegs;
				for (unsigned i = 0; i < RegSize; ++i) {
				unsigned TReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
				BuildMI(MBB, I, DL, TII->get(AMDGPU::V_CNDMASK_B32_e64), TReg)
				.addImm(0)
				.addImm(0)
				.addImm(0)
				.addReg(SrcReg, 0, AMDGPU::sub0 + i)
				.addReg(MaskReg, getKillRegState(KillMask && (i == (RegSize - 1))));
				TRegs.push_back(TReg);
				}
				MachineInstrBuilder MIB =
				BuildMI(MBB, I, DL, TII->get(AMDGPU::REG_SEQUENCE), DstReg);
				for (unsigned i = 0; i < RegSize; ++i) {
				MIB.addReg(TRegs[i]);
				MIB.addImm(RI->getSubRegFromChannel(i));
				}
				}
				}

				} // end anonymous namespace

				INITIALIZE_PASS(SIInsertScratchBounds, DEBUG_TYPE,
				"SI Insert Scratch Bounds Checks",
				false, false)

				char SIInsertScratchBounds::ID = 0;

				char &llvm::SIInsertScratchBoundsID = SIInsertScratchBounds::ID;

				FunctionPass *llvm::createSIInsertScratchBoundsPass() {
				return new SIInsertScratchBounds;
				}

				bool SIInsertScratchBounds::insertBoundsCheck(MachineFunction &MF,
				MachineInstr *MI,
				const int64_t ScratchSize,
				const unsigned SizeReg,
				bool &SizeUsed) {
				const bool IsLoad = MI->mayLoad();
				DebugLoc DL = MI->getDebugLoc();

				const MachineOperand *Offset =
				TII->getNamedOperand(*MI, AMDGPU::OpName::offset);
				const MachineOperand *VAddr =
				TII->getNamedOperand(*MI, AMDGPU::OpName::vaddr);
				const MachineOperand *Addr =
				VAddr ? VAddr : TII->getNamedOperand(*MI, AMDGPU::OpName::saddr);

				if (!Addr || !Addr->isReg()) {
				// Constant offset -> determine bounds check statically
				if (Offset->getImm() >= ScratchSize) {
				// Statically out-of-bounds -> delete instruction
				if (IsLoad) {
				MachineBasicBlock *MBB = MI->getParent();
				MachineBasicBlock::iterator I(MI);
				MachineOperand &Dst = MI->getOperand(0);
				zeroReg(*MBB, MRI, RI, TII, I, DL, Dst.getReg());
				}
				MI->removeFromParent();
				return true;
				} else {
				// Statically in bounds
				return false;
				}
				}

				// Setup new block structure
				MachineBasicBlock *PreAccessBB = MI->getParent();
				MachineBasicBlock *ScratchAccessBB = MF.CreateMachineBasicBlock();
				MachineBasicBlock *PostAccessBB = MF.CreateMachineBasicBlock();

				MachineFunction::iterator MBBI(*PreAccessBB);
				++MBBI;

				MF.insert(MBBI, ScratchAccessBB);
				MF.insert(MBBI, PostAccessBB);

				ScratchAccessBB->addSuccessor(PostAccessBB);

				// Move instructions following scratch access to new basic block
				MachineBasicBlock::iterator SuccI(*MI);
				++SuccI;
				PostAccessBB->transferSuccessorsAndUpdatePHIs(PreAccessBB);
				PostAccessBB->splice(
				PostAccessBB->begin(), PreAccessBB, SuccI, PreAccessBB->end()
				);

				PreAccessBB->addSuccessor(ScratchAccessBB);

				// Move scratch access to its own basic block
				MI->removeFromParent();
				ScratchAccessBB->insertAfter(ScratchAccessBB->begin(), MI);

				MachineBasicBlock::iterator PreI = PreAccessBB->end();
				MachineBasicBlock::iterator PostI = PostAccessBB->begin();
				MachineBasicBlock::iterator ScratchI = ScratchAccessBB->end();
				unsigned AddrReg;
				bool KillAddr = false;

				assert(Addr && Addr->isReg());

				if (Offset && (Offset->getImm() > 0)) {
				AddrReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
				KillAddr = true;

				if (ST->hasAddNoCarry()) {
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::V_ADD_U32_e32), AddrReg)
				.addImm(Offset->getImm())
				.addReg(Addr->getReg());
				} else {
				const unsigned OffsetReg =
				MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
				const unsigned UnusedCarry =
				MRI->createVirtualRegister(&AMDGPU::SReg_64RegClass);

				MRI->setRegAllocationHint(UnusedCarry, 0, AMDGPU::VCC);

				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg)
				.addImm(Offset->getImm());
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::V_ADD_I32_e64), AddrReg)
				.addReg(UnusedCarry, RegState::Define | RegState::Dead)
				.addReg(Addr->getReg())
				.addReg(OffsetReg, RegState::Kill);
				}
				} else {
				AddrReg = Addr->getReg();
				}

				if (RI->isVGPR(*MRI, AddrReg)) {
				const unsigned CondReg
				= MRI->createVirtualRegister(&AMDGPU::SReg_64_XEXECRegClass);
				const unsigned ExecReg
				= MRI->createVirtualRegister(&AMDGPU::SReg_64_XEXECRegClass);

				BuildMI(*PreAccessBB, PreI, DL,
				TII->get(AMDGPU::V_CMP_LT_U32_e64), CondReg)
				.addReg(AddrReg, getKillRegState(KillAddr))
				.addReg(SizeReg);
				BuildMI(*PreAccessBB, PreI, DL,
				TII->get(AMDGPU::S_AND_SAVEEXEC_B64), ExecReg)
				.addReg(CondReg, getKillRegState(!IsLoad));
				BuildMI(*ScratchAccessBB, ScratchI, DL,
				TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
				.addReg(ExecReg, RegState::Kill);

				if (IsLoad) {
				MachineOperand &Dst = MI->getOperand(0);
				const unsigned DstReg = Dst.getReg();
				const TargetRegisterClass *DstRC = MRI->getRegClass(DstReg);
				const unsigned LoadDstReg = MRI->createVirtualRegister(DstRC);

				Dst.setReg(LoadDstReg);

				cndmask0Reg(*PostAccessBB, MRI, RI, TII, PostI, DL,
				LoadDstReg, CondReg, true, DstReg);
				}
				} else {
				if (MI->mayLoad()) {
				// Load -> scalar comparison, then load, else load zero
				MachineBasicBlock *OutOfBoundsBB = MF.CreateMachineBasicBlock();
				MachineBasicBlock::iterator OOBI = OutOfBoundsBB->end();

				MBBI--;
				MF.insert(MBBI, OutOfBoundsBB);
				OutOfBoundsBB->addSuccessor(PostAccessBB);
				PreAccessBB->addSuccessor(OutOfBoundsBB);

				// TODO: mark SCC as clobbered?
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CMP_LT_U32))
				.addReg(AddrReg, getKillRegState(KillAddr))
				.addReg(SizeReg);
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CBRANCH_SCC0))
				.addMBB(OutOfBoundsBB);

				BuildMI(*ScratchAccessBB, ScratchI, DL, TII->get(AMDGPU::S_BRANCH))
				.addMBB(PostAccessBB);

				MachineOperand &Dst = MI->getOperand(0);
				const unsigned DstReg = Dst.getReg();

				const TargetRegisterClass *DstRC = MRI->getRegClass(DstReg);
				const unsigned LoadDstReg = MRI->createVirtualRegister(DstRC);
				const unsigned ZeroDstReg = MRI->createVirtualRegister(DstRC);

				zeroReg(*OutOfBoundsBB, MRI, RI, TII, OOBI, DL, ZeroDstReg);

				BuildMI(*PostAccessBB, PostI, DL, TII->get(TargetOpcode::PHI), DstReg)
				.addReg(LoadDstReg)
				.addMBB(ScratchAccessBB)
				.addReg(ZeroDstReg)
				.addMBB(OutOfBoundsBB);

				Dst.setReg(LoadDstReg);
				} else {
				// Store -> scalar comparison and skip store
				// TODO: mark SCC as clobbered?
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CMP_LT_U32))
				.addReg(AddrReg, getKillRegState(KillAddr))
				.addReg(SizeReg);
				BuildMI(*PreAccessBB, PreI, DL, TII->get(AMDGPU::S_CBRANCH_SCC0))
				.addMBB(PostAccessBB);
				PreAccessBB->addSuccessor(PostAccessBB);
				}
				}

				SizeUsed = true;
				return true;
				}

				bool SIInsertScratchBounds::runOnMachineFunction(MachineFunction &MF) {
				bool Changed = false;

				ST = &MF.getSubtarget<GCNSubtarget>();
				TII = ST->getInstrInfo();
				MRI = &MF.getRegInfo();
				RI = ST->getRegisterInfo();

				Worklist.clear();

				for (MachineBasicBlock &MBB : MF) {
				for (MachineInstr &MI : MBB) {
				if (MI.mayLoad() || MI.mayStore()) {
				for (const auto &MMO : MI.memoperands()) {
				const unsigned AddrSpace = MMO->getPointerInfo().getAddrSpace();
				if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
				// uses scratch; needs to be processed
				Worklist.push_back(&MI);
				break;
				}
				}
				}
				}
				}

				if (!Worklist.empty()) {
				const MachineFrameInfo &FrameInfo = MF.getFrameInfo();
				const int64_t ScratchSizeEstimate =
				(int64_t) FrameInfo.estimateStackSize(MF);

				const unsigned SizeReg =
				MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
				bool SizeUsed = false;

				for (MachineInstr *MI : Worklist) {
				Changed |= insertBoundsCheck(
				MF, MI, ScratchSizeEstimate, SizeReg, SizeUsed
				);
				}

				// If scratch size is required then add to prelude
				if (SizeUsed) {
				MachineBasicBlock *PreludeBB = &MF.front();
				MachineBasicBlock::iterator PreludeI = PreludeBB->begin();
				DebugLoc UnknownDL;

				BuildMI(*PreludeBB, PreludeI, UnknownDL,
				TII->get(AMDGPU::S_MOV_B32), SizeReg)
				.addImm(ScratchSizeEstimate);

				Changed = true;
				}

				Worklist.clear();
				}

				return Changed;
				}