This is an archive of the discontinued LLVM Phabricator instance.

Differential D35967

[AMDGPU] Collapse adjacent SI_END_CF
Closed · Public

Authored by rampitec on Jul 27 2017, 4:41 PM.

Details

Reviewers

arsenm

Commits

rG37e7f959c0a3: [AMDGPU] Collapse adjacent SI_END_CF
rL309762: [AMDGPU] Collapse adjacent SI_END_CF

Summary

Add a pass to remove redundant S_OR_B64 instructions that re-enable lanes in
exec. If two SI_END_CF (lowered as S_OR_B64) come together without any
vector instructions between them, we can keep only the outer SI_END_CF, given
that the CFG is structured and the exec mask restored by the outer end
statement is always a superset of the mask restored by the inner one.

This needs to be done before RA to eliminate the registers holding the saved
exec bits, but after the register coalescer so that no vector register copies
remain between different end cf statements.
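
The argument in the summary (the outer saved mask always covers the inner one) can be checked on a toy model. The following is a minimal C++ sketch, not code from this patch: `ExecMask`, `enterIf`, `endCF` and `runNested` are hypothetical names, and a 64-bit integer stands in for the exec register of a 64-lane wave.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: one bit per lane. None of these names exist in LLVM.
using ExecMask = std::uint64_t;

struct IfResult {
  ExecMask Saved; // copy of exec taken on entry to the region
  ExecMask Exec;  // exec while inside the region
};

// Models the region entry: save exec, then mask it with the condition.
constexpr IfResult enterIf(ExecMask Exec, ExecMask Cond) {
  return {Exec, Exec & Cond};
}

// Models SI_END_CF lowered as S_OR_B64 exec, exec, saved.
constexpr ExecMask endCF(ExecMask Exec, ExecMask Saved) { return Exec | Saved; }

// Two nested ifs whose end points are adjacent (no vector code in between).
// KeepInnerEndCF selects whether the inner S_OR_B64 is executed at all.
ExecMask runNested(ExecMask Entry, ExecMask OuterCond, ExecMask InnerCond,
                   bool KeepInnerEndCF) {
  IfResult Outer = enterIf(Entry, OuterCond);
  IfResult Inner = enterIf(Outer.Exec, InnerCond);
  // Structured nesting: every lane saved by the inner if is also saved by
  // the outer one.
  assert((Inner.Saved & ~Outer.Saved) == 0);
  ExecMask Exec = Inner.Exec;
  if (KeepInnerEndCF)
    Exec = endCF(Exec, Inner.Saved);
  return endCF(Exec, Outer.Saved);
}
```

In this model `runNested` returns the entry mask whether or not the inner end is kept, because OR-ing with the outer saved mask already sets every bit the inner one would have set.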

Event Timeline

rampitec created this revision. Jul 27 2017, 4:41 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Jul 27 2017, 4:41 PM

arsenm added inline comments. Jul 27 2017, 7:14 PM

lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82	I would hope we aren't seeing s_mov_b64s with register inputs at this point. Does this actually happen?
114	Missing skipFunction()
129	isBranch check first. I'm not sure why this needs to specifically skip branches though
153–155	Since you don't seem to be using LIS for anything, could you move this out of the loop so all of the updates are done at once after you're done modifying the uses?

Addressed review comments.

lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82	Yes. That is what we actually have at this point:
	64B	%vreg1<def> = COPY %EXEC, %EXEC<imp-def>; SReg_64:%vreg1
	80B	%vreg55<def> = S_AND_B64 %vreg1, %vreg12, %SCC<imp-def,dead>; SReg_64:%vreg55,%vreg1,%vreg12
	96B	%EXEC<def> = S_MOV_B64_term %vreg55; SReg_64:%vreg55
129	It actually breaks if the outer end_cf is not the immediate layout successor, because a branch does not read exec (at least an unconditional branch does not). As far as I understand, when the CFG scopes are not simply nested, block placement can then result in a wrong mask at the end.

rampitec added a child revision: D36007: [AMDGPU] Turn s_and_saveexec_b64 into s_and_b64 if result is unused. Jul 28 2017, 10:46 AM

arsenm added inline comments. Jul 28 2017, 11:35 AM

lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82	This won't catch that though. S_MOV_B64_term is different
129	Should this be using isLayoutSuccessor then?

rampitec added inline comments. Jul 28 2017, 11:39 AM

lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82	OK, I really meant to catch COPY; catching the _term variant was not the intention. Do you want to have no switch here?
129	I'm scanning instructions anyway, I believe either way it should work.
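
The block-scanning condition discussed in the comments on line 129 can be sketched on a toy model. This is a hedged illustration rather than the patch itself: `ToyInstr`, `ToyBlock`, `canDropLeadingEndCF` and `collapsibleWith` are hypothetical names, and the boolean fields stand in for the `isSALU`, `readsRegister(EXEC)`, `isBranch`, `isEndCF` and `getOrExecSource` checks in the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration; these are not LLVM types.
struct ToyInstr {
  bool IsSALU;    // scalar ALU instruction
  bool ReadsExec; // has exec as an input
  bool IsBranch;  // any branch, conditional or not
};

struct ToyBlock {
  bool StartsWithEndCF;       // first instruction is the lowered SI_END_CF
  bool SavedMaskIsExecCopy;   // its saved mask is defined by a full copy of exec
  std::vector<ToyInstr> Rest; // instructions after the leading end_cf
  const ToyBlock *OnlySucc;   // non-null only if there is exactly one successor
};

// True if the end_cf that starts B is made redundant by the end_cf that
// starts its only successor, following the checks in this revision.
bool canDropLeadingEndCF(const ToyBlock &B) {
  if (!B.StartsWithEndCF || B.OnlySucc == nullptr)
    return false;
  // Everything after the inner end_cf must be scalar, must not read exec,
  // and must not be a branch.
  for (const ToyInstr &I : B.Rest)
    if (!I.IsSALU || I.ReadsExec || I.IsBranch)
      return false;
  const ToyBlock &Succ = *B.OnlySucc;
  return Succ.StartsWithEndCF && Succ.SavedMaskIsExecCopy;
}

// Builds "inner end -> outer end" with one instruction in between and asks
// whether the inner end_cf could be dropped.
bool collapsibleWith(ToyInstr Middle) {
  ToyBlock Outer{true, true, {}, nullptr};
  ToyBlock Inner{true, true, {Middle}, &Outer};
  assert(Inner.OnlySucc == &Outer);
  return canDropLeadingEndCF(Inner);
}
```

For example, `collapsibleWith(ToyInstr{true, false, false})` models a plain scalar instruction between the two ends and is accepted, while a branch in the same position is rejected. The layout-successor check raised in the thread would replace the branch test with a condition on block order, which this sketch does not model.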

Addressed review comments.

Ping

LGTM

This revision is now accepted and ready to land. Aug 1 2017, 3:49 PM

Closed by commit rL309762: [AMDGPU] Collapse adjacent SI_END_CF (authored by rampitec). Aug 1 2017, 4:15 PM

This revision was automatically updated to reflect the committed changes.

arsenm added inline comments. Aug 1 2017, 5:17 PM

llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150 ↗	(On Diff #109249)	We should also be stripping out exec modifications with no VALU instructions before s_endpgm
150 ↗	(On Diff #109249)	Any scalar instruction really

rampitec added inline comments. Aug 1 2017, 5:33 PM

llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150 ↗	(On Diff #109249)	Not a scalar store. Also not sure about barriers and waits.

rampitec added inline comments. Aug 1 2017, 5:35 PM

llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150 ↗	(On Diff #109249)	And then not any instructions contributing to that scalar store.

nhaehnle added inline comments. Aug 2 2017, 8:38 AM

llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150 ↗	(On Diff #109249)	Please keep SI_RETURN_TO_EPILOG in mind. For non-monolithic graphics shaders, returning with the correct EXEC mask is important (and we can't just dead-code-eliminate SALU instructions either). Basically, as long as it's for S_ENDPGM only, it should be okay.
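
The constraints raised in this thread can be read conservatively as: only a trailing run of plain scalar instructions directly before S_ENDPGM is a candidate for stripping. The sketch below encodes that reading on a toy model. It is an assumption drawn from the comments, not an implemented pass, and `ToyKind`, `ToyTerminator` and `countStrippableTail` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction classes for illustration; not LLVM opcodes.
enum class ToyKind { SALU, ExecWrite, VALU, ScalarStore, Barrier, Wait };
enum class ToyTerminator { EndPgm, ReturnToEpilog };

// Counts how many instructions at the end of Body could be stripped under
// the conservative reading above. Anything at or before the last VALU,
// scalar store, barrier or wait is kept, which also keeps every instruction
// that might feed a scalar store.
std::size_t countStrippableTail(const std::vector<ToyKind> &Body,
                                ToyTerminator Term) {
  // Returning to an epilog needs the correct exec mask, so nothing is dead.
  if (Term != ToyTerminator::EndPgm)
    return 0;
  std::size_t Count = 0;
  for (auto It = Body.rbegin(); It != Body.rend(); ++It) {
    if (*It != ToyKind::SALU && *It != ToyKind::ExecWrite)
      break;
    ++Count;
  }
  assert(Count <= Body.size());
  return Count;
}
```

Stopping at the first non-scalar instruction from the end is deliberately stricter than the thread requires. A real implementation would need actual data-dependence information to do better.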

rampitec mentioned this in D129073: [AMDGPU] Combine s_or_saveexec, s_xor instructions. Jul 7 2022, 11:42 AM

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPU.h	4 lines
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	4 lines
lib/Target/AMDGPU/CMakeLists.txt	1 line
lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp	160 lines
test/CodeGen/AMDGPU/collapse-endcf.ll	188 lines

Diff 108550

lib/Target/AMDGPU/AMDGPU.h

  ...
  FunctionPass *createSIAnnotateControlFlowPass();
  FunctionPass *createSIFoldOperandsPass();
  FunctionPass *createSIPeepholeSDWAPass();
  FunctionPass *createSILowerI1CopiesPass();
  FunctionPass *createSIShrinkInstructionsPass();
  FunctionPass *createSILoadStoreOptimizerPass();
  FunctionPass *createSIWholeQuadModePass();
  FunctionPass *createSIFixControlFlowLiveIntervalsPass();
+ FunctionPass *createSIOptimizeExecMaskingPreRAPass();
  FunctionPass *createSIFixSGPRCopiesPass();
  FunctionPass *createSIMemoryLegalizerPass();
  FunctionPass *createSIDebuggerInsertNopsPass();
  FunctionPass *createSIInsertWaitsPass();
  FunctionPass *createSIInsertWaitcntsPass();
  FunctionPass *createAMDGPUCodeGenPreparePass();
  FunctionPass *createAMDGPUMachineCFGStructurizerPass();

  ...

  ModulePass* createAMDGPUUnifyMetadataPass();
  void initializeAMDGPUUnifyMetadataPass(PassRegistry&);
  extern char &AMDGPUUnifyMetadataID;

  void initializeSIFixControlFlowLiveIntervalsPass(PassRegistry&);
  extern char &SIFixControlFlowLiveIntervalsID;

+ void initializeSIOptimizeExecMaskingPreRAPass(PassRegistry&);
+ extern char &SIOptimizeExecMaskingPreRAID;
+
  void initializeAMDGPUAnnotateUniformValuesPass(PassRegistry&);
  extern char &AMDGPUAnnotateUniformValuesPassID;

  void initializeAMDGPUCodeGenPreparePass(PassRegistry&);
  extern char &AMDGPUCodeGenPrepareID;

  void initializeSIAnnotateControlFlowPass(PassRegistry&);
  extern char &SIAnnotateControlFlowPassID;
  ...

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

  ...
  extern "C" void LLVMInitializeAMDGPUTarget() {
  ...
    PassRegistry *PR = PassRegistry::getPassRegistry();
    initializeSILowerI1CopiesPass(*PR);
    initializeSIFixSGPRCopiesPass(*PR);
    initializeSIFixVGPRCopiesPass(*PR);
    initializeSIFoldOperandsPass(*PR);
    initializeSIPeepholeSDWAPass(*PR);
    initializeSIShrinkInstructionsPass(*PR);
    initializeSIFixControlFlowLiveIntervalsPass(*PR);
+   initializeSIOptimizeExecMaskingPreRAPass(*PR);
    initializeSILoadStoreOptimizerPass(*PR);
    initializeAMDGPUAlwaysInlinePass(*PR);
    initializeAMDGPUAnnotateKernelFeaturesPass(*PR);
    initializeAMDGPUAnnotateUniformValuesPass(*PR);
    initializeAMDGPULowerIntrinsicsPass(*PR);
    initializeAMDGPUPromoteAllocaPass(*PR);
    initializeAMDGPUCodeGenPreparePass(*PR);
    initializeAMDGPUUnifyMetadataPass(*PR);
  ...
  void GCNPassConfig::addFastRegAlloc(FunctionPass *RegAllocPass) {
  ...
    // TwoAddressInstructions, otherwise the processing of the tied operand of
    // SI_ELSE will introduce a copy of the tied operand source after the else.
    insertPass(&PHIEliminationID, &SILowerControlFlowID, false);

    TargetPassConfig::addFastRegAlloc(RegAllocPass);
  }

  void GCNPassConfig::addOptimizedRegAlloc(FunctionPass *RegAllocPass) {
+   if (getOptLevel() > CodeGenOpt::None)
+     insertPass(&MachineSchedulerID, &SIOptimizeExecMaskingPreRAID);
+
    // This needs to be run directly before register allocation because earlier
    // passes might recompute live intervals.
    insertPass(&MachineSchedulerID, &SIFixControlFlowLiveIntervalsID);

    // This must be run immediately after phi elimination and before
    // TwoAddressInstructions, otherwise the processing of the tied operand of
    // SI_ELSE will introduce a copy of the tied operand source after the else.
    insertPass(&PHIEliminationID, &SILowerControlFlowID, false);
  ...

lib/Target/AMDGPU/CMakeLists.txt

  ...
  add_llvm_target(AMDGPUCodeGen
  ...
    SIISelLowering.cpp
    SILoadStoreOptimizer.cpp
    SILowerControlFlow.cpp
    SILowerI1Copies.cpp
    SIMachineFunctionInfo.cpp
    SIMachineScheduler.cpp
    SIMemoryLegalizer.cpp
    SIOptimizeExecMasking.cpp
+   SIOptimizeExecMaskingPreRA.cpp
    SIPeepholeSDWA.cpp
    SIRegisterInfo.cpp
    SIShrinkInstructions.cpp
    SIWholeQuadMode.cpp
    GCNIterativeScheduler.cpp
    GCNMinRegStrategy.cpp
    GCNRegPressure.cpp
    ${GLOBAL_ISEL_BUILD_FILES}
    )

  add_subdirectory(AsmParser)
  add_subdirectory(InstPrinter)
  add_subdirectory(Disassembler)
  add_subdirectory(TargetInfo)
  add_subdirectory(MCTargetDesc)
  add_subdirectory(Utils)

lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp

This file was added.

				//===-- SIOptimizeExecMaskingPreRA.cpp ------------------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// \brief This pass removes redundant S_OR_B64 instructions enabling lanes in
				/// the exec. If two SI_END_CF (lowered as S_OR_B64) come together without any
				/// vector instructions between them we can only keep outer SI_END_CF, given
				/// that CFG is structured and exec bits of the outer end statement are always
				/// not less than exec bit of the inner one.
				///
				/// This needs to be done before the RA to eliminate saved exec bits registers
				/// but after register coalescer to have no vector registers copies in between
				/// of different end cf statements.
				///
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUSubtarget.h"
				#include "SIInstrInfo.h"
				#include "llvm/CodeGen/LiveIntervalAnalysis.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"

				using namespace llvm;

				#define DEBUG_TYPE "si-optimize-exec-masking-pre-ra"

				namespace {

				class SIOptimizeExecMaskingPreRA : public MachineFunctionPass {
				public:
				static char ID;

				public:
				SIOptimizeExecMaskingPreRA() : MachineFunctionPass(ID) {
				initializeSIOptimizeExecMaskingPreRAPass(*PassRegistry::getPassRegistry());
				}

				bool runOnMachineFunction(MachineFunction &MF) override;

				StringRef getPassName() const override {
				return "SI optimize exec mask operations pre-RA";
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<LiveIntervals>();
				AU.setPreservesAll();
				MachineFunctionPass::getAnalysisUsage(AU);
				}
				};

				} // End anonymous namespace.

				INITIALIZE_PASS_BEGIN(SIOptimizeExecMaskingPreRA, DEBUG_TYPE,
				"SI optimize exec mask operations pre-RA", false, false)
				INITIALIZE_PASS_DEPENDENCY(LiveIntervals)
				INITIALIZE_PASS_END(SIOptimizeExecMaskingPreRA, DEBUG_TYPE,
				"SI optimize exec mask operations pre-RA", false, false)

				char SIOptimizeExecMaskingPreRA::ID = 0;

				char &llvm::SIOptimizeExecMaskingPreRAID = SIOptimizeExecMaskingPreRA::ID;

				FunctionPass *llvm::createSIOptimizeExecMaskingPreRAPass() {
				return new SIOptimizeExecMaskingPreRA();
				}

				static bool isEndCF(const MachineInstr& MI, const SIRegisterInfo* TRI) {
				return MI.getOpcode() == AMDGPU::S_OR_B64 &&
				MI.modifiesRegister(AMDGPU::EXEC, TRI);
				}

				static bool isFullExecCopy(const MachineInstr& MI) {
				switch (MI.getOpcode()) {
				default:
				break;
				case AMDGPU::S_MOV_B64:
				case AMDGPU::COPY:
				return MI.getOperand(1).isReg() &&
				MI.getOperand(1).getReg() == AMDGPU::EXEC &&
				!MI.getOperand(1).getSubReg();
				}
				return false;
				}

				static unsigned getOrNonExecReg(const MachineInstr &MI,
				const SIInstrInfo &TII) {
				auto Op = TII.getNamedOperand(MI, AMDGPU::OpName::src1);
				if (Op->isReg() && Op->getReg() != AMDGPU::EXEC)
				return Op->getReg();
				Op = TII.getNamedOperand(MI, AMDGPU::OpName::src0);
				if (Op->isReg() && Op->getReg() != AMDGPU::EXEC)
				return Op->getReg();
				return AMDGPU::NoRegister;
				}

				static MachineInstr* getOrExecSource(const MachineInstr &MI,
				const SIInstrInfo &TII,
				const MachineRegisterInfo &MRI) {
				auto SavedExec = getOrNonExecReg(MI, TII);
				if (SavedExec == AMDGPU::NoRegister)
				return nullptr;
				auto SaveExecInst = MRI.getUniqueVRegDef(SavedExec);
				if (!SaveExecInst || !isFullExecCopy(*SaveExecInst))
				return nullptr;
				return SaveExecInst;
				}

				bool SIOptimizeExecMaskingPreRA::runOnMachineFunction(MachineFunction &MF) {
				const SISubtarget &ST = MF.getSubtarget<SISubtarget>();
				const SIRegisterInfo *TRI = ST.getRegisterInfo();
				const SIInstrInfo *TII = ST.getInstrInfo();
				MachineRegisterInfo &MRI = MF.getRegInfo();
				LiveIntervals *LIS = &getAnalysis<LiveIntervals>();
				bool Changed = false;

				for (MachineBasicBlock &MBB : MF) {
				auto Lead = MBB.begin(), E = MBB.end();
				if (MBB.succ_size() != 1 || Lead == E || !isEndCF(*Lead, TRI))
				continue;
				auto I = std::next(Lead);

				for ( ; I != E; ++I) {
				if (!TII->isSALU(*I) || I->readsRegister(AMDGPU::EXEC, TRI) ||
				I->isBranch())
				break;
				}

				if (I != E)
				continue;

				const MachineBasicBlock* Succ = *MBB.succ_begin();
				const auto NextLead = Succ->begin();
				if (NextLead == Succ->end() || !isEndCF(*NextLead, TRI) ||
				!getOrExecSource(*NextLead, *TII, MRI))
				continue;

				DEBUG(dbgs() << "Redundant EXEC = S_OR_B64 found: " << *Lead << '\n');

				unsigned SaveExecReg = getOrNonExecReg(*Lead, *TII);
				LIS->RemoveMachineInstrFromMaps(*Lead);
				Lead->eraseFromParent();
				if (SaveExecReg) {
				LIS->removeInterval(SaveExecReg);
				LIS->createAndComputeVirtRegInterval(SaveExecReg);
				}
				// Recompute liveness for both reg units of exec.
				LIS->removeRegUnit(*MCRegUnitIterator(AMDGPU::EXEC_LO, TRI));
				LIS->removeRegUnit(*MCRegUnitIterator(AMDGPU::EXEC_HI, TRI));

				Changed = true;
				}

				return Changed;
				}

test/CodeGen/AMDGPU/collapse-endcf.ll

This file was added.

				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

				; GCN-LABEL: {{^}}simple_nested_if:
				; GCN: s_and_saveexec_b64 [[SAVEEXEC:s\[[0-9:]+\]]]
				; GCN-NEXT: ; mask branch [[ENDIF:BB[0-9_]+]]
				; GCN-NEXT: s_cbranch_execz [[ENDIF]]
				; GCN: s_and_saveexec_b64
				; GCN-NEXT: ; mask branch [[ENDIF]]
				; GCN-NEXT: {{^BB[0-9_]+}}:
				; GCN: store_dword
				; GCN-NEXT: {{^}}[[ENDIF]]:
				; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC]]
				; GCN-NEXT: s_endpgm
				define amdgpu_kernel void @simple_nested_if(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp1 = icmp ugt i32 %tmp, 1
				br i1 %tmp1, label %bb.outer.then, label %bb.outer.end

				bb.outer.then: ; preds = %bb
				%tmp4 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp
				store i32 0, i32 addrspace(1)* %tmp4, align 4
				%tmp5 = icmp eq i32 %tmp, 2
				br i1 %tmp5, label %bb.outer.end, label %bb.inner.then

				bb.inner.then: ; preds = %bb.outer.then
				%tmp7 = add i32 %tmp, 1
				%tmp9 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp7
				store i32 1, i32 addrspace(1)* %tmp9, align 4
				br label %bb.outer.end

				bb.outer.end: ; preds = %bb.outer.then, %bb.inner.then, %bb
				ret void
				}

				; GCN-LABEL: {{^}}uncollapsable_nested_if:
				; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]
				; GCN-NEXT: ; mask branch [[ENDIF_OUTER:BB[0-9_]+]]
				; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER]]
				; GCN: s_and_saveexec_b64 [[SAVEEXEC_INNER:s\[[0-9:]+\]]]
				; GCN-NEXT: ; mask branch [[ENDIF_INNER:BB[0-9_]+]]
				; GCN-NEXT: {{^BB[0-9_]+}}:
				; GCN: store_dword
				; GCN-NEXT: {{^}}[[ENDIF_INNER]]:
				; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER]]
				; GCN: store_dword
				; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:
				; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]
				; GCN-NEXT: s_endpgm
				define amdgpu_kernel void @uncollapsable_nested_if(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp1 = icmp ugt i32 %tmp, 1
				br i1 %tmp1, label %bb.outer.then, label %bb.outer.end

				bb.outer.then: ; preds = %bb
				%tmp4 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp
				store i32 0, i32 addrspace(1)* %tmp4, align 4
				%tmp5 = icmp eq i32 %tmp, 2
				br i1 %tmp5, label %bb.inner.end, label %bb.inner.then

				bb.inner.then: ; preds = %bb.outer.then
				%tmp7 = add i32 %tmp, 1
				%tmp8 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp7
				store i32 1, i32 addrspace(1)* %tmp8, align 4
				br label %bb.inner.end

				bb.inner.end: ; preds = %bb.inner.then, %bb.outer.then
				%tmp9 = add i32 %tmp, 2
				%tmp10 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp9
				store i32 2, i32 addrspace(1)* %tmp10, align 4
				br label %bb.outer.end

				bb.outer.end: ; preds = %bb.inner.then, %bb
				ret void
				}

				; GCN-LABEL: {{^}}nested_if_if_else:
				; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]
				; GCN-NEXT: ; mask branch [[ENDIF_OUTER:BB[0-9_]+]]
				; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER]]
				; GCN: s_and_saveexec_b64 [[SAVEEXEC_INNER:s\[[0-9:]+\]]]
				; GCN-NEXT: s_xor_b64 [[SAVEEXEC_INNER2:s\[[0-9:]+\]]], exec, [[SAVEEXEC_INNER]]
				; GCN-NEXT: ; mask branch [[THEN_INNER:BB[0-9_]+]]
				; GCN-NEXT: {{^BB[0-9_]+}}:
				; GCN: store_dword
				; GCN-NEXT: {{^}}[[THEN_INNER]]:
				; GCN-NEXT: s_or_saveexec_b64 [[SAVEEXEC_INNER3:s\[[0-9:]+\]]], [[SAVEEXEC_INNER2]]
				; GCN-NEXT: s_xor_b64 exec, exec, [[SAVEEXEC_INNER3]]
				; GCN-NEXT: ; mask branch [[ENDIF_OUTER]]
				; GCN: store_dword
				; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:
				; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]
				; GCN-NEXT: s_endpgm
				define amdgpu_kernel void @nested_if_if_else(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp1 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp
				store i32 0, i32 addrspace(1)* %tmp1, align 4
				%tmp2 = icmp ugt i32 %tmp, 1
				br i1 %tmp2, label %bb.outer.then, label %bb.outer.end

				bb.outer.then: ; preds = %bb
				%tmp5 = icmp eq i32 %tmp, 2
				br i1 %tmp5, label %bb.then, label %bb.else

				bb.then: ; preds = %bb.outer.then
				%tmp3 = add i32 %tmp, 1
				%tmp4 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp3
				store i32 1, i32 addrspace(1)* %tmp4, align 4
				br label %bb.outer.end

				bb.else: ; preds = %bb.outer.then
				%tmp7 = add i32 %tmp, 2
				%tmp9 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp7
				store i32 2, i32 addrspace(1)* %tmp9, align 4
				br label %bb.outer.end

				bb.outer.end: ; preds = %bb, %bb.then, %bb.else
				ret void
				}

				; GCN-LABEL: {{^}}nested_if_else_if:
				; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]
				; GCN-NEXT: s_xor_b64 [[SAVEEXEC_OUTER2:s\[[0-9:]+\]]], exec, [[SAVEEXEC_OUTER]]
				; GCN-NEXT: ; mask branch [[THEN_OUTER:BB[0-9_]+]]
				; GCN-NEXT: s_cbranch_execz [[THEN_OUTER]]
				; GCN-NEXT: {{^BB[0-9_]+}}:
				; GCN: store_dword
				; GCN-NEXT: s_and_saveexec_b64 [[SAVEEXEC_INNER_IF_OUTER_ELSE:s\[[0-9:]+\]]]
				; GCN-NEXT: ; mask branch [[THEN_OUTER_FLOW:BB[0-9_]+]]
				; GCN-NEXT: {{^BB[0-9_]+}}:
				; GCN: store_dword
				; GCN-NEXT: {{^}}[[THEN_OUTER_FLOW]]:
				; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER_IF_OUTER_ELSE]]
				; GCN-NEXT: {{^}}[[THEN_OUTER]]:
				; GCN-NEXT: s_or_saveexec_b64 [[SAVEEXEC_OUTER3:s\[[0-9:]+\]]], [[SAVEEXEC_OUTER2]]
				; GCN-NEXT: s_xor_b64 exec, exec, [[SAVEEXEC_OUTER3]]
				; GCN-NEXT: ; mask branch [[ENDIF_OUTER:BB[0-9_]+]]
				; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER]]
				; GCN-NEXT: {{^BB[0-9_]+}}:
				; GCN: store_dword
				; GCN-NEXT: s_and_saveexec_b64 [[SAVEEXEC_INNER_IF_OUTER_THEN:s\[[0-9:]+\]]]
				; GCN-NEXT: ; mask branch [[ENDIF_INNER_OUTER_THEN:BB[0-9_]+]]
				; GCN-NEXT: {{^BB[0-9_]+}}:
				; GCN: store_dword
				; GCN-NEXT: {{^}}[[ENDIF_INNER_OUTER_THEN]]:
				; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER_IF_OUTER_THEN]]
				; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:
				; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER3]]
				; GCN-NEXT: s_endpgm
				define amdgpu_kernel void @nested_if_else_if(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp1 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %tmp
				store i32 0, i32 addrspace(1)* %tmp1, align 4
				%cc1 = icmp ugt i32 %tmp, 1
				br i1 %cc1, label %bb.outer.then, label %bb.outer.else

				bb.outer.then:
				%tmp2 = getelementptr inbounds i32, i32 addrspace(1)* %tmp1, i32 1
				store i32 1, i32 addrspace(1)* %tmp2, align 4
				%cc2 = icmp eq i32 %tmp, 2
				br i1 %cc2, label %bb.inner.then, label %bb.outer.end

				bb.inner.then:
				%tmp3 = getelementptr inbounds i32, i32 addrspace(1)* %tmp1, i32 2
				store i32 2, i32 addrspace(1)* %tmp3, align 4
				br label %bb.outer.end

				bb.outer.else:
				%tmp4 = getelementptr inbounds i32, i32 addrspace(1)* %tmp1, i32 3
				store i32 3, i32 addrspace(1)* %tmp4, align 4
				%cc3 = icmp eq i32 %tmp, 2
				br i1 %cc3, label %bb.inner.then2, label %bb.outer.end

				bb.inner.then2:
				%tmp5 = getelementptr inbounds i32, i32 addrspace(1)* %tmp1, i32 4
				store i32 4, i32 addrspace(1)* %tmp5, align 4
				br label %bb.outer.end

				bb.outer.end:
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x() #0

				attributes #0 = { nounwind readnone speculatable }