This is an archive of the discontinued LLVM Phabricator instance.

Differential D102830

[AMDGPU] Avoid null export insertion when unifying exit blocks
Abandoned · Public

Authored by critson on May 20 2021, 1:25 AM.

Details

Reviewers

foad
piotr
ruiling
arsenm
cwabbott

Summary

Avoid adding a null export by unifying existing "done" exports
in the unified exit block of a shader with divergent exits.
If a null export is required then place this in a separate block
only visited by divergent exits that require it.

This assumes well-formed IR generated by existing frontends.
Specifically, it assumes that there is a single "done" export for the shader,
and that this export occurs in uniform control flow.
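
The check described in the summary can be illustrated outside of LLVM. The following is a minimal sketch on a toy CFG, not the code from this revision: ToyBlock, hasReachableDoneExport, blocksNeedingNullExport and makeExampleCFG are hypothetical names, and the example graph is only loosely modeled on an infinite loop that receives a dummy return block. It walks predecessors backwards from each return block and reports the returns from which no "done" export is reachable, i.e. the ones that would be routed through a separate null-export block.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy stand-in for a basic block (hypothetical; not llvm::BasicBlock).
struct ToyBlock {
  bool HasDoneExport = false;     // block contains an export with "done" set
  std::vector<std::size_t> Preds; // indices of predecessor blocks
};

// Walk backwards from Start through predecessors. Returns true if Start or
// any block that can reach it contains a "done" export.
bool hasReachableDoneExport(const std::vector<ToyBlock> &CFG,
                            std::size_t Start) {
  std::set<std::size_t> Visited{Start};
  std::vector<std::size_t> Stack{Start};
  while (!Stack.empty()) {
    std::size_t Top = Stack.back();
    Stack.pop_back();
    if (CFG[Top].HasDoneExport)
      return true;
    for (std::size_t Pred : CFG[Top].Preds)
      if (Visited.insert(Pred).second)
        Stack.push_back(Pred);
  }
  return false;
}

// Return blocks with no reachable "done" export; in this toy model these are
// the ones that would branch to a null-export block before returning.
std::vector<std::size_t>
blocksNeedingNullExport(const std::vector<ToyBlock> &CFG,
                        const std::vector<std::size_t> &Returns) {
  std::vector<std::size_t> Needs;
  for (std::size_t R : Returns)
    if (!hasReachableDoneExport(CFG, R))
      Needs.push_back(R);
  return Needs;
}

// Assumed example graph:
//   0: entry, 1: end (done export, returns),
//   2: infinite loop, 3: dummy return block reached from the loop.
std::vector<ToyBlock> makeExampleCFG() {
  std::vector<ToyBlock> CFG(4);
  CFG[1].HasDoneExport = true;
  CFG[1].Preds = {0};
  CFG[2].Preds = {0, 2};
  CFG[3].Preds = {2};
  return CFG;
}
```

In this model only the dummy return (block 3) needs the null export, since the "end" return already contains a "done" export. Like the revision, the sketch relies on the frontend having placed at most one "done" export in uniform control flow; it does not verify that.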

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

critson created this revision. May 20 2021, 1:25 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 7 others. May 20 2021, 1:25 AM

critson requested review of this revision. May 20 2021, 1:25 AM

Herald added a project: Restricted Project. May 20 2021, 1:25 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B105370: Diff 346653. May 20 2021, 2:18 AM

Mostly minor comments inline but I'm concerned about the "multiple predecessors with different done exports" case.

llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp
138	Could be "for (auto &I : reverse(BB)) ..."
141	Why compare with BoolTrue, rather than examine the value? Is this just optimisation to make the test as fast as possible?
220	Isn't it possible that different predecessors have different "done" exports? What is this function supposed to do in that case?
224	Don't need the ", 8" (unless you have special knowledge that 8 really is the optimal value).
228	What does "otherwise" refer back to?
229	Adding "else Phis[Idx] = Undef" here would simplify the creation of the new export intrinsic below.
243	I think `return IsCompr ? (...) : (...)` would be clearer.
249	Clearer to use .empty() than to use .size() as a boolean, here and 12 lines below.
328–329	Wasn't this supposed to say "exec mask"?

In D102830#2770970, @foad wrote:

Mostly minor comments inline but I'm concerned about the "multiple predecessors with different done exports" case.

I have gone into this a bit more in the comment.
However, this expects the input IR to be well-formed w.r.t. exports, as it is when produced by existing front-ends.
If we wanted to handle the generic case then I feel like we might as well write a pass that checks all exports and adjusts them so they are well formed and run it after this pass.

llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp
141	I believe this is standard practice rather than retrieving the constant int values representing true and false repeatedly.
220	This does assume the IR is well-formed to a certain degree. I have added a comment to try to document this. Consider the following IR:
    if (condition) {
      export done 1
    } else {
      export done 2
    }
    return
This will eventually be compiled to:
    set exec mask for if-branch
    export done 1
    set exec mask for else-branch
    export done 2
    restore exec mask
    return
Multiple export done instructions will be executed, which is invalid. The expected cases for this are:
Divergent exits with their own exports, e.g.
    if (condition) {
      export done 1
      return
    } else {
      export done 2
      return
    }
Or divergent exits without exports or a uniformly reached export, e.g.
    export done
    if (condition) {
      return
    } else {
      return
    }
228	I don't know, let's remove it.
229	Sure, I had to change the type of Phis and add some casts.
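
The invalid case from the reply to line 220 (an if/else with a "done" export on each side and a shared return) can be made concrete with a small model. This is a rough sketch under simplifying assumptions, not a description of the hardware or of the backend: ToyRegion and countDoneExports are hypothetical names, a wave is a plain bitmask of lanes, and each side of the branch is assumed to run whenever at least one active lane takes it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One side of a structurized branch in the toy model (hypothetical type).
struct ToyRegion {
  std::uint32_t LaneMask; // lanes whose condition selects this side
  bool HasDoneExport;     // side contains an export with "done" set
};

// Count how many "done" exports a wave issues when the regions are executed
// one after another, each under its own exec mask. A region is skipped only
// when no active lane selects it.
int countDoneExports(std::uint32_t ActiveLanes,
                     const std::vector<ToyRegion> &Regions) {
  int Count = 0;
  for (const ToyRegion &R : Regions) {
    std::uint32_t Exec = ActiveLanes & R.LaneMask;
    if (Exec != 0 && R.HasDoneExport)
      ++Count;
  }
  return Count;
}
```

With a four-lane wave (mask 0xF), a divergent condition that sends lanes 0x3 to the if side and lanes 0xC to the else side yields two "done" exports in this model, while a uniform condition that sends all lanes to one side yields one.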

Address reviewer comments

Harbormaster completed remote builds in B105561: Diff 346922. May 20 2021, 9:29 PM

foad added inline comments. May 21 2021, 2:04 AM

llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp
220	OK. Thanks for explaining. I find it hard to get my head round the possible cases that we do or don't have to handle. Do you cope with:
    if (uniform condition) export done 1 else export done 2
    if (divergent condition) return 1 else return 2
?
229	Sorry, I didn't realise it would need a cast.

I don't see why we need to handle a situation that would never happen. Can we simply assert that each function has at most one exp_done?

critson added inline comments. May 21 2021, 3:16 AM

llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp
220	Good question, I think the answer is no -- although it is in principle valid IR that we do not generate. I realize I was thinking primarily of divergent conditions, rather than uniform ones. Although the comment I added is still accurate, <= 1 reachable export done. In principle I can rework this to handle the case you raised, but I am not sure it is worth it?

foad added inline comments. May 21 2021, 3:25 AM

llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp
220	I have no idea what cases must be handled or are "worth" handling, so I think I'll leave that part of the review to someone who understands that.

Rework based on the assumption there is only one "done" export.
This is true for existing front-ends.

critson edited the summary of this revision. Jul 6 2021, 4:15 AM

foad added inline comments. Jul 6 2021, 4:32 AM

llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll
11–12	This comment seems obsolete now.
28–29	Same here.

Harbormaster completed remote builds in B112579: Diff 356673. Jul 6 2021, 4:43 AM

Remove redundant comments

critson marked 5 inline comments as done. Jul 6 2021, 8:03 PM

Harbormaster completed remote builds in B112715: Diff 356855. Jul 6 2021, 8:46 PM

critson mentioned this in D105610: [AMDGPU] Don't handle export done when unify exit nodes. Jul 8 2021, 12:13 AM

Hi, I think we can further simplify the logic in our backend regarding export done, and I have implemented my idea as D105610. Could you take a look and say whether you have any concerns? The main idea is that we just trust the frontend to set up exp done correctly and only insert exp-null-done for the kill intrinsic, which is handled in SILateBranchLowering.cpp.

critson abandoned this revision. Jul 9 2021, 1:58 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp	158 lines
llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll	14 lines
llvm/test/CodeGen/AMDGPU/multi-divergent-exit-region.ll	16 lines

Diff 356855

llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp

[62 lines hidden]	public:
AMDGPUUnifyDivergentExitNodes() : FunctionPass(ID) {		AMDGPUUnifyDivergentExitNodes() : FunctionPass(ID) {
initializeAMDGPUUnifyDivergentExitNodesPass(*PassRegistry::getPassRegistry());		initializeAMDGPUUnifyDivergentExitNodesPass(*PassRegistry::getPassRegistry());
}		}

// We can preserve non-critical-edgeness when we unify function exit nodes		// We can preserve non-critical-edgeness when we unify function exit nodes
void getAnalysisUsage(AnalysisUsage &AU) const override;		void getAnalysisUsage(AnalysisUsage &AU) const override;
BasicBlock *unifyReturnBlockSet(Function &F, DomTreeUpdater &DTU,		BasicBlock *unifyReturnBlockSet(Function &F, DomTreeUpdater &DTU,
ArrayRef<BasicBlock *> ReturningBlocks,		ArrayRef<BasicBlock *> ReturningBlocks,
bool InsertExport, StringRef Name);		bool UpdateExports);
bool runOnFunction(Function &F) override;		bool runOnFunction(Function &F) override;
};		};

} // end anonymous namespace		} // end anonymous namespace

char AMDGPUUnifyDivergentExitNodes::ID = 0;		char AMDGPUUnifyDivergentExitNodes::ID = 0;

char &llvm::AMDGPUUnifyDivergentExitNodesID = AMDGPUUnifyDivergentExitNodes::ID;		char &llvm::AMDGPUUnifyDivergentExitNodesID = AMDGPUUnifyDivergentExitNodes::ID;
[48 lines hidden]	for (BasicBlock *Pred : predecessors(Top)) {
if (Visited.insert(Pred).second)		if (Visited.insert(Pred).second)
Stack.push_back(Pred);		Stack.push_back(Pred);
}		}
}		}

return true;		return true;
}		}

static void removeDoneExport(Function &F) {		// Reverse search block to find a "done" export
ConstantInt *BoolFalse = ConstantInt::getFalse(F.getContext());		static IntrinsicInst *findExportDone(BasicBlock *BB, ConstantInt *BoolTrue) {
for (BasicBlock &BB : F) {		for (auto &I : reverse(*BB)) {
for (Instruction &I : BB) {
if (IntrinsicInst *Intrin = llvm::dyn_cast<IntrinsicInst>(&I)) {		if (IntrinsicInst *Intrin = llvm::dyn_cast<IntrinsicInst>(&I)) {
if (Intrin->getIntrinsicID() == Intrinsic::amdgcn_exp) {		if (Intrin->getIntrinsicID() == Intrinsic::amdgcn_exp) {
Intrin->setArgOperand(6, BoolFalse); // done		if (Intrin->getArgOperand(6) == BoolTrue)
		return Intrin;
} else if (Intrin->getIntrinsicID() == Intrinsic::amdgcn_exp_compr) {		} else if (Intrin->getIntrinsicID() == Intrinsic::amdgcn_exp_compr) {
Intrin->setArgOperand(4, BoolFalse); // done		if (Intrin->getArgOperand(4) == BoolTrue)
		return Intrin;
}		}
}		}
}		}
		return nullptr;
}		}

		// Search a block and its predecessors to see if there is a reachable
		// "done" export.
		// Note: this assumes the IR is already well-formed
		static bool hasExportDone(BasicBlock *BB, ConstantInt *BoolTrue) {
		SmallPtrSet<BasicBlock *, 8> Visited;
		SmallVector<BasicBlock *> Stack;

		Stack.push_back(BB);
		do {
		BasicBlock *Top = Stack.pop_back_val();
		if (findExportDone(Top, BoolTrue))
		return true;

		for (BasicBlock *Pred : predecessors(Top)) {
		if (Visited.insert(Pred).second)
		Stack.push_back(Pred);
		}
		} while (!Stack.empty());

		return false;
}		}

BasicBlock *AMDGPUUnifyDivergentExitNodes::unifyReturnBlockSet(		BasicBlock *AMDGPUUnifyDivergentExitNodes::unifyReturnBlockSet(
Function &F, DomTreeUpdater &DTU, ArrayRef<BasicBlock *> ReturningBlocks,		Function &F, DomTreeUpdater &DTU, ArrayRef<BasicBlock *> ReturningBlocks,
bool InsertExport, StringRef Name) {		bool UpdateExports) {
// Otherwise, we need to insert a new basic block into the function, add a PHI		const StringRef ExportBlockName = "UnifiedExportBlock";
// nodes (if the function returns values), and convert all of the return		const StringRef ReturnBlockName = "UnifiedReturnBlock";
// instructions into unconditional branches.
BasicBlock *NewRetBlock = BasicBlock::Create(F.getContext(), Name, &F);		// Process exports first so that any new blocks are inserted in order.
IRBuilder<> B(NewRetBlock);		BasicBlock *ExportBlock = nullptr;
		SmallPtrSet<const BasicBlock *, 2> NeedsExport;
if (InsertExport) {
// Ensure that there's only one "done" export in the shader by removing the		if (UpdateExports) {
// "done" bit set on the original final export. More than one "done" export		// Determine whether return blocks have a reachable "done" export.
// can lead to undefined behavior.		// Note: this assume any reachable export is uniform, it is up to
removeDoneExport(F);		// frontends to ensure this is true.
		for (BasicBlock *BB : ReturningBlocks) {
		if (!hasExportDone(BB, ConstantInt::getTrue(F.getContext())))
		NeedsExport.insert(BB);
		}
		if (!NeedsExport.empty()) {
		// Insert export block with export null.
		// If every return needs an export, this will become the return block.
		ExportBlock = BasicBlock::Create(
		F.getContext(),
		NeedsExport.size() == ReturningBlocks.size() ? ReturnBlockName
		: ExportBlockName,
		&F);
		IRBuilder<> B(ExportBlock);
Value *Undef = UndefValue::get(B.getFloatTy());		Value *Undef = UndefValue::get(B.getFloatTy());
B.CreateIntrinsic(Intrinsic::amdgcn_exp, { B.getFloatTy() },		B.CreateIntrinsic(Intrinsic::amdgcn_exp, {B.getFloatTy()},
{		{
B.getInt32(AMDGPU::Exp::ET_NULL),		B.getInt32(AMDGPU::Exp::ET_NULL),
B.getInt32(0), // enabled channels		B.getInt32(0), // enabled channels
Undef, Undef, Undef, Undef, // values		Undef, Undef, Undef, Undef, // values
B.getTrue(), // done		B.getTrue(), // done
B.getTrue(), // valid mask		B.getTrue(), // valid mask
});		});
}		}
		}

		// We need to insert a new basic block into the function,
		// add a PHI nodes (if the function returns values),
		// and convert all of the return instructions into unconditional branches.
		BasicBlock *NewRetBlock =
		NeedsExport.size() == ReturningBlocks.size()
		? ExportBlock
		: BasicBlock::Create(F.getContext(), ReturnBlockName, &F);
		IRBuilder<> B(NewRetBlock);

		if (ExportBlock && ExportBlock != NewRetBlock)
		BranchInst::Create(NewRetBlock, ExportBlock);

PHINode *PN = nullptr;		PHINode *PN = nullptr;
if (F.getReturnType()->isVoidTy()) {		if (F.getReturnType()->isVoidTy()) {
B.CreateRetVoid();		B.CreateRetVoid();
} else {		} else {
// If the function doesn't return void... add a PHI node to the block...		// If the function doesn't return void... add a PHI node to the block...
PN = B.CreatePHI(F.getReturnType(), ReturningBlocks.size(),		PN = B.CreatePHI(F.getReturnType(), ReturningBlocks.size(),
"UnifiedRetVal");		"UnifiedRetVal");
assert(!InsertExport);		assert(!UpdateExports);
B.CreateRet(PN);		B.CreateRet(PN);
}		}

// Loop over all of the blocks, replacing the return instruction with an		// Loop over all of the blocks, replacing the return instruction with an
// unconditional branch.		// unconditional branch.
std::vector<DominatorTree::UpdateType> Updates;		std::vector<DominatorTree::UpdateType> Updates;
Updates.reserve(ReturningBlocks.size());		Updates.reserve(ReturningBlocks.size() + 1);
		if (ExportBlock)
		Updates.push_back({DominatorTree::Insert, NewRetBlock, NewRetBlock});
for (BasicBlock *BB : ReturningBlocks) {		for (BasicBlock *BB : ReturningBlocks) {
// Add an incoming element to the PHI node for every return instruction that		// Add an incoming element to the PHI node for every return instruction that
// is merging into this new block...		// is merging into this new block...
if (PN)		if (PN)
PN->addIncoming(BB->getTerminator()->getOperand(0), BB);		PN->addIncoming(BB->getTerminator()->getOperand(0), BB);

// Remove and delete the return inst.		// Remove and delete the return inst.
BB->getTerminator()->eraseFromParent();		BB->getTerminator()->eraseFromParent();
		if (NeedsExport.count(BB)) {
		BranchInst::Create(ExportBlock, BB);
		Updates.push_back({DominatorTree::Insert, BB, ExportBlock});
		} else {
BranchInst::Create(NewRetBlock, BB);		BranchInst::Create(NewRetBlock, BB);
Updates.push_back({DominatorTree::Insert, BB, NewRetBlock});		Updates.push_back({DominatorTree::Insert, BB, NewRetBlock});
}		}
		}

if (RequireAndPreserveDomTree)		if (RequireAndPreserveDomTree)
DTU.applyUpdates(Updates);		DTU.applyUpdates(Updates);
Updates.clear();		Updates.clear();

for (BasicBlock *BB : ReturningBlocks) {		for (BasicBlock *BB : ReturningBlocks) {
// Cleanup possible branch to unconditional branch to the return.		// Cleanup possible branch to unconditional branch to the return.
simplifyCFG(BB, *TTI, RequireAndPreserveDomTree ? &DTU : nullptr,		simplifyCFG(BB, *TTI, RequireAndPreserveDomTree ? &DTU : nullptr,
[23 lines hidden]	bool AMDGPUUnifyDivergentExitNodes::runOnFunction(Function &F) {
// return.		// return.
SmallVector<BasicBlock *, 4> ReturningBlocks;		SmallVector<BasicBlock *, 4> ReturningBlocks;
SmallVector<BasicBlock *, 4> UniformlyReachedRetBlocks;		SmallVector<BasicBlock *, 4> UniformlyReachedRetBlocks;
SmallVector<BasicBlock *, 4> UnreachableBlocks;		SmallVector<BasicBlock *, 4> UnreachableBlocks;

// Dummy return block for infinite loop.		// Dummy return block for infinite loop.
BasicBlock *DummyReturnBB = nullptr;		BasicBlock *DummyReturnBB = nullptr;

bool InsertExport = false;		bool UpdateExports = false;

bool Changed = false;		bool Changed = false;
std::vector<DominatorTree::UpdateType> Updates;		std::vector<DominatorTree::UpdateType> Updates;

for (BasicBlock *BB : PDT.roots()) {		for (BasicBlock *BB : PDT.roots()) {
if (isa<ReturnInst>(BB->getTerminator())) {		if (isa<ReturnInst>(BB->getTerminator())) {
if (!isUniformlyReached(DA, *BB))		if (!isUniformlyReached(DA, *BB))
ReturningBlocks.push_back(BB);		ReturningBlocks.push_back(BB);
[14 lines hidden]	if (isa<ReturnInst>(BB->getTerminator())) {
// For pixel shaders, the producer guarantees that an export is		// For pixel shaders, the producer guarantees that an export is
// executed before each return instruction. However, if there is an		// executed before each return instruction. However, if there is an
// infinite loop and we insert a return ourselves, we need to uphold		// infinite loop and we insert a return ourselves, we need to uphold
// that guarantee by inserting a null export. This can happen e.g. in		// that guarantee by inserting a null export. This can happen e.g. in
// an infinite loop with kill instructions, which is supposed to		// an infinite loop with kill instructions, which is supposed to
// terminate. However, we don't need to do this if there is a non-void		// terminate. However, we don't need to do this if there is a non-void
// return value, since then there is an epilog afterwards which will		// return value, since then there is an epilog afterwards which will
// still export.		// still export.
//		//
// Note: In the case where only some threads enter the infinite loop,		// Note: we only insert a null export in the case where an existing
// this can result in the null export happening redundantly after the		// "done" export is not reachable from a return block.
// original exports. However, The last "real" export happens after all		// This avoids double "done" exports for well-formed IR generated by
// the threads that didn't enter an infinite loop converged, which		// existing frontends (which ensure there is only one such export in
// means that the only extra threads to execute the null export are		// occuring in uniform control flow).
// threads that entered the infinite loop, and they only could've
// exited through being killed which sets their exec bit to 0.
// Therefore, unless there's an actual infinite loop, which can have
// invalid results, or there's a kill after the last export, which we
// assume the frontend won't do, this export will have the same exec
// mask as the last "real" export, and therefore the valid mask will be
// overwritten with the same value and will still be correct. Also,
// even though this forces an extra unnecessary export wait, we assume
// that this happens rare enough in practice to that we don't have to
// worry about performance.
if (F.getCallingConv() == CallingConv::AMDGPU_PS &&		if (F.getCallingConv() == CallingConv::AMDGPU_PS &&
RetTy->isVoidTy()) {		RetTy->isVoidTy()) {
InsertExport = true;		UpdateExports = true;
}		}

ReturnInst::Create(F.getContext(), RetVal, DummyReturnBB);		ReturnInst::Create(F.getContext(), RetVal, DummyReturnBB);
ReturningBlocks.push_back(DummyReturnBB);		ReturningBlocks.push_back(DummyReturnBB);
}		}

if (BI->isUnconditional()) {		if (BI->isUnconditional()) {
BasicBlock *LoopHeaderBB = BI->getSuccessor(0);		BasicBlock *LoopHeaderBB = BI->getSuccessor(0);
[77 lines hidden]	bool AMDGPUUnifyDivergentExitNodes::runOnFunction(Function &F) {
if (RequireAndPreserveDomTree)		if (RequireAndPreserveDomTree)
DTU.applyUpdates(Updates);		DTU.applyUpdates(Updates);
Updates.clear();		Updates.clear();

// Now handle return blocks.		// Now handle return blocks.
if (ReturningBlocks.empty())		if (ReturningBlocks.empty())
return Changed; // No blocks return		return Changed; // No blocks return

if (ReturningBlocks.size() == 1 && !InsertExport)		if (ReturningBlocks.size() == 1 && !UpdateExports)
return Changed; // Already has a single return block		return Changed; // Already has a single return block

// Unify returning blocks. If we are going to insert the export it is also		// Unify returning blocks. If we are going to insert the export it is also
// necessary to include blocks that are uniformly reached, because in addition		// necessary to include blocks that are uniformly reached, because in addition
// to inserting the export the "done" bits on existing exports will be cleared		// to inserting the export the "done" bits on existing exports will be cleared
// and we do not want to end up with the normal export in a non-unified,		// and we do not want to end up with the normal export in a non-unified,
// uniformly reached block with the "done" bit cleared.		// uniformly reached block with the "done" bit cleared.
auto BlocksToUnify = std::move(ReturningBlocks);		auto BlocksToUnify = std::move(ReturningBlocks);
if (InsertExport) {		if (UpdateExports) {
llvm::append_range(BlocksToUnify, UniformlyReachedRetBlocks);		llvm::append_range(BlocksToUnify, UniformlyReachedRetBlocks);
}		}

unifyReturnBlockSet(F, DTU, BlocksToUnify, InsertExport,		unifyReturnBlockSet(F, DTU, BlocksToUnify, UpdateExports);
"UnifiedReturnBlock");
return true;		return true;
}		}

llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs -simplifycfg-require-and-preserve-domtree=1 < %s | FileCheck -enable-var-scope %s			; RUN: llc -march=amdgcn -verify-machineinstrs -simplifycfg-require-and-preserve-domtree=1 < %s | FileCheck -enable-var-scope %s

	; Although it's modeled without any control flow in order to get better code			; Although it's modeled without any control flow in order to get better code
	; out of the structurizer, @llvm.amdgcn.kill actually ends the thread that calls			; out of the structurizer, @llvm.amdgcn.kill actually ends the thread that calls
	; it with "true". In case it's called in a provably infinite loop, we still			; it with "true". In case it's called in a provably infinite loop, we still
	; need to successfully exit and export something, even if we can't know where			; need to successfully exit and export something, even if we can't know where
	; to jump to in the LLVM IR. Therefore we insert a null export ourselves in			; to jump to in the LLVM IR. Therefore we insert a null export ourselves in
	; this case right before the s_endpgm to avoid GPU hangs, which is what this			; this case right before the s_endpgm to avoid GPU hangs, which is what this
	; tests.			; tests.

	; CHECK-LABEL: return_void			; CHECK-LABEL: return_void
	; Make sure that we remove the done bit from the original export			; CHECK: exp mrt0 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} done vm
	; CHECK: exp mrt0 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} vm			; CHECK-NOT: exp
	; CHECK: exp null off, off, off, off done vm			; CHECK: s_endpgm
	; CHECK-NEXT: s_endpgm
	define amdgpu_ps void @return_void(float %0) #0 {			define amdgpu_ps void @return_void(float %0) #0 {
	main_body:			main_body:
	%cmp = fcmp olt float %0, 1.000000e+01			%cmp = fcmp olt float %0, 1.000000e+01
	br i1 %cmp, label %end, label %loop			br i1 %cmp, label %end, label %loop

	loop:			loop:
	call void @llvm.amdgcn.kill(i1 false) #3			call void @llvm.amdgcn.kill(i1 false) #3
	br label %loop			br label %loop

	end:			end:
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float 0., float 0., float 0., float 1., i1 true, i1 true) #3			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float 0., float 0., float 0., float 1., i1 true, i1 true) #3
	ret void			ret void
	}			}

	; Check that we also remove the done bit from compressed exports correctly.
	; CHECK-LABEL: return_void_compr			; CHECK-LABEL: return_void_compr
	; CHECK: exp mrt0 v{{[0-9]+}}, off, v{{[0-9]+}}, off compr vm			; CHECK: exp mrt0 v{{[0-9]+}}, off, v{{[0-9]+}}, off done compr vm
	; CHECK: exp null off, off, off, off done vm			; CHECK-NOT: exp
	; CHECK-NEXT: s_endpgm			; CHECK: s_endpgm
	define amdgpu_ps void @return_void_compr(float %0) #0 {			define amdgpu_ps void @return_void_compr(float %0) #0 {
	main_body:			main_body:
	%cmp = fcmp olt float %0, 1.000000e+01			%cmp = fcmp olt float %0, 1.000000e+01
	br i1 %cmp, label %end, label %loop			br i1 %cmp, label %end, label %loop

	loop:			loop:
	call void @llvm.amdgcn.kill(i1 false) #3			call void @llvm.amdgcn.kill(i1 false) #3
	br label %loop			br label %loop
	[45 lines hidden]

llvm/test/CodeGen/AMDGPU/multi-divergent-exit-region.ll

	[712 lines hidden]

	bb4: ; preds = %bb3			bb4: ; preds = %bb3
	ret void			ret void

	bb5: ; preds = %bb3			bb5: ; preds = %bb3
	unreachable			unreachable
	}			}

	; Test that there is an extra export inserted after the normal export,			; Test that there is an extra export inserted for the infinite loop
	; if the normal export is inside a uniformly reached block and there is			; exit path of the pixel shader.
	; an infinite loop in the pixel shader.

	; IR-LABEL: @uniformly_reached_export			; IR-LABEL: @uniformly_reached_export
	; IR-NEXT: .entry:			; IR-NEXT: .entry:
	; IR: br i1 [[CND:%.*]], label %[[EXP:.*]], label %[[FLOW:.*]]			; IR: br i1 [[CND:%.*]], label %[[EXP:.*]], label %[[FLOW:.*]]

	; IR: [[FLOW]]:			; IR: [[FLOW]]:
	; IR-NEXT: phi			; IR-NEXT: phi
	; IR-NEXT: br i1 [[CND2:%.*]], label %[[LOOP:.*]], label %UnifiedReturnBlock			; IR-NEXT: br i1 [[CND2:%.*]], label %[[LOOP:.*]], label %[[FLOW1:.*]]

	; IR: [[LOOP]]:			; IR: [[LOOP]]:
	; IR-NEXT: br i1 false, label %[[FLOW1:.*]], label %[[LOOP]]			; IR-NEXT: br i1 false, label %UnifiedExportBlock, label %[[LOOP]]

	; IR: [[EXP]]:			; IR: [[EXP]]:
	; IR-NEXT: call void @llvm.amdgcn.exp.compr.v2f16(i32 immarg 0, i32 immarg 15, <2 x half> <half 0xH3C00, half 0xH0000>, <2 x half> <half 0xH0000, half 0xH3C00>, i1 immarg false, i1 immarg true)			; IR-NEXT: call void @llvm.amdgcn.exp.compr.v2f16(i32 immarg 0, i32 immarg 15, <2 x half> <half 0xH3C00, half 0xH0000>, <2 x half> <half 0xH0000, half 0xH3C00>, i1 immarg true, i1 immarg true)
	; IR-NEXT: br label %[[FLOW]]			; IR-NEXT: br label %[[FLOW]]

	; IR: [[FLOW1]]:			; IR: [[FLOW1]]:
	; IR-NEXT: br label %UnifiedReturnBlock			; IR-NEXT: br label %UnifiedReturnBlock

	; IR: UnifiedReturnBlock:			; IR: UnifiedExportBlock:
	; IR-NEXT: call void @llvm.amdgcn.exp.f32(i32 9, i32 0, float undef, float undef, float undef, float undef, i1 true, i1 true)			; IR-NEXT: call void @llvm.amdgcn.exp.f32(i32 9, i32 0, float undef, float undef, float undef, float undef, i1 true, i1 true)
				; IR-NEXT: br label %[[FLOW1]]

				; IR: UnifiedReturnBlock:
	; IR-NEXT: ret void			; IR-NEXT: ret void

	define amdgpu_ps void @uniformly_reached_export(float inreg %tmp25) {			define amdgpu_ps void @uniformly_reached_export(float inreg %tmp25) {
	.entry:			.entry:
	%tmp26 = fcmp olt float %tmp25, 0.000000e+00			%tmp26 = fcmp olt float %tmp25, 0.000000e+00
	br i1 %tmp26, label %loop, label %bb27			br i1 %tmp26, label %loop, label %bb27

	loop: ; preds = %loop, %.entry			loop: ; preds = %loop, %.entry
	[12 lines hidden]