This is an archive of the discontinued LLVM Phabricator instance.

Differential D68092

[AMDGPU] Invert the handling of skip insertion.
ClosedPublic

Authored by cdevadas on Sep 26 2019, 10:36 AM.

Details

Reviewers

arsenm

Commits

rGe53a9d96e6a0: Resubmit: [AMDGPU] Invert the handling of skip insertion.
rG0dc6c249bffa: [AMDGPU] Invert the handling of skip insertion.

Summary

The current implementation of skip insertion (SIInsertSkips) makes it a mandatory pass
required for correctness. The original idea was for this handling to be an optional pass.
This patch inserts s_cbranch_execz up front, during SILowerControlFlow, to skip over
sections of code when no lanes are active. SIRemoveShortExecBranches then tries to
remove the skips for short branches, while retaining s_cbranch_execz
in the cases where the skip branch may be necessary.

The new pass, SIRemoveShortExecBranches, will replace the handling of
skip insertion in the existing SIInsertSkips pass.

Diff Detail

Repository: rL LLVM

Event Timeline

cdevadas created this revision. Sep 26 2019, 10:36 AM

Herald added a project: Restricted Project. Sep 26 2019, 10:36 AM

Herald added subscribers: llvm-commits, mgorny, nhaehnle and 4 others.

cdevadas retitled this revision from "Invert the handling of skip insertion." to "[AMDGPU] Invert the handling of skip insertion." Sep 26 2019, 10:38 AM

Herald added subscribers: t-tye, tpr, dstuttard and 2 others. Sep 26 2019, 10:38 AM

arsenm added inline comments. Sep 26 2019, 12:36 PM

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
37	You could make this a static member and use cl::location with the flag
51–53	This isn't necessary
78–86	Should use analyzeBranch instead
88–89	This function should probably be rewritten at some point, but for now it's probably not important. I would rename it to sound stronger. mustRetainExeczBranch or something?
149	You don't need to scan forward through the whole block to find the branches. You can just check getFirstTerminator (or just call analyzeBranch on the block and check the condition type)
test/CodeGen/AMDGPU/convergent-inlineasm.ll
7	I assume the branch was here before and this isn't a change?

cdevadas marked an inline comment as done. Sep 27 2019, 6:00 AM

cdevadas added inline comments.

test/CodeGen/AMDGPU/convergent-inlineasm.ll
7	Yes, the branch was here even earlier.

Thank you for working on this.

There are a number of high-level problems with this code that are really due to its history and not to your changes, but I'd appreciate it if we could get things right now that the code is being rewritten. Mostly, the decision logic is overly complex and at least partially wrong because it makes assumptions about the order of basic blocks (i.e., the outer loop of shouldRetainSkips).

Let's rethink what the condition should be. Here's an attempt: s_cbranch_execz should be removed if

1. Not taking the branch when EXEC=0 will end up falling through to the branch target anyway.[0]
2. The fall-through code sequence has no unwanted side effects.
3. The fall-through code sequence is short and cheap.[1]

Heuristically, the cheapness requirement should arguably mean that the fall-through code sequence contains no branches at all. After all, if we're going to hit another branch anyway, we may as well just take the original EXECZ branch. This also simplifies the check that we correctly fall through to the branch target. (An exception could perhaps be made for a nested s_cbranch_execz, if the overall code sequence is short and cheap enough that it makes sense to remove both of them.)

How does that sound? This is a significant conceptual change to how the pass works, but I think it's for the better.

[0] There is an interesting subtle point here. The way we currently lower the original thread-based control flow, this fall-through property is always guaranteed except for subtleties involving VCC[Z] branches in loops. However, perhaps we won't always lower control flow in the same way, and perhaps we will one day generate code which doesn't have this property.

[1] What we really want is that cost of fall-through code < p_taken * (cost of taken branch) + p_nottaken * (cost of not-taken branch + cost of fall-through code).
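
The inequality in footnote [1] can be made concrete with a small sketch. This is only an illustration of the arithmetic: the struct, its field names, and the function name are hypothetical, and nothing here claims that the pass computes probabilities or costs this way.

```cpp
#include <cassert>

// Hypothetical cost summary for one s_cbranch_execz. All fields are
// illustrative assumptions, not values that LLVM computes.
struct ExeczBranchCosts {
  double PTaken;          // probability that EXEC == 0 at the branch
  double TakenCost;       // cost of the branch when it is taken
  double NotTakenCost;    // cost of the branch when it is not taken
  double FallThroughCost; // cost of executing the code the branch skips
};

// Returns true if always running the fall-through code is expected to be
// cheaper than keeping the branch, following footnote [1]:
//   fall-through < p_taken * taken + p_nottaken * (not-taken + fall-through)
bool removalIsProfitable(const ExeczBranchCosts &C) {
  assert(C.PTaken >= 0.0 && C.PTaken <= 1.0 && "probability out of range");
  const double PNotTaken = 1.0 - C.PTaken;
  const double WithBranch =
      C.PTaken * C.TakenCost + PNotTaken * (C.NotTakenCost + C.FallThroughCost);
  return C.FallThroughCost < WithBranch;
}
```

With a rarely taken branch over two cheap instructions (for example PTaken = 0.1 and FallThroughCost = 2), removal wins; with a long fall-through sequence that is usually skipped, keeping the branch wins.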

lib/Target/AMDGPU/SILowerControlFlow.cpp
247–248	Comment needs to be updated.
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
10–12	This should say that it removes EXECZ branches for short branches, right? Also... "try to retain"? I do hope we always succeed at that :)

You are right, we need to redesign the function shouldRetainSkips, especially how it computes the cost. It is not guaranteed that the blocks from 'From' to 'To' form a fall-through sequence.
There could be nested control flow, which makes the cost computation somewhat complex. We can only approximate the number of instructions in the region.
We talked about this earlier and are trying to keep the current design close to how SIInsertSkips works now.

lib/Target/AMDGPU/SILowerControlFlow.cpp
247–248	I will change it.
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
10–12	Yes, it is misleading. I will fix the comment.

incorporated the suggestions + rebase

nhaehnle added inline comments. Oct 8 2019, 8:06 AM

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
113	analyzeBranch's return value must be checked.
117–118	What's the logic here behind using domination as a criterion?

cdevadas marked 3 inline comments as done. Oct 8 2019, 10:08 AM

cdevadas added inline comments.

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
113	Sure. Will add that.
117–118	There could be a situation in which execnz (inserted during SI_LOOP lowering) can be inverted to execz by an optimization (for instance, BranchFolding). This execz should always be retained. This special check is added to handle it. Unfortunately, I couldn't write/find a test-case to reproduce it.

arsenm added inline comments. Oct 8 2019, 10:25 AM

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
114–115	I think reinterpreting analyzeBranch's outputs this way is potentially confusing. I think you don't actually need to check analyzeBranch directly here; I think MachineBasicBlock::getFallThrough does exactly this anyway (and handles the case where there's an unconditional branch as well)
117–118	I'm not sure dominance is sufficient for irreducible loops, which you won't run into in practice (as in, they probably hit another control flow bug long before this) but we should handle it correctly

nhaehnle added inline comments. Oct 9 2019, 3:41 AM

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
117–118	I have seen irreducible loops go all the way through compilation (because they triggered a bug somewhere, I believe in waitcount insertion), so yeah, that needs to be handled correctly. I still think a reasonable way to do this is just to scan forward like mustRetainExeczBranch already does, see if we encounter the execz target block during that scan, and only remove the execz branch in that case.
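
The forward scan suggested above can be sketched on a toy model. Everything here is an assumption made for illustration: ToyInstr, ToyBlock, and canRemoveExeczBranch are hypothetical stand-ins rather than LLVM types or the code that eventually landed, and blocks are identified by their index in layout order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a machine instruction (hypothetical).
struct ToyInstr {
  bool IsBranch = false;
  bool HasUnwantedEffectsWhenExecEmpty = false;
};

// Toy stand-in for a basic block: just its instructions.
using ToyBlock = std::vector<ToyInstr>;

// Walk the blocks in layout order starting at First. The execz branch may be
// removed only if the walk reaches Target before it meets a branch, an
// instruction with unwanted effects when EXEC is empty, or the threshold.
bool canRemoveExeczBranch(const std::vector<ToyBlock> &Layout,
                          std::size_t First, std::size_t Target,
                          unsigned Threshold) {
  unsigned NumInstr = 0;
  for (std::size_t B = First; B < Layout.size(); ++B) {
    if (B == Target)
      return true; // Falling through really does arrive at the target.
    for (const ToyInstr &I : Layout[B]) {
      if (I.IsBranch || I.HasUnwantedEffectsWhenExecEmpty)
        return false;
      if (++NumInstr >= Threshold)
        return false;
    }
  }
  return false; // Target never seen, e.g. it lies behind First in the layout.
}
```

Because the scan returns true only when it actually reaches the target, a backward branch or an unexpected block order keeps the branch without needing a separate dominance check.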

cdevadas marked 2 inline comments as done. Oct 23 2019, 7:37 AM

cdevadas added inline comments.

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
114–115	The check was necessary for the direct fall-through case, where analyzeBranch returns without assigning the FalseMBB. All I eventually require is the FalseMBB field. MachineBasicBlock::getFallThrough returns the fall-through block and doesn't serve the real purpose (getting the FalseMBB), especially when the false path is taken via an unconditional branch. With the following sequence, for instance, getFallThrough() returns %bb.1, but what we need is %bb.3:

	bb.0:
	  successors: %bb.3, %bb.1
	  -------------
	  S_CBRANCH_EXECZ %bb.1
	  S_BRANCH %bb.3
	bb.1:
	  ; predecessors: %bb.0, %bb.3
	  successors: %bb.2, %bb.4

	I believe extracting the FalseMBB from the successor list would be a better idea. The SrcMBB will always have exactly two successors.
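
A minimal sketch of the successor-list idea, assuming a block with exactly two successors whose taken target is already known. ToyMBB and getFalseSuccessor are hypothetical names used only for illustration; they are not the helper from the patch.

```cpp
#include <array>
#include <cassert>

// Toy stand-in for a basic block (hypothetical), identified by its number.
struct ToyMBB {
  int Number;
};

// Given the two successors of a block and the target of its conditional
// branch, return the other successor (the "false" destination).
const ToyMBB *getFalseSuccessor(const std::array<const ToyMBB *, 2> &Succs,
                                const ToyMBB *TrueBB) {
  assert((Succs[0] == TrueBB || Succs[1] == TrueBB) &&
         "branch target must be one of the successors");
  return Succs[0] == TrueBB ? Succs[1] : Succs[0];
}
```

For the sequence quoted above, the successors of bb.0 are %bb.3 and %bb.1 and the S_CBRANCH_EXECZ target is %bb.1, so the helper yields %bb.3 regardless of which block follows bb.0 in layout.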

Checked the return value of analyzeBranch.
Considered only forward branches, so that back-edges are excluded from this optimization.

Herald added a subscriber: arphaman. Nov 12 2019, 5:25 AM

arsenm added inline comments. Nov 13 2019, 2:24 AM

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
64–76	You're already doing a forward scan through the blocks in mustRetainExeczBranch, so it seems like this should just be checked there
83–85	I was thinking this would go in a separate wrapper function along with the analyzeBranch call. The connection to analyzeBranch isn't obvious at this point

Created a wrapper to get the true & false branch targets. Also, used BB numbering to identify the forward jumps.
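
Using block numbers to recognise forward jumps can be sketched as follows. This is a toy model under the assumption that numbers are assigned consecutively in layout order, which is the property the renumbering discussed later in this review is meant to restore; NumberedBlock, renumberBlocks, and isForwardJump are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a basic block that carries a layout number (hypothetical).
struct NumberedBlock {
  int Number = -1; // -1 means "not numbered yet"
};

// Assign consecutive numbers in layout order, so that a larger number always
// means "later in the function layout".
void renumberBlocks(std::vector<NumberedBlock> &Layout) {
  for (std::size_t I = 0; I < Layout.size(); ++I)
    Layout[I].Number = static_cast<int>(I);
}

// A jump is forward when its destination is laid out after its source.
bool isForwardJump(const NumberedBlock &Src, const NumberedBlock &Dest) {
  assert(Src.Number >= 0 && Dest.Number >= 0 && "blocks must be numbered");
  return Dest.Number > Src.Number;
}
```

The comparison is only meaningful while the numbering matches the layout, which is why stale numbers left behind by earlier passes would have to be refreshed first.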

arsenm added inline comments. Nov 19 2019, 4:39 AM

lib/Target/AMDGPU/SIInsertSkips.cpp
469–475	Shouldn't this be dead code now? i.e. you shouldn't be adding a case, and should be removing the SI_MASK_BRANCH part
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
67	This name is slightly misleading. getBlockDestinations?
140	I'm not sure if potentially renumbering the blocks counts as a change, but it probably doesn't really matter. It's not a guaranteed property between passes I guess

cdevadas marked 3 inline comments as done. Nov 19 2019, 5:03 AM

cdevadas added inline comments.

lib/Target/AMDGPU/SIInsertSkips.cpp
469–475	Yes, but I retained this code for the existing MIR tests added for SIInsertSkips. This file will go away entirely once the kill intrinsics and other unrelated implementations are handled elsewhere. Once the new design is upstream, I will clean up the MIR tests too.
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
67	sure, I will use this name.
140	Inserted the renumbering here to fix (restore) any broken BB numbering if any prior optimization pass has removed/added the BBs.

incorporated the suggestion (used getBlockDestinations for the function name)

LGTM with grammar fix, although this isn't complete. There's still work to kill the existing pass

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
13	s/is no unwanted sideeffects/are no unwanted side effects/

This revision is now accepted and ready to land. Dec 10 2019, 8:31 AM

Closed by commit rG0dc6c249bffa: [AMDGPU] Invert the handling of skip insertion. (authored by cdevadas). Jan 15 2020, 2:00 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: hiraditya. Jan 15 2020, 2:00 AM

foad added a subscriber: foad. Jan 15 2020, 3:41 AM

nhaehnle mentioned this in D72997: [AMDGPU] SIRemoveShortExecBranches should not remove branches exiting loops. Jan 21 2020, 12:35 AM

Hi there,

This change introduced GPU hangs with RADV. I know that it has been reverted since but release/10.x doesn't have the revert, hence it's still broken.
Can the revert (or the correct fix) be backported to LLVM 10 ?

Thanks!

Herald added a subscriber: kerbowa. Jan 30 2020, 7:50 AM

In D68092#1849644, @hakzsam wrote:

Hi there,

This change introduced GPU hangs with RADV. I know that it has been reverted since but release/10.x doesn't have the revert, hence it's still broken.
Can the revert (or the correct fix) be backported to LLVM 10 ?

Thanks!

https://bugs.llvm.org/show_bug.cgi?id=44720

foad mentioned this in D73771: [AMDGPU] Don't remove short branches over kills. Jan 31 2020, 2:48 AM

foad mentioned this in rG97d9a76afc97: [AMDGPU] Don't remove short branches over kills. Feb 3 2020, 1:32 AM

Revision Contents

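Path	Size
lib/Target/AMDGPU/AMDGPU.h	3 lines
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	2 lines
lib/Target/AMDGPU/CMakeLists.txt	1 line
lib/Target/AMDGPU/SIInsertSkips.cpp	5 lines
lib/Target/AMDGPU/SILowerControlFlow.cpp	6 lines
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp	163 lines
test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll	4 lines
test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll	2 lines
test/CodeGen/AMDGPU/branch-condition-and.ll	5 lines
test/CodeGen/AMDGPU/branch-relaxation.ll	9 lines
test/CodeGen/AMDGPU/call-skip.ll	9 lines
test/CodeGen/AMDGPU/collapse-endcf.ll	49 lines
test/CodeGen/AMDGPU/control-flow-fastregalloc.ll	15 lines
test/CodeGen/AMDGPU/convergent-inlineasm.ll	8 lines
test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll	8 lines
test/CodeGen/AMDGPU/else.ll	3 lines
test/CodeGen/AMDGPU/hoist-cond.ll	2 lines
test/CodeGen/AMDGPU/insert-skips-flat-vmem.mir	2 lines
test/CodeGen/AMDGPU/insert-skips-gws.mir	2 lines
test/CodeGen/AMDGPU/insert-skips-ignored-insts.mir	2 lines
test/CodeGen/AMDGPU/insert-skips-kill-uncond.mir	2 lines
test/CodeGen/AMDGPU/mubuf-legalize-operands.ll	6 lines
test/CodeGen/AMDGPU/mul24-pass-ordering.ll	3 lines
test/CodeGen/AMDGPU/ret_jump.ll	23 lines
test/CodeGen/AMDGPU/si-annotate-cf-noloop.ll	2 lines
test/CodeGen/AMDGPU/si-lower-control-flow-unreachable-block.ll	10 lines
test/CodeGen/AMDGPU/si-lower-control-flow.mir	2 lines
test/CodeGen/AMDGPU/skip-branch-taildup-ret.mir	2 lines
test/CodeGen/AMDGPU/skip-branch-trap.ll	7 lines
test/CodeGen/AMDGPU/skip-if-dead.ll	13 lines
test/CodeGen/AMDGPU/subreg-coalescer-undef-use.ll	4 lines
test/CodeGen/AMDGPU/uniform-cfg.ll	2 lines
test/CodeGen/AMDGPU/uniform-loop-inside-nonuniform.ll	4 lines
test/CodeGen/AMDGPU/valu-i1.ll	41 lines
test/CodeGen/AMDGPU/wave32.ll	16 lines
test/CodeGen/AMDGPU/wqm.ll	5 lines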

Diff 221984

lib/Target/AMDGPU/AMDGPU.h

	[148 lines not shown]
	extern char &SILoadStoreOptimizerID;			extern char &SILoadStoreOptimizerID;

	void initializeSIWholeQuadModePass(PassRegistry &);			void initializeSIWholeQuadModePass(PassRegistry &);
	extern char &SIWholeQuadModeID;			extern char &SIWholeQuadModeID;

	void initializeSILowerControlFlowPass(PassRegistry &);			void initializeSILowerControlFlowPass(PassRegistry &);
	extern char &SILowerControlFlowID;			extern char &SILowerControlFlowID;

				void initializeSIRemoveShortExecBranchesPass(PassRegistry &);
				extern char &SIRemoveShortExecBranchesID;

	void initializeSIInsertSkipsPass(PassRegistry &);			void initializeSIInsertSkipsPass(PassRegistry &);
	extern char &SIInsertSkipsPassID;			extern char &SIInsertSkipsPassID;

	void initializeSIOptimizeExecMaskingPass(PassRegistry &);			void initializeSIOptimizeExecMaskingPass(PassRegistry &);
	extern char &SIOptimizeExecMaskingID;			extern char &SIOptimizeExecMaskingID;

	void initializeSIPreAllocateWWMRegsPass(PassRegistry &);			void initializeSIPreAllocateWWMRegsPass(PassRegistry &);
	extern char &SIPreAllocateWWMRegsID;			extern char &SIPreAllocateWWMRegsID;
	[150 lines not shown]

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[221 lines not shown]	extern "C" void LLVMInitializeAMDGPUTarget() {
initializeAMDGPUPropagateAttributesLatePass(*PR);		initializeAMDGPUPropagateAttributesLatePass(*PR);
initializeAMDGPURewriteOutArgumentsPass(*PR);		initializeAMDGPURewriteOutArgumentsPass(*PR);
initializeAMDGPUUnifyMetadataPass(*PR);		initializeAMDGPUUnifyMetadataPass(*PR);
initializeSIAnnotateControlFlowPass(*PR);		initializeSIAnnotateControlFlowPass(*PR);
initializeSIInsertWaitcntsPass(*PR);		initializeSIInsertWaitcntsPass(*PR);
initializeSIModeRegisterPass(*PR);		initializeSIModeRegisterPass(*PR);
initializeSIWholeQuadModePass(*PR);		initializeSIWholeQuadModePass(*PR);
initializeSILowerControlFlowPass(*PR);		initializeSILowerControlFlowPass(*PR);
		initializeSIRemoveShortExecBranchesPass(*PR);
initializeSIInsertSkipsPass(*PR);		initializeSIInsertSkipsPass(*PR);
initializeSIMemoryLegalizerPass(*PR);		initializeSIMemoryLegalizerPass(*PR);
initializeSIOptimizeExecMaskingPass(*PR);		initializeSIOptimizeExecMaskingPass(*PR);
initializeSIPreAllocateWWMRegsPass(*PR);		initializeSIPreAllocateWWMRegsPass(*PR);
initializeSIFormMemoryClausesPass(*PR);		initializeSIFormMemoryClausesPass(*PR);
initializeAMDGPUUnifyDivergentExitNodesPass(*PR);		initializeAMDGPUUnifyDivergentExitNodesPass(*PR);
initializeAMDGPUAAWrapperPassPass(*PR);		initializeAMDGPUAAWrapperPassPass(*PR);
initializeAMDGPUExternalAAWrapperPass(*PR);		initializeAMDGPUExternalAAWrapperPass(*PR);
[752 lines not shown]	void GCNPassConfig::addPreEmitPass() {
//		//
// Here we add a stand-alone hazard recognizer pass which can handle all		// Here we add a stand-alone hazard recognizer pass which can handle all
// cases.		// cases.
//		//
// FIXME: This stand-alone pass will emit indiv. S_NOP 0, as needed. It would		// FIXME: This stand-alone pass will emit indiv. S_NOP 0, as needed. It would
// be better for it to emit S_NOP <N> when possible.		// be better for it to emit S_NOP <N> when possible.
addPass(&PostRAHazardRecognizerID);		addPass(&PostRAHazardRecognizerID);

		addPass(&SIRemoveShortExecBranchesID);
addPass(&SIInsertSkipsPassID);		addPass(&SIInsertSkipsPassID);
addPass(&BranchRelaxationPassID);		addPass(&BranchRelaxationPassID);
}		}

TargetPassConfig *GCNTargetMachine::createPassConfig(PassManagerBase &PM) {		TargetPassConfig *GCNTargetMachine::createPassConfig(PassManagerBase &PM) {
return new GCNPassConfig(*this, PM);		return new GCNPassConfig(*this, PM);
}		}

[154 lines not shown]

lib/Target/AMDGPU/CMakeLists.txt

[109 lines not shown]	add_llvm_target(AMDGPUCodeGen
SILowerSGPRSpills.cpp		SILowerSGPRSpills.cpp
SIMachineFunctionInfo.cpp		SIMachineFunctionInfo.cpp
SIMachineScheduler.cpp		SIMachineScheduler.cpp
SIMemoryLegalizer.cpp		SIMemoryLegalizer.cpp
SIOptimizeExecMasking.cpp		SIOptimizeExecMasking.cpp
SIOptimizeExecMaskingPreRA.cpp		SIOptimizeExecMaskingPreRA.cpp
SIPeepholeSDWA.cpp		SIPeepholeSDWA.cpp
SIRegisterInfo.cpp		SIRegisterInfo.cpp
		SIRemoveShortExecBranches.cpp
SIShrinkInstructions.cpp		SIShrinkInstructions.cpp
SIWholeQuadMode.cpp		SIWholeQuadMode.cpp
GCNILPSched.cpp		GCNILPSched.cpp
GCNRegBankReassign.cpp		GCNRegBankReassign.cpp
GCNNSAReassign.cpp		GCNNSAReassign.cpp
GCNDPPCombine.cpp		GCNDPPCombine.cpp
SIModeRegister.cpp		SIModeRegister.cpp
)		)

add_subdirectory(AsmParser)		add_subdirectory(AsmParser)
add_subdirectory(Disassembler)		add_subdirectory(Disassembler)
add_subdirectory(MCTargetDesc)		add_subdirectory(MCTargetDesc)
add_subdirectory(TargetInfo)		add_subdirectory(TargetInfo)
add_subdirectory(Utils)		add_subdirectory(Utils)

lib/Target/AMDGPU/SIInsertSkips.cpp

[35 lines not shown]
#include <cstdint>		#include <cstdint>
#include <iterator>		#include <iterator>

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "si-insert-skips"		#define DEBUG_TYPE "si-insert-skips"

static cl::opt<unsigned> SkipThresholdFlag(		static cl::opt<unsigned> SkipThresholdFlag(
"amdgpu-skip-threshold",		"amdgpu-skip-threshold-legacy",
cl::desc("Number of instructions before jumping over divergent control flow"),		cl::desc("Number of instructions before jumping over divergent control flow"),
cl::init(12), cl::Hidden);		cl::init(12), cl::Hidden);

namespace {		namespace {

class SIInsertSkips : public MachineFunctionPass {		class SIInsertSkips : public MachineFunctionPass {
private:		private:
const SIRegisterInfo *TRI = nullptr;		const SIRegisterInfo *TRI = nullptr;
[408 lines not shown]	for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();

MachineBasicBlock::iterator I, Next;		MachineBasicBlock::iterator I, Next;
for (I = MBB.begin(); I != MBB.end(); I = Next) {		for (I = MBB.begin(); I != MBB.end(); I = Next) {
Next = std::next(I);		Next = std::next(I);

MachineInstr &MI = *I;		MachineInstr &MI = *I;

switch (MI.getOpcode()) {		switch (MI.getOpcode()) {
		case AMDGPU::S_CBRANCH_EXECZ:
		ExecBranchStack.push_back(MI.getOperand(0).getMBB());
		break;
case AMDGPU::SI_MASK_BRANCH:		case AMDGPU::SI_MASK_BRANCH:
ExecBranchStack.push_back(MI.getOperand(0).getMBB());		ExecBranchStack.push_back(MI.getOperand(0).getMBB());
MadeChange |= skipMaskBranch(MI, MBB);		MadeChange |= skipMaskBranch(MI, MBB);
break;		break;

case AMDGPU::S_BRANCH:		case AMDGPU::S_BRANCH:
// Optimize out branches to the next block.		// Optimize out branches to the next block.
// FIXME: Shouldn't this be handled by BranchFolding?		// FIXME: Shouldn't this be handled by BranchFolding?
if (MBB.isLayoutSuccessor(MI.getOperand(0).getMBB())) {		if (MBB.isLayoutSuccessor(MI.getOperand(0).getMBB())) {
MI.eraseFromParent();		MI.eraseFromParent();
} else if (HaveSkipBlock) {		} else if (HaveSkipBlock) {
// Remove the given unconditional branch when a skip block has been		// Remove the given unconditional branch when a skip block has been
[58 lines not shown]

lib/Target/AMDGPU/SILowerControlFlow.cpp

[238 lines not shown]	void SILowerControlFlow::emitIf(MachineInstr &MI) {
}		}

// Use a copy that is a terminator to get correct spill code placement it with		// Use a copy that is a terminator to get correct spill code placement it with
// fast regalloc.		// fast regalloc.
MachineInstr *SetExec =		MachineInstr *SetExec =
BuildMI(MBB, I, DL, TII->get(MovTermOpc), Exec)		BuildMI(MBB, I, DL, TII->get(MovTermOpc), Exec)
.addReg(Tmp, RegState::Kill);		.addReg(Tmp, RegState::Kill);

// Insert a pseudo terminator to help keep the verifier happy. This will also		// Insert a pseudo terminator to help keep the verifier happy. This will also
// be used later when inserting skips.		// be used later when inserting skips.
MachineInstr *NewBr = BuildMI(MBB, I, DL, TII->get(AMDGPU::SI_MASK_BRANCH))		MachineInstr *NewBr = BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
.add(MI.getOperand(2));		.add(MI.getOperand(2));

if (!LIS) {		if (!LIS) {
MI.eraseFromParent();		MI.eraseFromParent();
return;		return;
}		}

LIS->InsertMachineInstrInMaps(*CopyExec);		LIS->InsertMachineInstrInMaps(*CopyExec);
[60 lines not shown]	void SILowerControlFlow::emitElse(MachineInstr &MI) {
}		}

MachineInstr *Xor =		MachineInstr *Xor =
BuildMI(MBB, ElsePt, DL, TII->get(XorTermrOpc), Exec)		BuildMI(MBB, ElsePt, DL, TII->get(XorTermrOpc), Exec)
.addReg(Exec)		.addReg(Exec)
.addReg(DstReg);		.addReg(DstReg);

MachineInstr *Branch =		MachineInstr *Branch =
BuildMI(MBB, ElsePt, DL, TII->get(AMDGPU::SI_MASK_BRANCH))		BuildMI(MBB, ElsePt, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
.addMBB(DestBB);		.addMBB(DestBB);

if (!LIS) {		if (!LIS) {
MI.eraseFromParent();		MI.eraseFromParent();
return;		return;
}		}

LIS->RemoveMachineInstrFromMaps(MI);		LIS->RemoveMachineInstrFromMaps(MI);
MI.eraseFromParent();		MI.eraseFromParent();
[242 lines not shown]

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp

This file was added.

				//===-- SIRemoveShortExecBranches.cpp ------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// This pass removes the unwanted exec mask instructions inserted during
				/// SILowerControlFlow. It removes the exec masks for the short branches and
				/// tries to retain it for the long branches.
				///
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUSubtarget.h"
				#include "SIInstrInfo.h"
				#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"

				using namespace llvm;

				#define DEBUG_TYPE "si-remove-short-exec-branches"

				static cl::opt<unsigned> SkipThresholdFlag(
				"amdgpu-skip-threshold",
				cl::desc(
				"Number of instructions before jumping over divergent control flow"),
				cl::init(12), cl::Hidden);

				namespace {

				class SIRemoveShortExecBranches : public MachineFunctionPass {
				private:
				const SIInstrInfo *TII = nullptr;
				unsigned SkipThreshold = 0;
				bool shouldRetainSkip(const MachineBasicBlock &From,
				const MachineBasicBlock &To) const;
				bool removeMaskBranch(MachineInstr &MI, MachineBasicBlock &MBB);

				public:
				static char ID;

				SIRemoveShortExecBranches() : MachineFunctionPass(ID) {
				initializeSIRemoveShortExecBranchesPass(*PassRegistry::getPassRegistry());
				}

				bool runOnMachineFunction(MachineFunction &MF) override;

				StringRef getPassName() const override {
				return "SI remove short exec branches";
				}
				};

				} // End anonymous namespace.

				INITIALIZE_PASS(SIRemoveShortExecBranches, DEBUG_TYPE,
				"SI remove short exec branches", false, false)

				char SIRemoveShortExecBranches::ID = 0;

				char &llvm::SIRemoveShortExecBranchesID = SIRemoveShortExecBranches::ID;

				static bool opcodeEmitsNoInsts(const MachineInstr &MI) {
				if (MI.isMetaInstruction())
				return true;

				// Handle target specific opcodes.
				switch (MI.getOpcode()) {
				case AMDGPU::SI_MASK_BRANCH:
				return true;
				default:
				return false;
				}
				}

				static MachineBasicBlock *getFalseBranch(MachineBasicBlock &SrcMBB,
				MachineBasicBlock *TrueBB) {
				assert(SrcMBB.succ_size() == 2);
				MachineBasicBlock::succ_iterator It = SrcMBB.succ_begin();
				MachineBasicBlock::succ_iterator Next = It;
				++Next;

				return (*It == TrueBB) ? *Next : *It;
				}

				bool SIRemoveShortExecBranches::shouldRetainSkip(
				const MachineBasicBlock &From, const MachineBasicBlock &To) const {
				unsigned NumInstr = 0;
				const MachineFunction *MF = From.getParent();

				for (MachineFunction::const_iterator MBBI(&From), ToI(&To), End = MF->end();
				MBBI != End && MBBI != ToI; ++MBBI) {
				const MachineBasicBlock &MBB = *MBBI;

				for (MachineBasicBlock::const_iterator I = MBB.begin(), E = MBB.end();
				I != E; ++I) {
				if (opcodeEmitsNoInsts(*I))
				continue;

				// When a uniform loop is inside non-uniform control flow, the branch
				// leaving the loop might be an S_CBRANCH_VCCNZ, which is never taken
				// when EXEC = 0. We should skip the loop lest it becomes infinite.
				if (I->getOpcode() == AMDGPU::S_CBRANCH_VCCNZ ||
				I->getOpcode() == AMDGPU::S_CBRANCH_VCCZ)
				return true;

				if (TII->hasUnwantedEffectsWhenEXECEmpty(*I))
				return true;

				// These instructions are potentially expensive even if EXEC = 0.
				if (TII->isSMRD(*I) || TII->isVMEM(*I) || TII->isFLAT(*I) ||
				I->getOpcode() == AMDGPU::S_WAITCNT)
				return true;

				++NumInstr;
				if (NumInstr >= SkipThreshold)
				return true;
				}
				}

				return false;
				}
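
The forward scan suggested in the review thread on this function can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `Block`, `HasUnsafeInstr`, and `mustRetainBranch` are hypothetical names invented for illustration and are not part of the LLVM API or of this patch. Blocks are indexed in layout order, and the branch is considered removable only if the scan reaches the execz target before hitting the threshold or an instruction that must not run with an empty exec mask.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a machine basic block; not the LLVM MachineBasicBlock API.
struct Block {
  unsigned NumInstrs;  // number of instructions in the block
  bool HasUnsafeInstr; // e.g. a call that must not execute with EXEC == 0
};

// Returns true if the execz branch at the end of block `From` that jumps to
// block `Target` must be kept. Blocks are indexed in layout order.
bool mustRetainBranch(const std::vector<Block> &Layout, std::size_t From,
                      std::size_t Target, unsigned SkipThreshold) {
  assert(From < Layout.size() && Target < Layout.size());
  unsigned NumInstr = 0;
  for (std::size_t I = From + 1; I < Layout.size(); ++I) {
    if (I == Target)
      return false; // reached the target: the skipped region is short and safe
    const Block &B = Layout[I];
    if (B.HasUnsafeInstr)
      return true;
    NumInstr += B.NumInstrs;
    if (NumInstr >= SkipThreshold)
      return true;
  }
  // The target was never encountered while scanning forward (for example,
  // a backward branch), so the branch is conservatively retained.
  return true;
}
```

Because the scan never consults dominance, a target that is not reached in layout order simply results in the branch being kept, which is the conservative behaviour the reviewers ask for with irreducible control flow.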

				// Returns true if the skip branch instruction is removed.
				bool SIRemoveShortExecBranches::removeMaskBranch(MachineInstr &MI,
				MachineBasicBlock &SrcMBB) {
				MachineBasicBlock *DestBB = MI.getOperand(0).getMBB();

				if (shouldRetainSkip(getFalseBranch(SrcMBB, DestBB), DestBB))
				return false;

				LLVM_DEBUG(dbgs() << "Removing the exec mask branch: " << MI);
				MI.eraseFromParent();
				SrcMBB.removeSuccessor(DestBB);

				return true;
				}
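
removeMaskBranch depends on getFalseBranch to find the block reached when the execz branch is not taken. The author's reply in the review thread proposes reading that block from the successor list instead, on the assumption that the source block always has exactly two successors. A minimal sketch of that idea follows, using plain integers for block numbers; `falseSuccessor` is a hypothetical helper and not an LLVM function.

```cpp
#include <cassert>
#include <vector>

// Toy model: blocks are identified by number and a block lists its successors.
// Hypothetical helper for illustration only; not part of the LLVM API.
int falseSuccessor(const std::vector<int> &Successors, int TrueDest) {
  // The review thread assumes the source block has exactly two successors.
  assert(Successors.size() == 2 && "expected a two-way branch");
  return Successors[0] == TrueDest ? Successors[1] : Successors[0];
}
```

For the sequence quoted in the thread (bb.0 with successors %bb.3 and %bb.1, and S_CBRANCH_EXECZ %bb.1), `falseSuccessor({3, 1}, 1)` yields 3, which is the block the author says is needed.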

				arsenm: I'm not sure if potentially renumbering the blocks counts as a change, but it probably doesn't really matter. It's not a guaranteed property between passes, I guess.
				cdevadas (author): Inserted the renumbering here to fix (restore) any BB numbering broken by a prior optimization pass that removed or added BBs.
				bool SIRemoveShortExecBranches::runOnMachineFunction(MachineFunction &MF) {
				const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
				TII = ST.getInstrInfo();
				SkipThreshold = SkipThresholdFlag;
				bool Changed = false;

				for (MachineBasicBlock &MBB : MF) {
				MachineBasicBlock::iterator I, Next;
				for (I = MBB.begin(); I != MBB.end(); I = Next) {
				arsenm: You don't need to scan forward through the whole block to find the branches. You can just check getFirstTerminator (or just call analyzeBranch on the block and check the condition type).
				Next = std::next(I);
				MachineInstr &MI = *I;
				switch (MI.getOpcode()) {
				case AMDGPU::S_CBRANCH_EXECZ:
				Changed |= removeMaskBranch(MI, MBB);
				break;
				default:
				break;
				}
				}
				}

				return Changed;
				}
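
The reviewer's point that only the terminators need to be inspected can be sketched with a standalone model. Everything here is an assumption made for illustration: `Opcode`, `Instr`, `firstTerminator`, `tryRemoveExecz`, and `removeShortExecBranches` are hypothetical names, and the retain check is reduced to "the next block in layout order is shorter than the threshold". The driver accumulates the result with `|=` so that a block left untouched does not reset the flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for machine instructions and blocks; not the LLVM API.
enum class Opcode { Other, SCBranchExecz, SBranch };

struct Instr {
  Opcode Op;
  bool IsTerminator;
};

using Block = std::vector<Instr>;

// Index of the first instruction in the trailing run of terminators,
// or B.size() if the block has no terminator.
std::size_t firstTerminator(const Block &B) {
  std::size_t I = B.size();
  while (I > 0 && B[I - 1].IsTerminator)
    --I;
  return I;
}

// Hypothetical stand-in for the retain check: drop the branch only when the
// block that follows in layout order is shorter than SkipThreshold.
bool tryRemoveExecz(std::vector<Block> &Blocks, std::size_t N,
                    unsigned SkipThreshold) {
  assert(N < Blocks.size());
  Block &B = Blocks[N];
  std::size_t T = firstTerminator(B);
  if (T == B.size() || B[T].Op != Opcode::SCBranchExecz)
    return false;
  if (N + 1 >= Blocks.size() || Blocks[N + 1].size() >= SkipThreshold)
    return false;
  B.erase(B.begin() + static_cast<std::ptrdiff_t>(T));
  return true;
}

// Visits each block once and looks only at its first terminator.
bool removeShortExecBranches(std::vector<Block> &Blocks,
                             unsigned SkipThreshold) {
  bool Changed = false;
  for (std::size_t N = 0; N < Blocks.size(); ++N)
    Changed |= tryRemoveExecz(Blocks, N, SkipThreshold);
  return Changed;
}
```

Looking only at the terminator also avoids the iterator bookkeeping (`Next = std::next(I)`) that the full scan needs when an instruction is erased mid-loop.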

test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll

	; GFX1032: v_readlane_b32 s3, v2, 31			; GFX1032: v_readlane_b32 s3, v2, 31
	; GFX1032: v_mov_b32_dpp v1, v2 row_shr:1 row_mask:0xf bank_mask:0xf			; GFX1032: v_mov_b32_dpp v1, v2 row_shr:1 row_mask:0xf bank_mask:0xf
	; GFX1032: v_readlane_b32 s5, v2, 15			; GFX1032: v_readlane_b32 s5, v2, 15
	; GFX1032: v_writelane_b32 v1, s5, 16			; GFX1032: v_writelane_b32 v1, s5, 16
	; GFX1032: s_mov_b32 exec_lo, s4			; GFX1032: s_mov_b32 exec_lo, s4
	; GFX1032: v_cmp_eq_u32_e32 vcc_lo, 0, v0			; GFX1032: v_cmp_eq_u32_e32 vcc_lo, 0, v0
	; GFX1032: s_and_saveexec_b32 s4, vcc_lo			; GFX1032: s_and_saveexec_b32 s4, vcc_lo
	; GFX1032: s_cbranch_execz BB3_2			; GFX1032: s_cbranch_execz BB3_2
	; GFX1032: BB3_1:			; GFX1032: ; %bb.1:
	; GFX1032: v_mov_b32_e32 v0, local_var32@abs32@lo			; GFX1032: v_mov_b32_e32 v0, local_var32@abs32@lo
	; GFX1032: v_mov_b32_e32 v5, s3			; GFX1032: v_mov_b32_e32 v5, s3
	; GFX1032: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX1032: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX1032: s_waitcnt_vscnt null, 0x0			; GFX1032: s_waitcnt_vscnt null, 0x0
	; GFX1032: ds_add_rtn_u32 v0, v0, v5			; GFX1032: ds_add_rtn_u32 v0, v0, v5
	; GFX1032: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX1032: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX1032: buffer_gl0_inv			; GFX1032: buffer_gl0_inv
	; GFX1032: buffer_gl1_inv			; GFX1032: buffer_gl1_inv
	; GFX1064: v_readlane_b32 s3, v2, 63			; GFX1064: v_readlane_b32 s3, v2, 63
	; GFX1064: v_writelane_b32 v1, s6, 32			; GFX1064: v_writelane_b32 v1, s6, 32
	; GFX1064: v_readlane_b32 s6, v2, 47			; GFX1064: v_readlane_b32 s6, v2, 47
	; GFX1064: v_writelane_b32 v1, s6, 48			; GFX1064: v_writelane_b32 v1, s6, 48
	; GFX1064: s_mov_b64 exec, s[4:5]			; GFX1064: s_mov_b64 exec, s[4:5]
	; GFX1064: v_cmp_eq_u32_e32 vcc, 0, v0			; GFX1064: v_cmp_eq_u32_e32 vcc, 0, v0
	; GFX1064: s_and_saveexec_b64 s[4:5], vcc			; GFX1064: s_and_saveexec_b64 s[4:5], vcc
	; GFX1064: s_cbranch_execz BB4_2			; GFX1064: s_cbranch_execz BB4_2
	; GFX1064: BB4_1:			; GFX1064: ; %bb.1:
	; GFX1064: v_mov_b32_e32 v0, local_var32@abs32@lo			; GFX1064: v_mov_b32_e32 v0, local_var32@abs32@lo
	; GFX1064: v_mov_b32_e32 v5, s3			; GFX1064: v_mov_b32_e32 v5, s3
	; GFX1064: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX1064: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX1064: s_waitcnt_vscnt null, 0x0			; GFX1064: s_waitcnt_vscnt null, 0x0
	; GFX1064: ds_add_rtn_u32 v0, v0, v5			; GFX1064: ds_add_rtn_u32 v0, v0, v5
	; GFX1064: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX1064: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX1064: buffer_gl0_inv			; GFX1064: buffer_gl0_inv
	; GFX1064: buffer_gl1_inv			; GFX1064: buffer_gl1_inv

test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll

	; RUN: llc -mtriple=amdgcn-- -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX7LESS %s			; RUN: llc -mtriple=amdgcn-- -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX7LESS %s
	; RUN: llc -mtriple=amdgcn-- -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX8MORE,GFX8MORE64,DPPCOMB %s			; RUN: llc -mtriple=amdgcn-- -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX8MORE,GFX8MORE64,DPPCOMB %s
	; RUN: llc -mtriple=amdgcn-- -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX8MORE,GFX8MORE64,DPPCOMB %s			; RUN: llc -mtriple=amdgcn-- -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX8MORE,GFX8MORE64,DPPCOMB %s
	; RUN: llc -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-wavefrontsize32,+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX8MORE,GFX8MORE64 %s			; RUN: llc -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-wavefrontsize32,+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN64,GFX8MORE,GFX8MORE64 %s
	; RUN: llc -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=+wavefrontsize32,-wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN32,GFX8MORE,GFX8MORE32 %s			; RUN: llc -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=+wavefrontsize32,-wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizations=true -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN32,GFX8MORE,GFX8MORE32 %s

	declare i1 @llvm.amdgcn.wqm.vote(i1)			declare i1 @llvm.amdgcn.wqm.vote(i1)
	declare i32 @llvm.amdgcn.buffer.atomic.add(i32, <4 x i32>, i32, i32, i1)			declare i32 @llvm.amdgcn.buffer.atomic.add(i32, <4 x i32>, i32, i32, i1)
	declare void @llvm.amdgcn.buffer.store.f32(float, <4 x i32>, i32, i32, i1, i1)			declare void @llvm.amdgcn.buffer.store.f32(float, <4 x i32>, i32, i32, i1, i1)

	; Show that what the atomic optimization pass will do for raw buffers.			; Show that what the atomic optimization pass will do for raw buffers.

	; GCN-LABEL: add_i32_constant:			; GCN-LABEL: add_i32_constant:
	; GCN-LABEL: BB0_1:			; GCN-LABEL: ; %bb.{{[0-9]+}}:
	; GCN32: v_cmp_ne_u32_e64 s[[exec_lo:[0-9]+]], 1, 0			; GCN32: v_cmp_ne_u32_e64 s[[exec_lo:[0-9]+]], 1, 0
	; GCN64: v_cmp_ne_u32_e64 s{{\[}}[[exec_lo:[0-9]+]]:[[exec_hi:[0-9]+]]{{\]}}, 1, 0			; GCN64: v_cmp_ne_u32_e64 s{{\[}}[[exec_lo:[0-9]+]]:[[exec_hi:[0-9]+]]{{\]}}, 1, 0
	; GCN: v_mbcnt_lo_u32_b32{{(_e[0-9]+)?}} v[[mbcnt:[0-9]+]], s[[exec_lo]], 0			; GCN: v_mbcnt_lo_u32_b32{{(_e[0-9]+)?}} v[[mbcnt:[0-9]+]], s[[exec_lo]], 0
	; GCN64: v_mbcnt_hi_u32_b32{{(_e[0-9]+)?}} v[[mbcnt]], s[[exec_hi]], v[[mbcnt]]			; GCN64: v_mbcnt_hi_u32_b32{{(_e[0-9]+)?}} v[[mbcnt]], s[[exec_hi]], v[[mbcnt]]
	; GCN: v_cmp_eq_u32{{(_e[0-9]+)?}} vcc{{(_lo)?}}, 0, v[[mbcnt]]			; GCN: v_cmp_eq_u32{{(_e[0-9]+)?}} vcc{{(_lo)?}}, 0, v[[mbcnt]]
	; GCN32: s_bcnt1_i32_b32 s[[popcount:[0-9]+]], s[[exec_lo]]			; GCN32: s_bcnt1_i32_b32 s[[popcount:[0-9]+]], s[[exec_lo]]
	; GCN64: s_bcnt1_i32_b64 s[[popcount:[0-9]+]], s{{\[}}[[exec_lo]]:[[exec_hi]]{{\]}}			; GCN64: s_bcnt1_i32_b64 s[[popcount:[0-9]+]], s{{\[}}[[exec_lo]]:[[exec_hi]]{{\]}}
	; GCN: v_mul_u32_u24{{(_e[0-9]+)?}} v[[value:[0-9]+]], s[[popcount]], 5			; GCN: v_mul_u32_u24{{(_e[0-9]+)?}} v[[value:[0-9]+]], s[[popcount]], 5

test/CodeGen/AMDGPU/branch-condition-and.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s

	; This used to crash because during intermediate control flow lowering, there			; This used to crash because during intermediate control flow lowering, there
	; was a sequence			; was a sequence
	; s_mov_b64 s[0:1], exec			; s_mov_b64 s[0:1], exec
	; s_and_b64 s[2:3], s[0:1], s[2:3] ; def & use of the same register pair			; s_and_b64 s[2:3], s[0:1], s[2:3] ; def & use of the same register pair
	; ...			; ...
	; s_mov_b64_term exec, s[2:3]			; s_mov_b64_term exec, s[2:3]
	; that was not treated correctly.			; that was not treated correctly.
	;			;
	; GCN-LABEL: {{^}}ham:			; GCN-LABEL: {{^}}ham:
	; GCN-DAG: v_cmp_lt_f32_e64 [[OTHERCC:s\[[0-9]+:[0-9]+\]]],			; GCN-DAG: v_cmp_lt_f32_e64 [[OTHERCC:s\[[0-9]+:[0-9]+\]]],
	; GCN-DAG: v_cmp_lt_f32_e32 vcc,			; GCN-DAG: v_cmp_lt_f32_e32 vcc,
	; GCN: s_and_b64 [[AND:s\[[0-9]+:[0-9]+\]]], vcc, [[OTHERCC]]			; GCN: s_and_b64 [[AND:s\[[0-9]+:[0-9]+\]]], vcc, [[OTHERCC]]
	; GCN: s_and_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]], [[AND]]			; GCN: s_and_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]], [[AND]]
	; GCN: ; mask branch [[BB5:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: BB{{[0-9]+_[0-9]+}}: ; %bb4			; GCN-NEXT: ; %bb.{{[0-9]+}}: ; %bb4
	; GCN: ds_write_b32			; GCN: ds_write_b32

	; GCN: [[BB5]]			; GCN: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	; GCN-NEXT: .Lfunc_end			; GCN-NEXT: .Lfunc_end
	define amdgpu_ps void @ham(float %arg, float %arg1) #0 {			define amdgpu_ps void @ham(float %arg, float %arg1) #0 {
	bb:			bb:
	%tmp = fcmp ogt float %arg, 0.000000e+00			%tmp = fcmp ogt float %arg, 0.000000e+00
	%tmp2 = fcmp ogt float %arg1, 0.000000e+00			%tmp2 = fcmp ogt float %arg1, 0.000000e+00
	%tmp3 = and i1 %tmp, %tmp2			%tmp3 = and i1 %tmp, %tmp2
	br i1 %tmp3, label %bb4, label %bb5			br i1 %tmp3, label %bb4, label %bb5

test/CodeGen/AMDGPU/branch-relaxation.ll

		; from firing, which defeats the need to expand the branches and this test.
ret void		ret void
}		}

; Requires expanding of required skip branch.		; Requires expanding of required skip branch.

; GCN-LABEL: {{^}}uniform_inside_divergent:		; GCN-LABEL: {{^}}uniform_inside_divergent:
; GCN: v_cmp_gt_u32_e32 vcc, 16, v{{[0-9]+}}		; GCN: v_cmp_gt_u32_e32 vcc, 16, v{{[0-9]+}}
; GCN-NEXT: s_and_saveexec_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], vcc		; GCN-NEXT: s_and_saveexec_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], vcc
; GCN-NEXT: ; mask branch [[ENDIF:BB[0-9]+_[0-9]+]]
; GCN-NEXT: s_cbranch_execnz [[IF:BB[0-9]+_[0-9]+]]		; GCN-NEXT: s_cbranch_execnz [[IF:BB[0-9]+_[0-9]+]]

; GCN-NEXT: [[LONGBB:BB[0-9]+_[0-9]+]]: ; %entry		; GCN-NEXT: [[LONGBB:BB[0-9]+_[0-9]+]]: ; %entry
; GCN-NEXT: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}		; GCN-NEXT: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], [[BB2:BB[0-9]_[0-9]+]]-([[LONGBB]]+4)		; GCN-NEXT: s_add_u32 s[[PC_LO]], s[[PC_LO]], [[BB2:BB[0-9]_[0-9]+]]-([[LONGBB]]+4)
; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], 0{{$}}		; GCN-NEXT: s_addc_u32 s[[PC_HI]], s[[PC_HI]], 0{{$}}
; GCN-NEXT: s_setpc_b64 s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}		; GCN-NEXT: s_setpc_b64 s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}

; GCN-NEXT: [[IF]]: ; %if		; GCN-NEXT: [[IF]]: ; %if
; GCN: buffer_store_dword		; GCN: buffer_store_dword
; GCN: s_cmp_lg_u32		; GCN: s_cmp_lg_u32
; GCN: s_cbranch_scc1 [[ENDIF]]		; GCN: s_cbranch_scc1 [[ENDIF:BB[0-9]+_[0-9]+]]

; GCN-NEXT: ; %bb.2: ; %if_uniform		; GCN-NEXT: ; %bb.2: ; %if_uniform
; GCN: buffer_store_dword		; GCN: buffer_store_dword

; GCN-NEXT: [[ENDIF]]: ; %endif		; GCN-NEXT: [[ENDIF]]: ; %endif
; GCN-NEXT: s_or_b64 exec, exec, [[MASK]]		; GCN-NEXT: s_or_b64 exec, exec, [[MASK]]
; GCN-NEXT: s_sleep 5		; GCN-NEXT: s_sleep 5
; GCN-NEXT: s_endpgm		; GCN-NEXT: s_endpgm
}		}

; si_mask_branch		; si_mask_branch

; GCN-LABEL: {{^}}analyze_mask_branch:		; GCN-LABEL: {{^}}analyze_mask_branch:
; GCN: v_cmp_nlt_f32_e32 vcc		; GCN: v_cmp_nlt_f32_e32 vcc
; GCN-NEXT: s_and_saveexec_b64 [[TEMP_MASK:s\[[0-9]+:[0-9]+\]]], vcc		; GCN-NEXT: s_and_saveexec_b64 [[TEMP_MASK:s\[[0-9]+:[0-9]+\]]], vcc
; GCN-NEXT: s_xor_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], exec, [[TEMP_MASK]]		; GCN-NEXT: s_xor_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], exec, [[TEMP_MASK]]
; GCN-NEXT: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]

; GCN: [[FLOW]]: ; %Flow		; GCN: BB{{[0-9]+_[0-9]+}}: ; %Flow
; GCN-NEXT: s_or_saveexec_b64 [[TEMP_MASK1:s\[[0-9]+:[0-9]+\]]], [[MASK]]		; GCN-NEXT: s_or_saveexec_b64 [[TEMP_MASK1:s\[[0-9]+:[0-9]+\]]], [[MASK]]
; GCN-NEXT: s_xor_b64 exec, exec, [[TEMP_MASK1]]		; GCN-NEXT: s_xor_b64 exec, exec, [[TEMP_MASK1]]
; GCN-NEXT: ; mask branch [[RET:BB[0-9]+_[0-9]+]]

; GCN: [[LOOP_BODY:BB[0-9]+_[0-9]+]]: ; %loop{{$}}		; GCN: [[LOOP_BODY:BB[0-9]+_[0-9]+]]: ; %loop{{$}}
; GCN: ;;#ASMSTART		; GCN: ;;#ASMSTART
; GCN: v_nop_e64		; GCN: v_nop_e64
; GCN: v_nop_e64		; GCN: v_nop_e64
; GCN: v_nop_e64		; GCN: v_nop_e64
; GCN: v_nop_e64		; GCN: v_nop_e64
; GCN: v_nop_e64		; GCN: v_nop_e64
; GCN: v_nop_e64		; GCN: v_nop_e64
; GCN: ;;#ASMEND		; GCN: ;;#ASMEND
; GCN: s_cbranch_vccz [[RET]]		; GCN: s_cbranch_vccz [[RET:BB[0-9]+_[0-9]+]]

; GCN-NEXT: [[LONGBB:BB[0-9]+_[0-9]+]]: ; %loop		; GCN-NEXT: [[LONGBB:BB[0-9]+_[0-9]+]]: ; %loop
; GCN-NEXT: ; in Loop: Header=[[LOOP_BODY]] Depth=1		; GCN-NEXT: ; in Loop: Header=[[LOOP_BODY]] Depth=1
; GCN-NEXT: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}		; GCN-NEXT: s_getpc_b64 s{{\[}}[[PC_LO:[0-9]+]]:[[PC_HI:[0-9]+]]{{\]}}
; GCN-NEXT: s_sub_u32 s[[PC_LO]], s[[PC_LO]], ([[LONGBB]]+4)-[[LOOP_BODY]]		; GCN-NEXT: s_sub_u32 s[[PC_LO]], s[[PC_LO]], ([[LONGBB]]+4)-[[LOOP_BODY]]
; GCN-NEXT: s_subb_u32 s[[PC_HI]], s[[PC_HI]], 0		; GCN-NEXT: s_subb_u32 s[[PC_HI]], s[[PC_HI]], 0
; GCN-NEXT: s_setpc_b64 s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}		; GCN-NEXT: s_setpc_b64 s{{\[}}[[PC_LO]]:[[PC_HI]]{{\]}}


test/CodeGen/AMDGPU/call-skip.ll

	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii < %s \| FileCheck -enable-var-scope -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii < %s \| FileCheck -enable-var-scope -check-prefix=GCN %s

	; A call should be skipped if all lanes are zero, since we don't know			; A call should be skipped if all lanes are zero, since we don't know
	; what side effects should be avoided inside the call.			; what side effects should be avoided inside the call.
	define hidden void @func() #1 {			define hidden void @func() #1 {
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}if_call:			; GCN-LABEL: {{^}}if_call:
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN-NEXT: ; mask branch [[END:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_cbranch_execz [[END:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: s_cbranch_execz [[END]]
	; GCN: s_swappc_b64			; GCN: s_swappc_b64
	; GCN: [[END]]:			; GCN: [[END]]:
	define void @if_call(i32 %flag) #0 {			define void @if_call(i32 %flag) #0 {
	%cc = icmp eq i32 %flag, 0			%cc = icmp eq i32 %flag, 0
	br i1 %cc, label %call, label %end			br i1 %cc, label %call, label %end

	call:			call:
	call void @func()			call void @func()
	br label %end			br label %end

	end:			end:
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}if_asm:			; GCN-LABEL: {{^}}if_asm:
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN-NEXT: ; mask branch [[END:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_cbranch_execz [[END:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: s_cbranch_execz [[END]]
	; GCN: ; sample asm			; GCN: ; sample asm
	; GCN: [[END]]:			; GCN: [[END]]:
	define void @if_asm(i32 %flag) #0 {			define void @if_asm(i32 %flag) #0 {
	%cc = icmp eq i32 %flag, 0			%cc = icmp eq i32 %flag, 0
	br i1 %cc, label %call, label %end			br i1 %cc, label %call, label %end

	call:			call:
	call void asm sideeffect "; sample asm", ""()			call void asm sideeffect "; sample asm", ""()
	br label %end			br label %end

	end:			end:
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}if_call_kernel:			; GCN-LABEL: {{^}}if_call_kernel:
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN-NEXT: ; mask branch [[END:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_cbranch_execz BB3_2
	; GCN-NEXT: s_cbranch_execz [[END]]
	; GCN: s_swappc_b64			; GCN: s_swappc_b64
	define amdgpu_kernel void @if_call_kernel() #0 {			define amdgpu_kernel void @if_call_kernel() #0 {
	%id = call i32 @llvm.amdgcn.workitem.id.x()			%id = call i32 @llvm.amdgcn.workitem.id.x()
	%cc = icmp eq i32 %id, 0			%cc = icmp eq i32 %id, 0
	br i1 %cc, label %call, label %end			br i1 %cc, label %call, label %end

	call:			call:
	call void @func()			call void @func()

test/CodeGen/AMDGPU/collapse-endcf.ll

	; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,ALL %s			; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,ALL %s
	; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -amdgpu-opt-exec-mask-pre-ra=0 < %s \| FileCheck -enable-var-scope -check-prefixes=DISABLED,ALL %s			; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -amdgpu-opt-exec-mask-pre-ra=0 < %s \| FileCheck -enable-var-scope -check-prefixes=DISABLED,ALL %s

	; ALL-LABEL: {{^}}simple_nested_if:			; ALL-LABEL: {{^}}simple_nested_if:
	; GCN: s_and_saveexec_b64 [[SAVEEXEC:s\[[0-9:]+\]]]			; GCN: s_and_saveexec_b64 [[SAVEEXEC:s\[[0-9:]+\]]]
	; GCN-NEXT: ; mask branch [[ENDIF:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF]]
	; GCN: s_and_b64 exec, exec, vcc			; GCN: s_and_b64 exec, exec, vcc
	; GCN-NEXT: ; mask branch [[ENDIF]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF]]			; GCN-NEXT: s_cbranch_execz [[ENDIF]]
	; GCN-NEXT: {{^BB[0-9_]+}}:			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: {{^}}[[ENDIF]]:			; GCN-NEXT: {{^}}[[ENDIF]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC]]
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: s_endpgm			; GCN: s_endpgm


	; DISABLED: s_or_b64 exec, exec			; DISABLED: s_or_b64 exec, exec

	bb.outer.end: ; preds = %bb.outer.then, %bb.inner.then, %bb			bb.outer.end: ; preds = %bb.outer.then, %bb.inner.then, %bb
	store i32 3, i32 addrspace(3)* null			store i32 3, i32 addrspace(3)* null
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}uncollapsable_nested_if:			; ALL-LABEL: {{^}}uncollapsable_nested_if:
	; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]			; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]
	; GCN-NEXT: ; mask branch [[ENDIF_OUTER:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER]]
	; GCN: s_and_saveexec_b64 [[SAVEEXEC_INNER:s\[[0-9:]+\]]]			; GCN: s_and_saveexec_b64 [[SAVEEXEC_INNER:s\[[0-9:]+\]]]
	; GCN-NEXT: ; mask branch [[ENDIF_INNER:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF_INNER:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF_INNER]]			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: {{^BB[0-9_]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: {{^}}[[ENDIF_INNER]]:			; GCN-NEXT: {{^}}[[ENDIF_INNER]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER]]
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:			; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: s_endpgm			; GCN: s_endpgm

	bb.outer.end: ; preds = %bb.inner.then, %bb			bb.outer.end: ; preds = %bb.inner.then, %bb
	store i32 3, i32 addrspace(3)* null			store i32 3, i32 addrspace(3)* null
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}nested_if_if_else:			; ALL-LABEL: {{^}}nested_if_if_else:
	; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]			; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]
	; GCN-NEXT: ; mask branch [[ENDIF_OUTER:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER]]
	; GCN: s_and_saveexec_b64 [[SAVEEXEC_INNER:s\[[0-9:]+\]]]			; GCN: s_and_saveexec_b64 [[SAVEEXEC_INNER:s\[[0-9:]+\]]]
	; GCN-NEXT: s_xor_b64 [[SAVEEXEC_INNER2:s\[[0-9:]+\]]], exec, [[SAVEEXEC_INNER]]			; GCN-NEXT: s_xor_b64 [[SAVEEXEC_INNER2:s\[[0-9:]+\]]], exec, [[SAVEEXEC_INNER]]
	; GCN-NEXT: ; mask branch [[THEN_INNER:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[THEN_INNER:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[THEN_INNER]]			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: {{^BB[0-9_]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: {{^}}[[THEN_INNER]]:			; GCN-NEXT: {{^}}[[THEN_INNER]]:
	; GCN-NEXT: s_or_saveexec_b64 [[SAVEEXEC_INNER3:s\[[0-9:]+\]]], [[SAVEEXEC_INNER2]]			; GCN-NEXT: s_or_saveexec_b64 [[SAVEEXEC_INNER3:s\[[0-9:]+\]]], [[SAVEEXEC_INNER2]]
	; GCN-NEXT: s_xor_b64 exec, exec, [[SAVEEXEC_INNER3]]			; GCN-NEXT: s_xor_b64 exec, exec, [[SAVEEXEC_INNER3]]
	; GCN-NEXT: ; mask branch [[ENDIF_OUTER]]			; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER]]
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:			; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @nested_if_if_else(i32 addrspace(1)* nocapture %arg) {			define amdgpu_kernel void @nested_if_if_else(i32 addrspace(1)* nocapture %arg) {
	bb:			bb:
	%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()			%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
	bb.outer.end: ; preds = %bb, %bb.then, %bb.else			bb.outer.end: ; preds = %bb, %bb.then, %bb.else
	store i32 3, i32 addrspace(3)* null			store i32 3, i32 addrspace(3)* null
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}nested_if_else_if:			; ALL-LABEL: {{^}}nested_if_else_if:
	; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]			; GCN: s_and_saveexec_b64 [[SAVEEXEC_OUTER:s\[[0-9:]+\]]]
	; GCN-NEXT: s_xor_b64 [[SAVEEXEC_OUTER2:s\[[0-9:]+\]]], exec, [[SAVEEXEC_OUTER]]			; GCN-NEXT: s_xor_b64 [[SAVEEXEC_OUTER2:s\[[0-9:]+\]]], exec, [[SAVEEXEC_OUTER]]
	; GCN-NEXT: ; mask branch [[THEN_OUTER:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[THEN_OUTER:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[THEN_OUTER]]			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: {{^BB[0-9_]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: s_and_saveexec_b64 [[SAVEEXEC_INNER_IF_OUTER_ELSE:s\[[0-9:]+\]]]			; GCN-NEXT: s_and_saveexec_b64 [[SAVEEXEC_INNER_IF_OUTER_ELSE:s\[[0-9:]+\]]]
	; GCN-NEXT: ; mask branch [[THEN_OUTER_FLOW:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[THEN_OUTER_FLOW:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[THEN_OUTER_FLOW]]			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: {{^BB[0-9_]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: {{^}}[[THEN_OUTER_FLOW]]:			; GCN-NEXT: {{^}}[[THEN_OUTER_FLOW]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER_IF_OUTER_ELSE]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER_IF_OUTER_ELSE]]
	; GCN-NEXT: {{^}}[[THEN_OUTER]]:			; GCN-NEXT: {{^}}[[THEN_OUTER]]:
	; GCN-NEXT: s_or_saveexec_b64 [[SAVEEXEC_OUTER3:s\[[0-9:]+\]]], [[SAVEEXEC_OUTER2]]			; GCN-NEXT: s_or_saveexec_b64 [[SAVEEXEC_OUTER3:s\[[0-9:]+\]]], [[SAVEEXEC_OUTER2]]
	; GCN-NEXT: s_xor_b64 exec, exec, [[SAVEEXEC_OUTER3]]			; GCN-NEXT: s_xor_b64 exec, exec, [[SAVEEXEC_OUTER3]]
	; GCN-NEXT: ; mask branch [[ENDIF_OUTER:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF_OUTER]]			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: {{^BB[0-9_]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: s_and_saveexec_b64 [[SAVEEXEC_INNER_IF_OUTER_THEN:s\[[0-9:]+\]]]			; GCN-NEXT: s_and_saveexec_b64 [[SAVEEXEC_INNER_IF_OUTER_THEN:s\[[0-9:]+\]]]
	; GCN-NEXT: ; mask branch [[FLOW1:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[FLOW1:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[FLOW1]]			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: {{^BB[0-9_]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: [[FLOW1]]:			; GCN-NEXT: [[FLOW1]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER_IF_OUTER_THEN]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_INNER_IF_OUTER_THEN]]
	; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:			; GCN-NEXT: {{^}}[[ENDIF_OUTER]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC_OUTER]]
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @nested_if_else_if(i32 addrspace(1)* nocapture %arg) {			define amdgpu_kernel void @nested_if_else_if(i32 addrspace(1)* nocapture %arg) {

	bb.outer.end:			bb.outer.end:
	store i32 3, i32 addrspace(3)* null			store i32 3, i32 addrspace(3)* null
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}s_endpgm_unsafe_barrier:			; ALL-LABEL: {{^}}s_endpgm_unsafe_barrier:
	; GCN: s_and_saveexec_b64 [[SAVEEXEC:s\[[0-9:]+\]]]			; GCN: s_and_saveexec_b64 [[SAVEEXEC:s\[[0-9:]+\]]]
	; GCN-NEXT: ; mask branch [[ENDIF:BB[0-9_]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF:BB[0-9_]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF]]			; GCN-NEXT: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: {{^BB[0-9_]+}}:
	; GCN: store_dword			; GCN: store_dword
	; GCN-NEXT: {{^}}[[ENDIF]]:			; GCN-NEXT: {{^}}[[ENDIF]]:
	; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC]]			; GCN-NEXT: s_or_b64 exec, exec, [[SAVEEXEC]]
	; GCN: s_barrier			; GCN: s_barrier
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	define amdgpu_kernel void @s_endpgm_unsafe_barrier(i32 addrspace(1)* nocapture %arg) {			define amdgpu_kernel void @s_endpgm_unsafe_barrier(i32 addrspace(1)* nocapture %arg) {
	bb:			bb:
	%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()			%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()

test/CodeGen/AMDGPU/control-flow-fastregalloc.ll


	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:20 ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:20 ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:24 ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:24 ; 4-byte Folded Spill

	; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}			; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}

	; GCN: mask branch [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_execz [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: {{^}}BB{{[0-9]+}}_1: ; %if			; GCN: ; %bb.{{[0-9]+}}: ; %if
	; GCN: s_mov_b32 m0, -1			; GCN: s_mov_b32 m0, -1
	; GCN: ds_read_b32 [[LOAD1:v[0-9]+]]			; GCN: ds_read_b32 [[LOAD1:v[0-9]+]]
	; GCN: buffer_load_dword [[RELOAD_LOAD0:v[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword [[RELOAD_LOAD0:v[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: s_waitcnt vmcnt(0) lgkmcnt(0)			; GCN: s_waitcnt vmcnt(0) lgkmcnt(0)


	; Spill val register			; Spill val register
	; GCN: v_add_i32_e32 [[VAL:v[0-9]+]], vcc, [[LOAD1]], [[RELOAD_LOAD0]]			; GCN: v_add_i32_e32 [[VAL:v[0-9]+]], vcc, [[LOAD1]], [[RELOAD_LOAD0]]

	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:24 ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:24 ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:28 ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:28 ; 4-byte Folded Spill

	; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}			; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}

	; GCN-NEXT: ; mask branch [[END:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_cbranch_execz [[END:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: s_cbranch_execz [[END]]


	; GCN: [[LOOP:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP:BB[0-9]+_[0-9]+]]:
	; GCN: buffer_load_dword v[[VAL_LOOP_RELOAD:[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[VAL_LOOP_RELOAD:[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: v_subrev_i32_e32 [[VAL_LOOP:v[0-9]+]], vcc, v{{[0-9]+}}, v[[VAL_LOOP_RELOAD]]			; GCN: v_subrev_i32_e32 [[VAL_LOOP:v[0-9]+]], vcc, v{{[0-9]+}}, v[[VAL_LOOP_RELOAD]]
	; GCN: v_cmp_ne_u32_e32 vcc,			; GCN: v_cmp_ne_u32_e32 vcc,
	; GCN: s_and_b64 vcc, exec, vcc			; GCN: s_and_b64 vcc, exec, vcc
	; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], s7 offset:[[VAL_SUB_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], s7 offset:[[VAL_SUB_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: s_mov_b64 exec, [[CMP0]]			; GCN: s_mov_b64 exec, [[CMP0]]

	; FIXME: It makes no sense to put this skip here			; FIXME: It makes no sense to put this skip here
	; GCN-NEXT: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_execz [[FLOW:BB[0-9]+_[0-9]+]]
	; GCN: s_cbranch_execz [[FLOW]]
	; GCN-NEXT: s_branch [[ELSE:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_branch [[ELSE:BB[0-9]+_[0-9]+]]

	; GCN: [[FLOW]]: ; %Flow			; GCN: [[FLOW]]: ; %Flow
	; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]


	; VMEM: buffer_load_dword v[[FLOW_V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:[[SAVEEXEC_LO_OFFSET]]			; VMEM: buffer_load_dword v[[FLOW_V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:[[SAVEEXEC_LO_OFFSET]]
	(17 lines not shown)

	; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_LO:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_LO]]			; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_LO:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_LO]]
	; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_HI:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_HI:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: buffer_store_dword [[FLOW_VAL]], off, s[0:3], s7 offset:[[RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[FLOW_VAL]], off, s[0:3], s7 offset:[[RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN: s_xor_b64 exec, exec, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO]]:[[FLOW_S_RELOAD_SAVEEXEC_HI]]{{\]}}			; GCN: s_xor_b64 exec, exec, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO]]:[[FLOW_S_RELOAD_SAVEEXEC_HI]]{{\]}}
	; GCN-NEXT: ; mask branch [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF]]


	; GCN: BB{{[0-9]+}}_2: ; %if			; GCN: ; %bb.{{[0-9]+}}: ; %if
	; GCN: ds_read_b32			; GCN: ds_read_b32
	; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: v_add_i32_e32 [[ADD:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]			; GCN: v_add_i32_e32 [[ADD:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]
	; GCN: buffer_store_dword [[ADD]], off, s[0:3], s7 offset:[[RESULT_OFFSET]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[ADD]], off, s[0:3], s7 offset:[[RESULT_OFFSET]] ; 4-byte Folded Spill
	; GCN-NEXT: s_branch [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_branch [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: [[ELSE]]: ; %else			; GCN: [[ELSE]]: ; %else
	; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	(48 lines not shown)

test/CodeGen/AMDGPU/convergent-inlineasm.ll

	; RUN: llc -mtriple=amdgcn--amdhsa -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn--amdhsa -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s

	declare i32 @llvm.amdgcn.workitem.id.x() #0			declare i32 @llvm.amdgcn.workitem.id.x() #0
	; GCN-LABEL: {{^}}convergent_inlineasm:			; GCN-LABEL: {{^}}convergent_inlineasm:
	; GCN: %bb.0:			; GCN: %bb.0:
	; GCN: v_cmp_ne_u32_e64			; GCN: v_cmp_ne_u32_e64
	; GCN: ; mask branch			; GCN: s_cbranch_execz
				arsenm: I assume the branch was here before and this isn't a change?
				cdevadas (author): Yes, the branch was here even earlier.
	; GCN: BB{{[0-9]+_[0-9]+}}:			; GCN: ; %bb.{{[0-9]+}}:
	define amdgpu_kernel void @convergent_inlineasm(i64 addrspace(1)* nocapture %arg) {			define amdgpu_kernel void @convergent_inlineasm(i64 addrspace(1)* nocapture %arg) {
	bb:			bb:
	%tmp = call i32 @llvm.amdgcn.workitem.id.x()			%tmp = call i32 @llvm.amdgcn.workitem.id.x()
	%tmp1 = tail call i64 asm "v_cmp_ne_u32_e64 $0, 0, $1", "=s,v"(i32 1) #1			%tmp1 = tail call i64 asm "v_cmp_ne_u32_e64 $0, 0, $1", "=s,v"(i32 1) #1
	%tmp2 = icmp eq i32 %tmp, 8			%tmp2 = icmp eq i32 %tmp, 8
	br i1 %tmp2, label %bb3, label %bb5			br i1 %tmp2, label %bb3, label %bb5

	bb3: ; preds = %bb			bb3: ; preds = %bb
	%tmp4 = getelementptr i64, i64 addrspace(1)* %arg, i32 %tmp			%tmp4 = getelementptr i64, i64 addrspace(1)* %arg, i32 %tmp
	store i64 %tmp1, i64 addrspace(1)* %arg, align 8			store i64 %tmp1, i64 addrspace(1)* %arg, align 8
	br label %bb5			br label %bb5

	bb5: ; preds = %bb3, %bb			bb5: ; preds = %bb3, %bb
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}nonconvergent_inlineasm:			; GCN-LABEL: {{^}}nonconvergent_inlineasm:
	; GCN: ; mask branch			; GCN: s_cbranch_execz

	; GCN: BB{{[0-9]+_[0-9]+}}:			; GCN: ; %bb.{{[0-9]+}}:
	; GCN: v_cmp_ne_u32_e64			; GCN: v_cmp_ne_u32_e64

	; GCN: BB{{[0-9]+_[0-9]+}}:			; GCN: BB{{[0-9]+_[0-9]+}}:

	define amdgpu_kernel void @nonconvergent_inlineasm(i64 addrspace(1)* nocapture %arg) {			define amdgpu_kernel void @nonconvergent_inlineasm(i64 addrspace(1)* nocapture %arg) {
	bb:			bb:
	%tmp = call i32 @llvm.amdgcn.workitem.id.x()			%tmp = call i32 @llvm.amdgcn.workitem.id.x()
	%tmp1 = tail call i64 asm "v_cmp_ne_u32_e64 $0, 0, $1", "=s,v"(i32 1)			%tmp1 = tail call i64 asm "v_cmp_ne_u32_e64 $0, 0, $1", "=s,v"(i32 1)
	(14 lines not shown)

test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll

	(41 lines not shown)
	; CHECK-NEXT: s_or_b64 s[6:7], s[6:7], exec			; CHECK-NEXT: s_or_b64 s[6:7], s[6:7], exec
	; CHECK-NEXT: s_or_b64 s[8:9], s[8:9], exec			; CHECK-NEXT: s_or_b64 s[8:9], s[8:9], exec
	; CHECK-NEXT: s_cbranch_vccz BB0_2			; CHECK-NEXT: s_cbranch_vccz BB0_2
	; CHECK-NEXT: ; %bb.4: ; %endif1			; CHECK-NEXT: ; %bb.4: ; %endif1
	; CHECK-NEXT: ; in Loop: Header=BB0_3 Depth=1			; CHECK-NEXT: ; in Loop: Header=BB0_3 Depth=1
	; CHECK-NEXT: s_mov_b64 s[6:7], -1			; CHECK-NEXT: s_mov_b64 s[6:7], -1
	; CHECK-NEXT: s_and_saveexec_b64 s[8:9], s[0:1]			; CHECK-NEXT: s_and_saveexec_b64 s[8:9], s[0:1]
	; CHECK-NEXT: s_xor_b64 s[8:9], exec, s[8:9]			; CHECK-NEXT: s_xor_b64 s[8:9], exec, s[8:9]
	; CHECK-NEXT: ; mask branch BB0_1
	; CHECK-NEXT: s_cbranch_execz BB0_1			; CHECK-NEXT: s_cbranch_execz BB0_1
	; CHECK-NEXT: BB0_5: ; %endif2			; CHECK-NEXT: ; %bb.5: ; %endif2
	; CHECK-NEXT: ; in Loop: Header=BB0_3 Depth=1			; CHECK-NEXT: ; in Loop: Header=BB0_3 Depth=1
	; CHECK-NEXT: v_add_u32_e32 v1, 1, v1			; CHECK-NEXT: v_add_u32_e32 v1, 1, v1
	; CHECK-NEXT: s_xor_b64 s[6:7], exec, -1			; CHECK-NEXT: s_xor_b64 s[6:7], exec, -1
	; CHECK-NEXT: s_branch BB0_1			; CHECK-NEXT: s_branch BB0_1
	; CHECK-NEXT: BB0_6: ; %Flow2			; CHECK-NEXT: BB0_6: ; %Flow2
	; CHECK-NEXT: s_or_b64 exec, exec, s[10:11]			; CHECK-NEXT: s_or_b64 exec, exec, s[10:11]
	; CHECK-NEXT: v_mov_b32_e32 v1, 0			; CHECK-NEXT: v_mov_b32_e32 v1, 0
	; CHECK-NEXT: s_and_saveexec_b64 s[0:1], s[2:3]			; CHECK-NEXT: s_and_saveexec_b64 s[0:1], s[2:3]
	; CHECK-NEXT: ; mask branch BB0_8			; CHECK-NEXT: ; %bb.7: ; %if1
	; CHECK-NEXT: BB0_7: ; %if1
	; CHECK-NEXT: v_sqrt_f32_e32 v1, v0			; CHECK-NEXT: v_sqrt_f32_e32 v1, v0
	; CHECK-NEXT: BB0_8: ; %endloop			; CHECK-NEXT: ; %bb.8: ; %endloop
	; CHECK-NEXT: s_or_b64 exec, exec, s[0:1]			; CHECK-NEXT: s_or_b64 exec, exec, s[0:1]
	; CHECK-NEXT: exp mrt0 v1, v1, v1, v1 done vm			; CHECK-NEXT: exp mrt0 v1, v1, v1, v1 done vm
	; CHECK-NEXT: s_endpgm			; CHECK-NEXT: s_endpgm
	; this is the divergent branch with the condition not marked as divergent			; this is the divergent branch with the condition not marked as divergent
	start:			start:
	%v0 = call float @llvm.amdgcn.interp.p1(float %1, i32 0, i32 0, i32 %0)			%v0 = call float @llvm.amdgcn.interp.p1(float %1, i32 0, i32 0, i32 %0)
	br label %loop			br label %loop

	(29 lines not shown)

test/CodeGen/AMDGPU/else.ll

	; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s \| FileCheck %s			; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s \| FileCheck %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck %s

	; CHECK-LABEL: {{^}}else_no_execfix:			; CHECK-LABEL: {{^}}else_no_execfix:
	; CHECK: ; %Flow			; CHECK: ; %Flow
	; CHECK-NEXT: s_or_saveexec_b64 [[DST:s\[[0-9]+:[0-9]+\]]],			; CHECK-NEXT: s_or_saveexec_b64 [[DST:s\[[0-9]+:[0-9]+\]]],
	; CHECK-NEXT: s_xor_b64 exec, exec, [[DST]]			; CHECK-NEXT: s_xor_b64 exec, exec, [[DST]]
	; CHECK-NEXT: ; mask branch
	define amdgpu_ps float @else_no_execfix(i32 %z, float %v) #0 {			define amdgpu_ps float @else_no_execfix(i32 %z, float %v) #0 {
	main_body:			main_body:
	%cc = icmp sgt i32 %z, 5			%cc = icmp sgt i32 %z, 5
	br i1 %cc, label %if, label %else			br i1 %cc, label %if, label %else

	if:			if:
	%v.if = fmul float %v, 2.0			%v.if = fmul float %v, 2.0
	br label %end			br label %end
	(10 lines not shown)
	; CHECK-LABEL: {{^}}else_execfix_leave_wqm:			; CHECK-LABEL: {{^}}else_execfix_leave_wqm:
	; CHECK: ; %bb.0:			; CHECK: ; %bb.0:
	; CHECK-NEXT: s_mov_b64 [[INIT_EXEC:s\[[0-9]+:[0-9]+\]]], exec			; CHECK-NEXT: s_mov_b64 [[INIT_EXEC:s\[[0-9]+:[0-9]+\]]], exec
	; CHECK: ; %Flow			; CHECK: ; %Flow
	; CHECK-NEXT: s_or_saveexec_b64 [[DST:s\[[0-9]+:[0-9]+\]]],			; CHECK-NEXT: s_or_saveexec_b64 [[DST:s\[[0-9]+:[0-9]+\]]],
	; CHECK-NEXT: s_and_b64 exec, exec, [[INIT_EXEC]]			; CHECK-NEXT: s_and_b64 exec, exec, [[INIT_EXEC]]
	; CHECK-NEXT: s_and_b64 [[AND_INIT:s\[[0-9]+:[0-9]+\]]], exec, [[DST]]			; CHECK-NEXT: s_and_b64 [[AND_INIT:s\[[0-9]+:[0-9]+\]]], exec, [[DST]]
	; CHECK-NEXT: s_xor_b64 exec, exec, [[AND_INIT]]			; CHECK-NEXT: s_xor_b64 exec, exec, [[AND_INIT]]
	; CHECK-NEXT: ; mask branch			; CHECK-NEXT: s_cbranch_execz
	define amdgpu_ps void @else_execfix_leave_wqm(i32 %z, float %v) #0 {			define amdgpu_ps void @else_execfix_leave_wqm(i32 %z, float %v) #0 {
	main_body:			main_body:
	%cc = icmp sgt i32 %z, 5			%cc = icmp sgt i32 %z, 5
	br i1 %cc, label %if, label %else			br i1 %cc, label %if, label %else

	if:			if:
	%v.if = fmul float %v, 2.0			%v.if = fmul float %v, 2.0
	br label %end			br label %end
	(19 lines not shown)

test/CodeGen/AMDGPU/hoist-cond.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs -disable-block-placement < %s \| FileCheck %s			; RUN: llc -march=amdgcn -verify-machineinstrs -disable-block-placement < %s \| FileCheck %s

	; Check that invariant compare is hoisted out of the loop.			; Check that invariant compare is hoisted out of the loop.
	; At the same time condition shall not be serialized into a VGPR and deserialized later			; At the same time condition shall not be serialized into a VGPR and deserialized later
	; using another v_cmp + v_cndmask, but used directly in s_and_saveexec_b64.			; using another v_cmp + v_cndmask, but used directly in s_and_saveexec_b64.

	; CHECK: v_cmp_{{..}}_u32_e{{32\|64}} [[COND:s\[[0-9]+:[0-9]+\]\|vcc]]			; CHECK: v_cmp_{{..}}_u32_e{{32\|64}} [[COND:s\[[0-9]+:[0-9]+\]\|vcc]]
	; CHECK: BB0_1:			; CHECK: BB0_1:
	; CHECK-NOT: v_cmp			; CHECK-NOT: v_cmp
	; CHECK_NOT: v_cndmask			; CHECK_NOT: v_cndmask
	; CHECK: s_and_saveexec_b64 s[{{[[0-9]+:[0-9]+}}], [[COND]]			; CHECK: s_and_saveexec_b64 s[{{[[0-9]+:[0-9]+}}], [[COND]]
	; CHECK: BB0_2:			; CHECK: ; %bb.2:

	define amdgpu_kernel void @hoist_cond(float addrspace(1)* nocapture %arg, float addrspace(1)* noalias nocapture readonly %arg1, i32 %arg3, i32 %arg4) {			define amdgpu_kernel void @hoist_cond(float addrspace(1)* nocapture %arg, float addrspace(1)* noalias nocapture readonly %arg1, i32 %arg3, i32 %arg4) {
	bb:			bb:
	%tmp = tail call i32 @llvm.amdgcn.workitem.id.x() #0			%tmp = tail call i32 @llvm.amdgcn.workitem.id.x() #0
	%tmp5 = icmp ult i32 %tmp, %arg3			%tmp5 = icmp ult i32 %tmp, %arg3
	br label %bb1			br label %bb1

	bb1: ; preds = %bb3, %bb			bb1: ; preds = %bb3, %bb
	(26 lines not shown)

test/CodeGen/AMDGPU/insert-skips-flat-vmem.mir

	# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py			# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
	# RUN: llc -march=amdgcn -mcpu=polaris10 -run-pass si-insert-skips -amdgpu-skip-threshold=1 -verify-machineinstrs %s -o - \| FileCheck %s			# RUN: llc -march=amdgcn -mcpu=polaris10 -run-pass si-insert-skips -amdgpu-skip-threshold-legacy=1 -verify-machineinstrs %s -o - \| FileCheck %s

	---			---

	name: skip_execz_flat			name: skip_execz_flat
	body: \|			body: \|
	; CHECK-LABEL: name: skip_execz_flat			; CHECK-LABEL: name: skip_execz_flat
	; CHECK: bb.0:			; CHECK: bb.0:
	; CHECK: successors: %bb.1(0x40000000), %bb.2(0x40000000)			; CHECK: successors: %bb.1(0x40000000), %bb.2(0x40000000)
	(48 lines not shown)

test/CodeGen/AMDGPU/insert-skips-gws.mir

	# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py			# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
	# RUN: llc -march=amdgcn -mcpu=gfx900 -run-pass si-insert-skips -amdgpu-skip-threshold=1 -verify-machineinstrs %s -o - \| FileCheck %s			# RUN: llc -march=amdgcn -mcpu=gfx900 -run-pass si-insert-skips -amdgpu-skip-threshold-legacy=1 -verify-machineinstrs %s -o - \| FileCheck %s
	# Make sure mandatory skips are inserted to ensure GWS ops aren't run with exec = 0			# Make sure mandatory skips are inserted to ensure GWS ops aren't run with exec = 0

	---			---

	name: skip_gws_init			name: skip_gws_init
	body: \|			body: \|
	; CHECK-LABEL: name: skip_gws_init			; CHECK-LABEL: name: skip_gws_init
	; CHECK: bb.0:			; CHECK: bb.0:
	(49 lines not shown)

test/CodeGen/AMDGPU/insert-skips-ignored-insts.mir

	# RUN: llc -mtriple=amdgcn-amd-amdhsa -run-pass si-insert-skips -amdgpu-skip-threshold=2 %s -o - \| FileCheck %s			# RUN: llc -mtriple=amdgcn-amd-amdhsa -run-pass si-insert-skips -amdgpu-skip-threshold-legacy=2 %s -o - \| FileCheck %s

	---			---

	# CHECK-LABEL: name: no_count_mask_branch_pseudo			# CHECK-LABEL: name: no_count_mask_branch_pseudo
	# CHECK: $vgpr1 = V_MOV_B32_e32 7, implicit $exec			# CHECK: $vgpr1 = V_MOV_B32_e32 7, implicit $exec
	# CHECK-NEXT: SI_MASK_BRANCH			# CHECK-NEXT: SI_MASK_BRANCH
	# CHECK-NOT: S_CBRANCH_EXECZ			# CHECK-NOT: S_CBRANCH_EXECZ
	name: no_count_mask_branch_pseudo			name: no_count_mask_branch_pseudo
	(45 lines not shown)

test/CodeGen/AMDGPU/insert-skips-kill-uncond.mir

	# RUN: llc -march=amdgcn -mcpu=polaris10 -run-pass si-insert-skips -amdgpu-skip-threshold=1 %s -o - \| FileCheck %s			# RUN: llc -march=amdgcn -mcpu=polaris10 -run-pass si-insert-skips -amdgpu-skip-threshold-legacy=1 %s -o - \| FileCheck %s
	# https://bugs.freedesktop.org/show_bug.cgi?id=99019			# https://bugs.freedesktop.org/show_bug.cgi?id=99019
	--- \|			--- \|
	define amdgpu_ps void @kill_uncond_branch() {			define amdgpu_ps void @kill_uncond_branch() {
	ret void			ret void
	}			}
	...			...
	---			---

	(31 lines not shown)

test/CodeGen/AMDGPU/mubuf-legalize-operands.ll

	(152 lines not shown)
	; W64: s_waitcnt vmcnt(0)			; W64: s_waitcnt vmcnt(0)
	; W64: buffer_load_format_x [[RES:v[0-9]+]], [[IDX]], s{{\[}}[[SRSRC0]]:[[SRSRC3]]{{\]}}, 0 idxen			; W64: buffer_load_format_x [[RES:v[0-9]+]], [[IDX]], s{{\[}}[[SRSRC0]]:[[SRSRC3]]{{\]}}, 0 idxen
	; W64: s_xor_b64 exec, exec, [[CMP]]			; W64: s_xor_b64 exec, exec, [[CMP]]
	; W64: s_cbranch_execnz [[LOOPBB0]]			; W64: s_cbranch_execnz [[LOOPBB0]]

	; W64: s_mov_b64 exec, [[SAVEEXEC]]			; W64: s_mov_b64 exec, [[SAVEEXEC]]
	; W64: s_cbranch_execz [[TERMBB:BB[0-9]+_[0-9]+]]			; W64: s_cbranch_execz [[TERMBB:BB[0-9]+_[0-9]+]]

	; W64: BB{{[0-9]+_[0-9]+}}:			; W64: ; %bb.{{[0-9]+}}:
	; W64-DAG: v_mov_b32_e32 [[IDX:v[0-9]+]], s4			; W64-DAG: v_mov_b32_e32 [[IDX:v[0-9]+]], s4
	; W64-DAG: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec			; W64-DAG: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec

	; W64: [[LOOPBB1:BB[0-9]+_[0-9]+]]:			; W64: [[LOOPBB1:BB[0-9]+_[0-9]+]]:
	; W64-DAG: v_readfirstlane_b32 s[[SRSRC0:[0-9]+]], v4			; W64-DAG: v_readfirstlane_b32 s[[SRSRC0:[0-9]+]], v4
	; W64-DAG: v_readfirstlane_b32 s[[SRSRC1:[0-9]+]], v5			; W64-DAG: v_readfirstlane_b32 s[[SRSRC1:[0-9]+]], v5
	; W64-DAG: v_readfirstlane_b32 s[[SRSRC2:[0-9]+]], v6			; W64-DAG: v_readfirstlane_b32 s[[SRSRC2:[0-9]+]], v6
	; W64-DAG: v_readfirstlane_b32 s[[SRSRC3:[0-9]+]], v7			; W64-DAG: v_readfirstlane_b32 s[[SRSRC3:[0-9]+]], v7
	(29 lines not shown)
	; W32: s_waitcnt vmcnt(0)			; W32: s_waitcnt vmcnt(0)
	; W32: buffer_load_format_x [[RES:v[0-9]+]], [[IDX]], s{{\[}}[[SRSRC0]]:[[SRSRC3]]{{\]}}, 0 idxen			; W32: buffer_load_format_x [[RES:v[0-9]+]], [[IDX]], s{{\[}}[[SRSRC0]]:[[SRSRC3]]{{\]}}, 0 idxen
	; W32: s_xor_b32 exec_lo, exec_lo, [[CMP]]			; W32: s_xor_b32 exec_lo, exec_lo, [[CMP]]
	; W32: s_cbranch_execnz [[LOOPBB0]]			; W32: s_cbranch_execnz [[LOOPBB0]]

	; W32: s_mov_b32 exec_lo, [[SAVEEXEC]]			; W32: s_mov_b32 exec_lo, [[SAVEEXEC]]
	; W32: s_cbranch_execz [[TERMBB:BB[0-9]+_[0-9]+]]			; W32: s_cbranch_execz [[TERMBB:BB[0-9]+_[0-9]+]]

	; W32: BB{{[0-9]+_[0-9]+}}:			; W32: ; %bb.{{[0-9]+}}:
	; W32-DAG: v_mov_b32_e32 [[IDX:v[0-9]+]], s4			; W32-DAG: v_mov_b32_e32 [[IDX:v[0-9]+]], s4
	; W32-DAG: s_mov_b32 [[SAVEEXEC:s[0-9]+]], exec_lo			; W32-DAG: s_mov_b32 [[SAVEEXEC:s[0-9]+]], exec_lo

	; W32: [[LOOPBB1:BB[0-9]+_[0-9]+]]:			; W32: [[LOOPBB1:BB[0-9]+_[0-9]+]]:
	; W32-DAG: v_readfirstlane_b32 s[[SRSRC0:[0-9]+]], v4			; W32-DAG: v_readfirstlane_b32 s[[SRSRC0:[0-9]+]], v4
	; W32-DAG: v_readfirstlane_b32 s[[SRSRC1:[0-9]+]], v5			; W32-DAG: v_readfirstlane_b32 s[[SRSRC1:[0-9]+]], v5
	; W32-DAG: v_readfirstlane_b32 s[[SRSRC2:[0-9]+]], v6			; W32-DAG: v_readfirstlane_b32 s[[SRSRC2:[0-9]+]], v6
	; W32-DAG: v_readfirstlane_b32 s[[SRSRC3:[0-9]+]], v7			; W32-DAG: v_readfirstlane_b32 s[[SRSRC3:[0-9]+]], v7
	(49 lines not shown)
	; W64-O0: buffer_store_dword [[RES]], off, s[0:3], s32 offset:[[RES_OFF_TMP:[0-9]+]] ; 4-byte Folded Spill			; W64-O0: buffer_store_dword [[RES]], off, s[0:3], s32 offset:[[RES_OFF_TMP:[0-9]+]] ; 4-byte Folded Spill
	; W64-O0: s_xor_b64 exec, exec, [[CMP]]			; W64-O0: s_xor_b64 exec, exec, [[CMP]]
	; W64-O0-NEXT: s_cbranch_execnz [[LOOPBB0]]			; W64-O0-NEXT: s_cbranch_execnz [[LOOPBB0]]
	; CHECK-O0: s_mov_b64 exec, [[SAVEEXEC]]			; CHECK-O0: s_mov_b64 exec, [[SAVEEXEC]]
	; W64-O0: buffer_load_dword [[RES:v[0-9]+]], off, s[0:3], s32 offset:[[RES_OFF_TMP]] ; 4-byte Folded Reload			; W64-O0: buffer_load_dword [[RES:v[0-9]+]], off, s[0:3], s32 offset:[[RES_OFF_TMP]] ; 4-byte Folded Reload
	; W64-O0: buffer_store_dword [[RES]], off, s[0:3], s32 offset:[[RES_OFF:[0-9]+]] ; 4-byte Folded Spill			; W64-O0: buffer_store_dword [[RES]], off, s[0:3], s32 offset:[[RES_OFF:[0-9]+]] ; 4-byte Folded Spill
	; W64-O0: s_cbranch_execz [[TERMBB:BB[0-9]+_[0-9]+]]			; W64-O0: s_cbranch_execz [[TERMBB:BB[0-9]+_[0-9]+]]

	; W64-O0: BB{{[0-9]+_[0-9]+}}:			; W64-O0: ; %bb.{{[0-9]+}}:
	; W64-O0-DAG: s_mov_b64 s{{\[}}[[SAVEEXEC0:[0-9]+]]:[[SAVEEXEC1:[0-9]+]]{{\]}}, exec			; W64-O0-DAG: s_mov_b64 s{{\[}}[[SAVEEXEC0:[0-9]+]]:[[SAVEEXEC1:[0-9]+]]{{\]}}, exec
	; W64-O0-DAG: buffer_store_dword {{v[0-9]+}}, off, s[0:3], s32 offset:[[IDX_OFF:[0-9]+]] ; 4-byte Folded Spill			; W64-O0-DAG: buffer_store_dword {{v[0-9]+}}, off, s[0:3], s32 offset:[[IDX_OFF:[0-9]+]] ; 4-byte Folded Spill
	; W64-O0: v_writelane_b32 [[VSAVEEXEC:v[0-9]+]], s[[SAVEEXEC0]], [[SAVEEXEC_IDX0:[0-9]+]]			; W64-O0: v_writelane_b32 [[VSAVEEXEC:v[0-9]+]], s[[SAVEEXEC0]], [[SAVEEXEC_IDX0:[0-9]+]]
	; W64-O0: v_writelane_b32 [[VSAVEEXEC:v[0-9]+]], s[[SAVEEXEC1]], [[SAVEEXEC_IDX1:[0-9]+]]			; W64-O0: v_writelane_b32 [[VSAVEEXEC:v[0-9]+]], s[[SAVEEXEC1]], [[SAVEEXEC_IDX1:[0-9]+]]

	; W64-O0: [[LOOPBB1:BB[0-9]+_[0-9]+]]:			; W64-O0: [[LOOPBB1:BB[0-9]+_[0-9]+]]:
	; W64-O0: buffer_load_dword v[[VRSRC0:[0-9]+]], {{.*}} ; 4-byte Folded Reload			; W64-O0: buffer_load_dword v[[VRSRC0:[0-9]+]], {{.*}} ; 4-byte Folded Reload
	; W64-O0: s_waitcnt vmcnt(0)			; W64-O0: s_waitcnt vmcnt(0)
	(58 lines not shown)

test/CodeGen/AMDGPU/mul24-pass-ordering.ll

	(52 lines not shown)
	define void @lsr_order_mul24_1(i32 %arg, i32 %arg1, i32 %arg2, float addrspace(3)* nocapture %arg3, i32 %arg4, i32 %arg5, i32 %arg6, i32 %arg7, i32 %arg8, i32 %arg9, float addrspace(1)* nocapture readonly %arg10, i32 %arg11, i32 %arg12, i32 %arg13, i32 %arg14, i32 %arg15, i32 %arg16, i1 zeroext %arg17, i1 zeroext %arg18) #0 {			define void @lsr_order_mul24_1(i32 %arg, i32 %arg1, i32 %arg2, float addrspace(3)* nocapture %arg3, i32 %arg4, i32 %arg5, i32 %arg6, i32 %arg7, i32 %arg8, i32 %arg9, float addrspace(1)* nocapture readonly %arg10, i32 %arg11, i32 %arg12, i32 %arg13, i32 %arg14, i32 %arg15, i32 %arg16, i1 zeroext %arg17, i1 zeroext %arg18) #0 {
	; GFX9-LABEL: lsr_order_mul24_1:			; GFX9-LABEL: lsr_order_mul24_1:
	; GFX9: ; %bb.0: ; %bb			; GFX9: ; %bb.0: ; %bb
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: v_and_b32_e32 v5, 1, v18			; GFX9-NEXT: v_and_b32_e32 v5, 1, v18
	; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v5			; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v5
	; GFX9-NEXT: v_cmp_lt_u32_e64 s[4:5], v0, v1			; GFX9-NEXT: v_cmp_lt_u32_e64 s[4:5], v0, v1
	; GFX9-NEXT: s_and_saveexec_b64 s[10:11], s[4:5]			; GFX9-NEXT: s_and_saveexec_b64 s[10:11], s[4:5]
	; GFX9-NEXT: ; mask branch BB1_4
	; GFX9-NEXT: s_cbranch_execz BB1_4			; GFX9-NEXT: s_cbranch_execz BB1_4
	; GFX9-NEXT: BB1_1: ; %bb19			; GFX9-NEXT: ; %bb.1: ; %bb19
	; GFX9-NEXT: v_cvt_f32_u32_e32 v7, v6			; GFX9-NEXT: v_cvt_f32_u32_e32 v7, v6
	; GFX9-NEXT: v_and_b32_e32 v5, 0xffffff, v6			; GFX9-NEXT: v_and_b32_e32 v5, 0xffffff, v6
	; GFX9-NEXT: v_add_u32_e32 v6, v4, v0			; GFX9-NEXT: v_add_u32_e32 v6, v4, v0
	; GFX9-NEXT: v_lshl_add_u32 v3, v6, 2, v3			; GFX9-NEXT: v_lshl_add_u32 v3, v6, 2, v3
	; GFX9-NEXT: v_rcp_iflag_f32_e32 v4, v7			; GFX9-NEXT: v_rcp_iflag_f32_e32 v4, v7
	; GFX9-NEXT: v_lshlrev_b32_e32 v6, 2, v2			; GFX9-NEXT: v_lshlrev_b32_e32 v6, 2, v2
	; GFX9-NEXT: v_add_u32_e32 v7, v17, v12			; GFX9-NEXT: v_add_u32_e32 v7, v17, v12
	; GFX9-NEXT: s_mov_b64 s[12:13], 0			; GFX9-NEXT: s_mov_b64 s[12:13], 0
	(190 lines not shown)

test/CodeGen/AMDGPU/ret_jump.ll

	; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s

	; This should end with an no-op sequence of exec mask manipulations			; This should end with an no-op sequence of exec mask manipulations
	; Mask should be in original state after executed unreachable block			; Mask should be in original state after executed unreachable block


	; GCN-LABEL: {{^}}uniform_br_trivial_ret_divergent_br_trivial_unreachable:			; GCN-LABEL: {{^}}uniform_br_trivial_ret_divergent_br_trivial_unreachable:
	; GCN: s_cbranch_scc1 [[RET_BB:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_scc1 [[RET_BB:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: ; %else			; GCN-NEXT: ; %else

	; GCN: s_and_saveexec_b64 [[SAVE_EXEC:s\[[0-9]+:[0-9]+\]]], vcc			; GCN: s_and_saveexec_b64 [[SAVE_EXEC:s\[[0-9]+:[0-9]+\]]], vcc
	; GCN-NEXT: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]

	; GCN: BB{{[0-9]+_[0-9]+}}: ; %unreachable.bb			; GCN: ; %bb.{{[0-9]+}}: ; %unreachable.bb
	; GCN-NEXT: ; divergent unreachable			; GCN-NEXT: ; divergent unreachable

	; GCN-NEXT: {{^}}[[FLOW]]: ; %Flow			; GCN-NEXT: ; %bb.{{[0-9]+}}: ; %Flow
	; GCN-NEXT: s_or_b64 exec, exec			; GCN-NEXT: s_or_b64 exec, exec

	; GCN-NEXT: [[RET_BB]]:			; GCN-NEXT: [[RET_BB]]:
	; GCN-NEXT: ; return			; GCN-NEXT: ; return
	; GCN-NEXT: .Lfunc_end0			; GCN-NEXT: .Lfunc_end0
	define amdgpu_ps <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @uniform_br_trivial_ret_divergent_br_trivial_unreachable([9 x <4 x i32>] addrspace(4)* inreg %arg, [17 x <4 x i32>] addrspace(4)* inreg %arg1, [17 x <8 x i32>] addrspace(4)* inreg %arg2, i32 addrspace(4)* inreg %arg3, float inreg %arg4, i32 inreg %arg5, <2 x i32> %arg6, <2 x i32> %arg7, <2 x i32> %arg8, <3 x i32> %arg9, <2 x i32> %arg10, <2 x i32> %arg11, <2 x i32> %arg12, float %arg13, float %arg14, float %arg15, float %arg16, i32 inreg %arg17, i32 %arg18, i32 %arg19, float %arg20, i32 %arg21) #0 {			define amdgpu_ps <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @uniform_br_trivial_ret_divergent_br_trivial_unreachable([9 x <4 x i32>] addrspace(4)* inreg %arg, [17 x <4 x i32>] addrspace(4)* inreg %arg1, [17 x <8 x i32>] addrspace(4)* inreg %arg2, i32 addrspace(4)* inreg %arg3, float inreg %arg4, i32 inreg %arg5, <2 x i32> %arg6, <2 x i32> %arg7, <2 x i32> %arg8, <3 x i32> %arg9, <2 x i32> %arg10, <2 x i32> %arg11, <2 x i32> %arg12, float %arg13, float %arg14, float %arg15, float %arg16, i32 inreg %arg17, i32 %arg18, i32 %arg19, float %arg20, i32 %arg21) #0 {
	entry:			entry:
	%i.i = extractelement <2 x i32> %arg7, i32 0			%i.i = extractelement <2 x i32> %arg7, i32 0
	(22 lines not shown)
	unreachable.bb: ; preds = %else			unreachable.bb: ; preds = %else
	unreachable			unreachable

	ret.bb: ; preds = %else, %main_body			ret.bb: ; preds = %else, %main_body
	ret <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef			ret <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef
	}			}

	; GCN-LABEL: {{^}}uniform_br_nontrivial_ret_divergent_br_nontrivial_unreachable:			; GCN-LABEL: {{^}}uniform_br_nontrivial_ret_divergent_br_nontrivial_unreachable:
	; GCN: s_cbranch_vccnz [[RET_BB:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_vccz

	; GCN: ; %bb.{{[0-9]+}}: ; %else			; GCN: ; %bb.{{[0-9]+}}: ; %Flow
				; GCN: s_cbranch_execnz [[RETURN:BB[0-9]+_[0-9]+]]

				; GCN: ; %UnifiedReturnBlock
				; GCN-NEXT: s_or_b64 exec, exec
				; GCN-NEXT: s_waitcnt

				; GCN: BB{{[0-9]+_[0-9]+}}: ; %else
	; GCN: s_and_saveexec_b64 [[SAVE_EXEC:s\[[0-9]+:[0-9]+\]]], vcc			; GCN: s_and_saveexec_b64 [[SAVE_EXEC:s\[[0-9]+:[0-9]+\]]], vcc
	; GCN-NEXT: ; mask branch [[FLOW1:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: ; %unreachable.bb			; GCN-NEXT: ; %unreachable.bb
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: ; divergent unreachable			; GCN: ; divergent unreachable

	; GCN: ; %ret.bb			; GCN: ; %ret.bb
	; GCN: store_dword			; GCN: store_dword

	; GCN: ; %UnifiedReturnBlock
	; GCN-NEXT: s_or_b64 exec, exec
	; GCN-NEXT: s_waitcnt
	; GCN-NEXT: ; return
	; GCN-NEXT: .Lfunc_end
	define amdgpu_ps <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @uniform_br_nontrivial_ret_divergent_br_nontrivial_unreachable([9 x <4 x i32>] addrspace(4)* inreg %arg, [17 x <4 x i32>] addrspace(4)* inreg %arg1, [17 x <8 x i32>] addrspace(4)* inreg %arg2, i32 addrspace(4)* inreg %arg3, float inreg %arg4, i32 inreg %arg5, <2 x i32> %arg6, <2 x i32> %arg7, <2 x i32> %arg8, <3 x i32> %arg9, <2 x i32> %arg10, <2 x i32> %arg11, <2 x i32> %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, i32 inreg %arg18, i32 %arg19, float %arg20, i32 %arg21) #0 {			define amdgpu_ps <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @uniform_br_nontrivial_ret_divergent_br_nontrivial_unreachable([9 x <4 x i32>] addrspace(4)* inreg %arg, [17 x <4 x i32>] addrspace(4)* inreg %arg1, [17 x <8 x i32>] addrspace(4)* inreg %arg2, i32 addrspace(4)* inreg %arg3, float inreg %arg4, i32 inreg %arg5, <2 x i32> %arg6, <2 x i32> %arg7, <2 x i32> %arg8, <3 x i32> %arg9, <2 x i32> %arg10, <2 x i32> %arg11, <2 x i32> %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, i32 inreg %arg18, i32 %arg19, float %arg20, i32 %arg21) #0 {
	main_body:			main_body:
	%i.i = extractelement <2 x i32> %arg7, i32 0			%i.i = extractelement <2 x i32> %arg7, i32 0
	%j.i = extractelement <2 x i32> %arg7, i32 1			%j.i = extractelement <2 x i32> %arg7, i32 1
	%i.f.i = bitcast i32 %i.i to float			%i.f.i = bitcast i32 %i.i to float
	%j.f.i = bitcast i32 %j.i to float			%j.f.i = bitcast i32 %j.i to float
	%p1.i = call float @llvm.amdgcn.interp.p1(float %i.f.i, i32 1, i32 0, i32 %arg5) #2			%p1.i = call float @llvm.amdgcn.interp.p1(float %i.f.i, i32 1, i32 0, i32 %arg5) #2
	%p2 = call float @llvm.amdgcn.interp.p2(float %p1.i, float %j.f.i, i32 1, i32 0, i32 %arg5) #2			%p2 = call float @llvm.amdgcn.interp.p2(float %p1.i, float %j.f.i, i32 1, i32 0, i32 %arg5) #2
	(47 lines not shown)

test/CodeGen/AMDGPU/si-annotate-cf-noloop.ll

	(34 lines not shown)

	; OPT-LABEL: @annotate_ret_noloop(			; OPT-LABEL: @annotate_ret_noloop(
	; OPT-NOT: call i1 @llvm.amdgcn.loop			; OPT-NOT: call i1 @llvm.amdgcn.loop

	; GCN-LABEL: {{^}}annotate_ret_noloop:			; GCN-LABEL: {{^}}annotate_ret_noloop:
	; GCN: load_dwordx4			; GCN: load_dwordx4
	; GCN: v_cmp_nlt_f32			; GCN: v_cmp_nlt_f32
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: ; mask branch [[UNIFIED_RET:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: [[UNIFIED_RET]]:
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	; GCN: .Lfunc_end			; GCN: .Lfunc_end
	define amdgpu_kernel void @annotate_ret_noloop(<4 x float> addrspace(1)* noalias nocapture readonly %arg) #0 {			define amdgpu_kernel void @annotate_ret_noloop(<4 x float> addrspace(1)* noalias nocapture readonly %arg) #0 {
	bb:			bb:
	%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()			%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
	br label %bb1			br label %bb1

	bb1: ; preds = %bb			bb1: ; preds = %bb
	(54 lines not shown)

test/CodeGen/AMDGPU/si-lower-control-flow-unreachable-block.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s

	; GCN-LABEL: {{^}}lower_control_flow_unreachable_terminator:			; GCN-LABEL: {{^}}lower_control_flow_unreachable_terminator:
	; GCN: v_cmp_eq_u32			; GCN: v_cmp_eq_u32
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: ; mask branch [[RET:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: BB{{[0-9]+_[0-9]+}}: ; %unreachable			; GCN-NEXT: ; %bb.{{[0-9]+}}: ; %unreachable
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: ; divergent unreachable			; GCN: ; divergent unreachable

	; GCN-NEXT: [[RET]]: ; %UnifiedReturnBlock			; GCN-NEXT: ; %bb.{{[0-9]+}}: ; %UnifiedReturnBlock
	; GCN: s_endpgm			; GCN: s_endpgm

	define amdgpu_kernel void @lower_control_flow_unreachable_terminator() #0 {			define amdgpu_kernel void @lower_control_flow_unreachable_terminator() #0 {
	bb:			bb:
	%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()			%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()
	%tmp63 = icmp eq i32 %tmp15, 32			%tmp63 = icmp eq i32 %tmp15, 32
	br i1 %tmp63, label %unreachable, label %ret			br i1 %tmp63, label %unreachable, label %ret

	unreachable:			unreachable:
	store volatile i32 0, i32 addrspace(3)* undef, align 4			store volatile i32 0, i32 addrspace(3)* undef, align 4
	unreachable			unreachable

	ret:			ret:
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}lower_control_flow_unreachable_terminator_swap_block_order:			; GCN-LABEL: {{^}}lower_control_flow_unreachable_terminator_swap_block_order:
	; GCN: v_cmp_ne_u32			; GCN: v_cmp_ne_u32
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: ; mask branch [[RETURN:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %unreachable			; GCN-NEXT: ; %bb.{{[0-9]+}}: ; %unreachable
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: ; divergent unreachable			; GCN: ; divergent unreachable

	; GCN: [[RETURN]]:			; GCN: ; %bb.{{[0-9]+}}:
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	define amdgpu_kernel void @lower_control_flow_unreachable_terminator_swap_block_order() #0 {			define amdgpu_kernel void @lower_control_flow_unreachable_terminator_swap_block_order() #0 {
	bb:			bb:
	%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()			%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()
	%tmp63 = icmp eq i32 %tmp15, 32			%tmp63 = icmp eq i32 %tmp15, 32
	br i1 %tmp63, label %ret, label %unreachable			br i1 %tmp63, label %ret, label %unreachable

	ret:			ret:
	(34 lines not shown)

test/CodeGen/AMDGPU/si-lower-control-flow.mir

	(26 lines not shown)
	body: |			body: |
	; GCN-LABEL: name: preserve_undef_flag_si_if_src			; GCN-LABEL: name: preserve_undef_flag_si_if_src
	; GCN: bb.0:			; GCN: bb.0:
	; GCN: successors: %bb.1(0x40000000), %bb.2(0x40000000)			; GCN: successors: %bb.1(0x40000000), %bb.2(0x40000000)
	; GCN: [[COPY:%[0-9]+]]:sreg_64 = COPY $exec, implicit-def $exec			; GCN: [[COPY:%[0-9]+]]:sreg_64 = COPY $exec, implicit-def $exec
	; GCN: [[S_AND_B64_:%[0-9]+]]:sreg_64 = S_AND_B64 [[COPY]], undef %1:sreg_64, implicit-def dead $scc			; GCN: [[S_AND_B64_:%[0-9]+]]:sreg_64 = S_AND_B64 [[COPY]], undef %1:sreg_64, implicit-def dead $scc
	; GCN: [[S_XOR_B64_:%[0-9]+]]:sreg_64 = S_XOR_B64 [[S_AND_B64_]], [[COPY]], implicit-def dead $scc			; GCN: [[S_XOR_B64_:%[0-9]+]]:sreg_64 = S_XOR_B64 [[S_AND_B64_]], [[COPY]], implicit-def dead $scc
	; GCN: $exec = S_MOV_B64_term killed [[S_AND_B64_]]			; GCN: $exec = S_MOV_B64_term killed [[S_AND_B64_]]
	; GCN: SI_MASK_BRANCH %bb.2, implicit $exec			; GCN: S_CBRANCH_EXECZ %bb.2, implicit $exec
	; GCN: S_BRANCH %bb.1			; GCN: S_BRANCH %bb.1
	; GCN: bb.1:			; GCN: bb.1:
	; GCN: successors: %bb.2(0x80000000)			; GCN: successors: %bb.2(0x80000000)
	; GCN: bb.2:			; GCN: bb.2:
	; GCN: S_ENDPGM 0			; GCN: S_ENDPGM 0
	bb.0:			bb.0:
	successors: %bb.1, %bb.2			successors: %bb.1, %bb.2

	(10 lines not shown)

test/CodeGen/AMDGPU/skip-branch-taildup-ret.mir

	# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py			# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
	# RUN: llc -mtriple=amdgcn-amd-amdhsa -verify-machineinstrs -run-pass=si-insert-skips -amdgpu-skip-threshold=1000000 -o - %s | FileCheck %s			# RUN: llc -mtriple=amdgcn-amd-amdhsa -verify-machineinstrs -run-pass=si-insert-skips -amdgpu-skip-threshold-legacy=1000000 -o - %s | FileCheck %s

	---			---
	name: skip_branch_taildup_endpgm			name: skip_branch_taildup_endpgm
	machineFunctionInfo:			machineFunctionInfo:
	isEntryFunction: true			isEntryFunction: true
	body: |			body: |
	; CHECK-LABEL: name: skip_branch_taildup_endpgm			; CHECK-LABEL: name: skip_branch_taildup_endpgm
	; CHECK: bb.0:			; CHECK: bb.0:
	(184 lines not shown)

test/CodeGen/AMDGPU/skip-branch-trap.ll

	; RUN: llc -mtriple=amdgcn--amdhsa -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN -check-prefix=HSA-TRAP %s			; RUN: llc -mtriple=amdgcn--amdhsa -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN -check-prefix=HSA-TRAP %s

	; FIXME: merge with trap.ll			; FIXME: merge with trap.ll

	; An s_cbranch_execnz is required to avoid trapping if all lanes are 0			; An s_cbranch_execnz is required to avoid trapping if all lanes are 0
	; GCN-LABEL: {{^}}trap_divergent_branch:			; GCN-LABEL: {{^}}trap_divergent_branch:
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: s_cbranch_execz [[ENDPGM:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_execnz [[TRAP:BB[0-9]+_[0-9]+]]
	; GCN: s_branch [[TRAP:BB[0-9]+_[0-9]+]]			; GCN: ; %bb.{{[0-9]+}}:
	; GCN: [[ENDPGM]]:
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	; GCN: [[TRAP]]:			; GCN: [[TRAP]]:
	; GCN: s_trap 2			; GCN: s_trap 2
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	define amdgpu_kernel void @trap_divergent_branch(i32 addrspace(1)* nocapture readonly %arg) {			define amdgpu_kernel void @trap_divergent_branch(i32 addrspace(1)* nocapture readonly %arg) {
	%id = call i32 @llvm.amdgcn.workitem.id.x()			%id = call i32 @llvm.amdgcn.workitem.id.x()
	%gep = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %id			%gep = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %id
	%divergent.val = load i32, i32 addrspace(1)* %gep			%divergent.val = load i32, i32 addrspace(1)* %gep
	%cmp = icmp eq i32 %divergent.val, 0			%cmp = icmp eq i32 %divergent.val, 0
	br i1 %cmp, label %bb, label %end			br i1 %cmp, label %bb, label %end

	bb:			bb:
	call void @llvm.trap()			call void @llvm.trap()
	br label %end			br label %end

	end:			end:
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}debugtrap_divergent_branch:			; GCN-LABEL: {{^}}debugtrap_divergent_branch:
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: s_cbranch_execz [[ENDPGM:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_execz [[ENDPGM:BB[0-9]+_[0-9]+]]
	; GCN: BB{{[0-9]+}}_{{[0-9]+}}:			; GCN: ; %bb.{{[0-9]+}}:
	; GCN: s_trap 3			; GCN: s_trap 3
	; GCN-NEXT: [[ENDPGM]]:			; GCN-NEXT: [[ENDPGM]]:
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	define amdgpu_kernel void @debugtrap_divergent_branch(i32 addrspace(1)* nocapture readonly %arg) {			define amdgpu_kernel void @debugtrap_divergent_branch(i32 addrspace(1)* nocapture readonly %arg) {
	%id = call i32 @llvm.amdgcn.workitem.id.x()			%id = call i32 @llvm.amdgcn.workitem.id.x()
	%gep = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %id			%gep = getelementptr inbounds i32, i32 addrspace(1)* %arg, i32 %id
	%divergent.val = load i32, i32 addrspace(1)* %gep			%divergent.val = load i32, i32 addrspace(1)* %gep
	%cmp = icmp eq i32 %divergent.val, 0			%cmp = icmp eq i32 %divergent.val, 0
	(17 lines not shown)

test/CodeGen/AMDGPU/skip-if-dead.ll

(214 lines not shown; context: exit:)
store float %phi, float addrspace(1)* undef		store float %phi, float addrspace(1)* undef
ret void		ret void
}		}

; CHECK-LABEL: {{^}}test_kill_divergent_loop:		; CHECK-LABEL: {{^}}test_kill_divergent_loop:
; CHECK: v_cmp_eq_u32_e32 vcc, 0, v0		; CHECK: v_cmp_eq_u32_e32 vcc, 0, v0
; CHECK-NEXT: s_and_saveexec_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], vcc		; CHECK-NEXT: s_and_saveexec_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], vcc
; CHECK-NEXT: s_xor_b64 [[SAVEEXEC]], exec, [[SAVEEXEC]]		; CHECK-NEXT: s_xor_b64 [[SAVEEXEC]], exec, [[SAVEEXEC]]
; CHECK-NEXT: ; mask branch [[EXIT:BB[0-9]+_[0-9]+]]		; CHECK-NEXT: s_cbranch_execz [[EXIT:BB[0-9]+_[0-9]+]]
; CHECK-NEXT: s_cbranch_execz [[EXIT]]

; CHECK: {{BB[0-9]+_[0-9]+}}: ; %bb.preheader		; CHECK: ; %bb.{{[0-9]+}}: ; %bb.preheader
; CHECK: s_mov_b32		; CHECK: s_mov_b32

; CHECK: [[LOOP_BB:BB[0-9]+_[0-9]+]]:		; CHECK: [[LOOP_BB:BB[0-9]+_[0-9]+]]:

; CHECK: v_mov_b32_e64 v7, -1		; CHECK: v_mov_b32_e64 v7, -1
; CHECK: v_nop_e64		; CHECK: v_nop_e64
; CHECK: v_cmpx_gt_f32_e32 vcc, 0, v7		; CHECK: v_cmpx_gt_f32_e32 vcc, 0, v7

(117 lines not shown)
bb7: ; preds = %bb4		bb7: ; preds = %bb4
ret void		ret void
}		}

; CHECK-LABEL: {{^}}if_after_kill_block:		; CHECK-LABEL: {{^}}if_after_kill_block:
; CHECK: ; %bb.0:		; CHECK: ; %bb.0:
; CHECK: s_and_saveexec_b64		; CHECK: s_and_saveexec_b64
; CHECK: s_xor_b64		; CHECK: s_xor_b64
; CHECK-NEXT: mask branch [[BB4:BB[0-9]+_[0-9]+]]

; CHECK: v_cmpx_gt_f32_e32 vcc, 0,		; CHECK: v_cmpx_gt_f32_e32 vcc, 0,
; CHECK: [[BB4]]:		; CHECK: BB{{[0-9]+_[0-9]+}}:
; CHECK: s_or_b64 exec, exec		; CHECK: s_or_b64 exec, exec
; CHECK: image_sample_c		; CHECK: image_sample_c

; CHECK: v_cmp_neq_f32_e32 vcc, 0,		; CHECK: v_cmp_neq_f32_e32 vcc, 0,
; CHECK: s_and_saveexec_b64 s{{\[[0-9]+:[0-9]+\]}}, vcc		; CHECK: s_and_saveexec_b64 s{{\[[0-9]+:[0-9]+\]}}, vcc
; CHECK: mask branch [[END:BB[0-9]+_[0-9]+]]		; CHECK-NEXT: s_cbranch_execz [[END:BB[0-9]+_[0-9]+]]
; CHECK-NEXT: s_cbranch_execz [[END]]
; CHECK-NOT: branch		; CHECK-NOT: branch

; CHECK: BB{{[0-9]+_[0-9]+}}: ; %bb8		; CHECK: ; %bb.{{[0-9]+}}: ; %bb8
; CHECK: buffer_store_dword		; CHECK: buffer_store_dword

; CHECK: [[END]]:		; CHECK: [[END]]:
; CHECK: s_endpgm		; CHECK: s_endpgm
define amdgpu_ps void @if_after_kill_block(float %arg, float %arg1, float %arg2, float %arg3) #0 {		define amdgpu_ps void @if_after_kill_block(float %arg, float %arg1, float %arg2, float %arg3) #0 {
bb:		bb:
%tmp = fcmp ult float %arg1, 0.000000e+00		%tmp = fcmp ult float %arg1, 0.000000e+00
br i1 %tmp, label %bb3, label %bb4		br i1 %tmp, label %bb3, label %bb4
(25 lines not shown)

test/CodeGen/AMDGPU/subreg-coalescer-undef-use.ll

	; RUN: llc -march=amdgcn -mcpu=tahiti -amdgpu-dce-in-ra=0 -o - %s | FileCheck %s			; RUN: llc -march=amdgcn -mcpu=tahiti -amdgpu-dce-in-ra=0 -o - %s | FileCheck %s
	; Don't crash when the use of an undefined value is only detected by the			; Don't crash when the use of an undefined value is only detected by the
	; register coalescer because it is hidden with subregister insert/extract.			; register coalescer because it is hidden with subregister insert/extract.
	target triple="amdgcn--"			target triple="amdgcn--"

	; CHECK-LABEL: foobar:			; CHECK-LABEL: foobar:
	; CHECK: s_load_dwordx2 s[4:5], s[0:1], 0x9			; CHECK: s_load_dwordx2 s[4:5], s[0:1], 0x9
	; CHECK-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0xb			; CHECK-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0xb
	; CHECK-NEXT: v_mbcnt_lo_u32_b32_e64			; CHECK-NEXT: v_mbcnt_lo_u32_b32_e64
	; CHECK-NEXT: s_mov_b32 s2, -1			; CHECK-NEXT: s_mov_b32 s2, -1
	; CHECK-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0			; CHECK-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
	; CHECK-NEXT: s_waitcnt lgkmcnt(0)			; CHECK-NEXT: s_waitcnt lgkmcnt(0)
	; CHECK-NEXT: v_mov_b32_e32 v1, s5			; CHECK-NEXT: v_mov_b32_e32 v1, s5
	; CHECK-NEXT: s_and_saveexec_b64 s[4:5], vcc			; CHECK-NEXT: s_and_saveexec_b64 s[4:5], vcc

	; CHECK: BB0_1:			; CHECK: ; %bb.1:
	; CHECK-NEXT: ; kill: def $vgpr0_vgpr1 killed $sgpr4_sgpr5 killed $exec			; CHECK-NEXT: ; kill: def $vgpr0_vgpr1 killed $sgpr4_sgpr5 killed $exec
	; CHECK-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3			; CHECK-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3

	; CHECK: BB0_2:			; CHECK: ; %bb.2:
	; CHECK: s_or_b64 exec, exec, s[4:5]			; CHECK: s_or_b64 exec, exec, s[4:5]
	; CHECK-NEXT: s_mov_b32 s3, 0xf000			; CHECK-NEXT: s_mov_b32 s3, 0xf000
	; CHECK-NEXT: buffer_store_dword v1, off, s[0:3], 0			; CHECK-NEXT: buffer_store_dword v1, off, s[0:3], 0
	; CHECK-NEXT: s_endpgm			; CHECK-NEXT: s_endpgm
	define amdgpu_kernel void @foobar(float %a0, float %a1, float addrspace(1)* %out) nounwind {			define amdgpu_kernel void @foobar(float %a0, float %a1, float addrspace(1)* %out) nounwind {
	entry:			entry:
	%v0 = insertelement <4 x float> undef, float %a0, i32 0			%v0 = insertelement <4 x float> undef, float %a0, i32 0
	%tid = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) #0			%tid = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) #0
	(17 lines not shown)

test/CodeGen/AMDGPU/uniform-cfg.ll

(327 lines not shown; context: endif:)
ret void		ret void
}		}

; GCN-LABEL: {{^}}divergent_inside_uniform:		; GCN-LABEL: {{^}}divergent_inside_uniform:
; GCN: s_cmp_lg_u32 s{{[0-9]+}}, 0		; GCN: s_cmp_lg_u32 s{{[0-9]+}}, 0
; GCN: s_cbranch_scc1 [[ENDIF_LABEL:[0-9_A-Za-z]+]]		; GCN: s_cbranch_scc1 [[ENDIF_LABEL:[0-9_A-Za-z]+]]
; GCN: v_cmp_gt_u32_e32 vcc, 16, v{{[0-9]+}}		; GCN: v_cmp_gt_u32_e32 vcc, 16, v{{[0-9]+}}
; GCN: s_and_saveexec_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], vcc		; GCN: s_and_saveexec_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], vcc
; GCN: ; mask branch [[ENDIF_LABEL]]		; GCN: s_cbranch_execz [[ENDIF_LABEL]]
; GCN: v_mov_b32_e32 [[ONE:v[0-9]+]], 1		; GCN: v_mov_b32_e32 [[ONE:v[0-9]+]], 1
; GCN: buffer_store_dword [[ONE]]		; GCN: buffer_store_dword [[ONE]]
; GCN: [[ENDIF_LABEL]]:		; GCN: [[ENDIF_LABEL]]:
; GCN: s_endpgm		; GCN: s_endpgm
define amdgpu_kernel void @divergent_inside_uniform(i32 addrspace(1)* %out, i32 %cond) {		define amdgpu_kernel void @divergent_inside_uniform(i32 addrspace(1)* %out, i32 %cond) {
entry:		entry:
%u_cmp = icmp eq i32 %cond, 0		%u_cmp = icmp eq i32 %cond, 0
br i1 %u_cmp, label %if, label %endif		br i1 %u_cmp, label %if, label %endif
(241 lines not shown)

test/CodeGen/AMDGPU/uniform-loop-inside-nonuniform.ll

; RUN: llc -march=amdgcn -mcpu=verde < %s | FileCheck %s		; RUN: llc -march=amdgcn -mcpu=verde < %s | FileCheck %s

; Test a simple uniform loop that lives inside non-uniform control flow.		; Test a simple uniform loop that lives inside non-uniform control flow.

; CHECK-LABEL: {{^}}test1:		; CHECK-LABEL: {{^}}test1:
; CHECK: v_cmp_ne_u32_e32 vcc, 0		; CHECK: v_cmp_ne_u32_e32 vcc, 0
; CHECK: s_and_saveexec_b64		; CHECK: s_and_saveexec_b64
; CHECK-NEXT: ; mask branch
; CHECK-NEXT: s_cbranch_execz BB{{[0-9]+_[0-9]+}}		; CHECK-NEXT: s_cbranch_execz BB{{[0-9]+_[0-9]+}}
; CHECK-NEXT: BB{{[0-9]+_[0-9]+}}: ; %loop_body.preheader		; CHECK-NEXT: ; %bb.{{[0-9]+}}: ; %loop_body.preheader

; CHECK: [[LOOP_BODY_LABEL:BB[0-9]+_[0-9]+]]:		; CHECK: [[LOOP_BODY_LABEL:BB[0-9]+_[0-9]+]]:
; CHECK: s_cbranch_vccz [[LOOP_BODY_LABEL]]		; CHECK: s_cbranch_vccz [[LOOP_BODY_LABEL]]

; CHECK: s_endpgm		; CHECK: s_endpgm
define amdgpu_ps void @test1(<8 x i32> inreg %rsrc, <2 x i32> %addr.base, i32 %y, i32 %p) {		define amdgpu_ps void @test1(<8 x i32> inreg %rsrc, <2 x i32> %addr.base, i32 %y, i32 %p) {
main_body:		main_body:
%cc = icmp eq i32 %p, 0		%cc = icmp eq i32 %p, 0
(10 lines not shown; context: loop_body:)
br i1 %lc, label %out, label %loop_body		br i1 %lc, label %out, label %loop_body

out:		out:
ret void		ret void
}		}

; CHECK-LABEL: {{^}}test2:		; CHECK-LABEL: {{^}}test2:
; CHECK: s_and_saveexec_b64		; CHECK: s_and_saveexec_b64
; CHECK-NEXT: ; mask branch
; CHECK-NEXT: s_cbranch_execz		; CHECK-NEXT: s_cbranch_execz
define amdgpu_kernel void @test2(i32 addrspace(1)* %out, i32 %a, i32 %b) {		define amdgpu_kernel void @test2(i32 addrspace(1)* %out, i32 %a, i32 %b) {
main_body:		main_body:
%tid = call i32 @llvm.amdgcn.workitem.id.x() #1		%tid = call i32 @llvm.amdgcn.workitem.id.x() #1
%cc = icmp eq i32 %tid, 0		%cc = icmp eq i32 %tid, 0
br i1 %cc, label %done1, label %if		br i1 %cc, label %done1, label %if

if:		if:
(24 lines not shown)

test/CodeGen/AMDGPU/valu-i1.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs -enable-misched -asm-verbose -disable-block-placement < %s | FileCheck -check-prefix=SI %s			; RUN: llc -march=amdgcn -verify-machineinstrs -enable-misched -asm-verbose -disable-block-placement < %s | FileCheck -check-prefix=SI %s

	declare i32 @llvm.amdgcn.workitem.id.x() nounwind readnone			declare i32 @llvm.amdgcn.workitem.id.x() nounwind readnone

	; SI-LABEL: {{^}}test_if:			; SI-LABEL: {{^}}test_if:
	; Make sure the i1 values created by the cfg structurizer pass are			; Make sure the i1 values created by the cfg structurizer pass are
	; moved using VALU instructions			; moved using VALU instructions


	; waitcnt should be inserted after exec modification			; waitcnt should be inserted after exec modification
	; SI: v_cmp_lt_i32_e32 vcc, 1,			; SI: v_cmp_lt_i32_e32 vcc, 1,
	; SI-NEXT: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, 0			; SI-NEXT: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, 0
	; SI-NEXT: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, 0			; SI-NEXT: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, 0
	; SI-NEXT: s_and_saveexec_b64 [[SAVE1:s\[[0-9]+:[0-9]+\]]], vcc			; SI-NEXT: s_and_saveexec_b64 [[SAVE1:s\[[0-9]+:[0-9]+\]]], vcc
	; SI-NEXT: s_xor_b64 [[SAVE2:s\[[0-9]+:[0-9]+\]]], exec, [[SAVE1]]			; SI-NEXT: s_xor_b64 [[SAVE2:s\[[0-9]+:[0-9]+\]]], exec, [[SAVE1]]
	; SI-NEXT: ; mask branch [[FLOW_BB:BB[0-9]+_[0-9]+]]
	; SI-NEXT: s_cbranch_execz [[FLOW_BB]]

	; SI-NEXT: BB{{[0-9]+}}_1: ; %LeafBlock3			; SI-NEXT: ; %bb.{{[0-9]+}}: ; %LeafBlock3
	; SI: s_mov_b64 s[{{[0-9]:[0-9]}}], -1			; SI: s_mov_b64 s[{{[0-9]:[0-9]}}], -1
	; SI: s_and_saveexec_b64			; SI: s_and_saveexec_b64
	; SI-NEXT: ; mask branch			; SI-NEXT: s_cbranch_execnz

	; v_mov should be after exec modification			; v_mov should be after exec modification
	; SI: [[FLOW_BB]]:			; SI: ; %bb.{{[0-9]+}}:
	; SI-NEXT: s_or_saveexec_b64 [[SAVE3:s\[[0-9]+:[0-9]+\]]], [[SAVE2]]			; SI-NEXT: s_or_saveexec_b64 [[SAVE3:s\[[0-9]+:[0-9]+\]]], [[SAVE2]]
	; SI-NEXT: s_xor_b64 exec, exec, [[SAVE3]]			; SI-NEXT: s_xor_b64 exec, exec, [[SAVE3]]
	; SI-NEXT: ; mask branch
	;			;
	define amdgpu_kernel void @test_if(i32 %b, i32 addrspace(1)* %src, i32 addrspace(1)* %dst) #1 {			define amdgpu_kernel void @test_if(i32 %b, i32 addrspace(1)* %src, i32 addrspace(1)* %dst) #1 {
	entry:			entry:
	%tid = call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone			%tid = call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone
	switch i32 %tid, label %default [			switch i32 %tid, label %default [
	i32 1, label %case1			i32 1, label %case1
	i32 2, label %case2			i32 2, label %case2
	]			]
	(23 lines not shown)

	end:			end:
	ret void			ret void
	}			}

	; SI-LABEL: {{^}}simple_test_v_if:			; SI-LABEL: {{^}}simple_test_v_if:
	; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}			; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}
	; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI-NEXT: ; mask branch [[EXIT:BB[0-9]+_[0-9]+]]			; SI-NEXT: s_cbranch_execz [[EXIT:BB[0-9]+_[0-9]+]]
	; SI-NEXT: s_cbranch_execz [[EXIT]]

	; SI-NEXT: BB{{[0-9]+_[0-9]+}}:			; SI-NEXT: ; %bb.{{[0-9]+}}:
	; SI: buffer_store_dword			; SI: buffer_store_dword

	; SI-NEXT: {{^}}[[EXIT]]:			; SI-NEXT: {{^}}[[EXIT]]:
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @simple_test_v_if(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {			define amdgpu_kernel void @simple_test_v_if(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone			%tid = call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone
	%is.0 = icmp ne i32 %tid, 0			%is.0 = icmp ne i32 %tid, 0
	br i1 %is.0, label %then, label %exit			br i1 %is.0, label %then, label %exit

	then:			then:
	%gep = getelementptr i32, i32 addrspace(1)* %dst, i32 %tid			%gep = getelementptr i32, i32 addrspace(1)* %dst, i32 %tid
	store i32 999, i32 addrspace(1)* %gep			store i32 999, i32 addrspace(1)* %gep
	br label %exit			br label %exit

	exit:			exit:
	ret void			ret void
	}			}

	; FIXME: It would be better to endpgm in the then block.			; FIXME: It would be better to endpgm in the then block.

	; SI-LABEL: {{^}}simple_test_v_if_ret_else_ret:			; SI-LABEL: {{^}}simple_test_v_if_ret_else_ret:
	; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}			; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}
	; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI-NEXT: ; mask branch [[EXIT:BB[0-9]+_[0-9]+]]			; SI-NEXT: s_cbranch_execz [[EXIT:BB[0-9]+_[0-9]+]]
	; SI-NEXT: s_cbranch_execz [[EXIT]]

	; SI-NEXT: BB{{[0-9]+_[0-9]+}}:			; SI-NEXT: ; %bb.{{[0-9]+}}:
	; SI: buffer_store_dword			; SI: buffer_store_dword

	; SI-NEXT: {{^}}[[EXIT]]:			; SI-NEXT: {{^}}[[EXIT]]:
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @simple_test_v_if_ret_else_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {			define amdgpu_kernel void @simple_test_v_if_ret_else_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%is.0 = icmp ne i32 %tid, 0			%is.0 = icmp ne i32 %tid, 0
	br i1 %is.0, label %then, label %exit			br i1 %is.0, label %then, label %exit
	(10 lines not shown)
	; Final block has more than a ret to execute. This was miscompiled			; Final block has more than a ret to execute. This was miscompiled
	; before function exit blocks were unified since the endpgm would			; before function exit blocks were unified since the endpgm would
	; terminate the then wavefront before reaching the store.			; terminate the then wavefront before reaching the store.

	; SI-LABEL: {{^}}simple_test_v_if_ret_else_code_ret:			; SI-LABEL: {{^}}simple_test_v_if_ret_else_code_ret:
	; SI: v_cmp_eq_u32_e32 vcc, 0, v{{[0-9]+}}			; SI: v_cmp_eq_u32_e32 vcc, 0, v{{[0-9]+}}
	; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]			; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]
	; SI: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]			; SI: s_cbranch_execnz [[EXIT:BB[0-9]+_[0-9]+]]

	; SI-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %exit			; SI-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %Flow
	; SI: ds_write_b32

	; SI-NEXT: {{^}}[[FLOW]]:
	; SI-NEXT: s_or_saveexec_b64			; SI-NEXT: s_or_saveexec_b64
	; SI-NEXT: s_xor_b64 exec, exec			; SI-NEXT: s_xor_b64 exec, exec
	; SI-NEXT: ; mask branch [[UNIFIED_RETURN:BB[0-9]+_[0-9]+]]			; SI-NEXT: s_cbranch_execz [[UNIFIED_RETURN:BB[0-9]+_[0-9]+]]
	; SI-NEXT: s_cbranch_execz [[UNIFIED_RETURN]]

	; SI-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %then			; SI-NEXT: ; %bb.{{[0-9]+}}: ; %then
	; SI: s_waitcnt			; SI: s_waitcnt
	; SI-NEXT: buffer_store_dword			; SI-NEXT: buffer_store_dword

	; SI-NEXT: {{^}}[[UNIFIED_RETURN]]: ; %UnifiedReturnBlock			; SI-NEXT: {{^}}[[UNIFIED_RETURN]]: ; %UnifiedReturnBlock
	; SI: s_endpgm			; SI: s_endpgm

				; SI-NEXT: {{^}}[[EXIT]]:
				; SI: ds_write_b32
	define amdgpu_kernel void @simple_test_v_if_ret_else_code_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {			define amdgpu_kernel void @simple_test_v_if_ret_else_code_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%is.0 = icmp ne i32 %tid, 0			%is.0 = icmp ne i32 %tid, 0
	br i1 %is.0, label %then, label %exit			br i1 %is.0, label %then, label %exit

	then:			then:
	%gep = getelementptr i32, i32 addrspace(1)* %dst, i32 %tid			%gep = getelementptr i32, i32 addrspace(1)* %dst, i32 %tid
	store i32 999, i32 addrspace(1)* %gep			store i32 999, i32 addrspace(1)* %gep
	ret void			ret void

	exit:			exit:
	store volatile i32 7, i32 addrspace(3)* undef			store volatile i32 7, i32 addrspace(3)* undef
	ret void			ret void
	}			}

	; SI-LABEL: {{^}}simple_test_v_loop:			; SI-LABEL: {{^}}simple_test_v_loop:
	; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}			; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}
	; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI-NEXT: ; mask branch
	; SI-NEXT: s_cbranch_execz [[LABEL_EXIT:BB[0-9]+_[0-9]+]]			; SI-NEXT: s_cbranch_execz [[LABEL_EXIT:BB[0-9]+_[0-9]+]]

	; SI: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, 0{{$}}			; SI: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, 0{{$}}

	; SI: [[LABEL_LOOP:BB[0-9]+_[0-9]+]]:			; SI: [[LABEL_LOOP:BB[0-9]+_[0-9]+]]:
	; SI: buffer_load_dword			; SI: buffer_load_dword
	; SI-DAG: buffer_store_dword			; SI-DAG: buffer_store_dword
	; SI-DAG: v_cmp_eq_u32_e32 vcc, 0x100			; SI-DAG: v_cmp_eq_u32_e32 vcc, 0x100
	(25 lines not shown)
	; SI-LABEL: {{^}}multi_vcond_loop:			; SI-LABEL: {{^}}multi_vcond_loop:

	; Load loop limit from buffer			; Load loop limit from buffer
	; Branch to exit if uniformly not taken			; Branch to exit if uniformly not taken
	; SI: ; %bb.0:			; SI: ; %bb.0:
	; SI: buffer_load_dword [[VBOUND:v[0-9]+]]			; SI: buffer_load_dword [[VBOUND:v[0-9]+]]
	; SI: v_cmp_lt_i32_e32 vcc			; SI: v_cmp_lt_i32_e32 vcc
	; SI: s_and_saveexec_b64 [[OUTER_CMP_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[OUTER_CMP_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI-NEXT: ; mask branch
	; SI-NEXT: s_cbranch_execz [[LABEL_EXIT:BB[0-9]+_[0-9]+]]			; SI-NEXT: s_cbranch_execz [[LABEL_EXIT:BB[0-9]+_[0-9]+]]

	; Initialize inner condition to false			; Initialize inner condition to false
	; SI: BB{{[0-9]+_[0-9]+}}: ; %bb10.preheader			; SI: ; %bb.{{[0-9]+}}: ; %bb10.preheader
	; SI: s_mov_b64 [[COND_STATE:s\[[0-9]+:[0-9]+\]]], 0{{$}}			; SI: s_mov_b64 [[COND_STATE:s\[[0-9]+:[0-9]+\]]], 0{{$}}

	; Clear exec bits for workitems that load -1s			; Clear exec bits for workitems that load -1s
	; SI: [[LABEL_LOOP:BB[0-9]+_[0-9]+]]:			; SI: [[LABEL_LOOP:BB[0-9]+_[0-9]+]]:
	; SI: buffer_load_dword [[B:v[0-9]+]]			; SI: buffer_load_dword [[B:v[0-9]+]]
	; SI: buffer_load_dword [[A:v[0-9]+]]			; SI: buffer_load_dword [[A:v[0-9]+]]
	; SI-DAG: v_cmp_ne_u32_e64 [[NEG1_CHECK_0:s\[[0-9]+:[0-9]+\]]], -1, [[A]]			; SI-DAG: v_cmp_ne_u32_e64 [[NEG1_CHECK_0:s\[[0-9]+:[0-9]+\]]], -1, [[A]]
	; SI-DAG: v_cmp_ne_u32_e32 [[NEG1_CHECK_1:vcc]], -1, [[B]]			; SI-DAG: v_cmp_ne_u32_e32 [[NEG1_CHECK_1:vcc]], -1, [[B]]
	; SI: s_and_b64 [[ORNEG1:s\[[0-9]+:[0-9]+\]]], [[NEG1_CHECK_1]], [[NEG1_CHECK_0]]			; SI: s_and_b64 [[ORNEG1:s\[[0-9]+:[0-9]+\]]], [[NEG1_CHECK_1]], [[NEG1_CHECK_0]]
	; SI: s_and_saveexec_b64 [[ORNEG2:s\[[0-9]+:[0-9]+\]]], [[ORNEG1]]			; SI: s_and_saveexec_b64 [[ORNEG2:s\[[0-9]+:[0-9]+\]]], [[ORNEG1]]
	; SI: s_cbranch_execz [[LABEL_FLOW:BB[0-9]+_[0-9]+]]			; SI: s_cbranch_execz [[LABEL_FLOW:BB[0-9]+_[0-9]+]]

	; SI: BB{{[0-9]+_[0-9]+}}: ; %bb20			; SI: ; %bb.{{[0-9]+}}: ; %bb20
	; SI: buffer_store_dword			; SI: buffer_store_dword

	; SI: [[LABEL_FLOW]]:			; SI: [[LABEL_FLOW]]:
	; SI-NEXT: ; in Loop: Header=[[LABEL_LOOP]]			; SI-NEXT: ; in Loop: Header=[[LABEL_LOOP]]
	; SI-NEXT: s_or_b64 exec, exec, [[ORNEG2]]			; SI-NEXT: s_or_b64 exec, exec, [[ORNEG2]]
	; SI-NEXT: s_and_b64 [[TMP1:s\[[0-9]+:[0-9]+\]]],			; SI-NEXT: s_and_b64 [[TMP1:s\[[0-9]+:[0-9]+\]]],
	; SI-NEXT: s_or_b64 [[TMP2:s\[[0-9]+:[0-9]+\]]], [[TMP1]], [[COND_STATE]]			; SI-NEXT: s_or_b64 [[TMP2:s\[[0-9]+:[0-9]+\]]], [[TMP1]], [[COND_STATE]]
	; SI-NEXT: s_mov_b64 [[COND_STATE]], [[TMP2]]			; SI-NEXT: s_mov_b64 [[COND_STATE]], [[TMP2]]
	(43 lines not shown)

test/CodeGen/AMDGPU/wave32.ll

(145 lines not shown; context: define amdgpu_kernel void @test_vop3_cmp_u32_sop_or(i32 addrspace(1)* %arg) {)
%sel = select i1 %or, i32 1, i32 2		%sel = select i1 %or, i32 1, i32 2
store i32 %sel, i32 addrspace(1)* %gep, align 4		store i32 %sel, i32 addrspace(1)* %gep, align 4
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_mask_if:		; GCN-LABEL: {{^}}test_mask_if:
; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, vcc_lo		; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, vcc_lo
; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], vcc{{$}}		; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], vcc{{$}}
; GCN: ; mask branch		; GCN: s_cbranch_execz
define amdgpu_kernel void @test_mask_if(i32 addrspace(1)* %arg) #0 {		define amdgpu_kernel void @test_mask_if(i32 addrspace(1)* %arg) #0 {
%lid = tail call i32 @llvm.amdgcn.workitem.id.x()		%lid = tail call i32 @llvm.amdgcn.workitem.id.x()
%cmp = icmp ugt i32 %lid, 10		%cmp = icmp ugt i32 %lid, 10
br i1 %cmp, label %if, label %endif		br i1 %cmp, label %if, label %endif

if:		if:
store i32 0, i32 addrspace(1)* %arg, align 4		store i32 0, i32 addrspace(1)* %arg, align 4
br label %endif		br label %endif

endif:		endif:
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_loop_with_if:		; GCN-LABEL: {{^}}test_loop_with_if:
; GFX1032: s_or_b32 s{{[0-9]+}}, vcc_lo, s{{[0-9]+}}		; GFX1032: s_or_b32 s{{[0-9]+}}, vcc_lo, s{{[0-9]+}}
; GFX1032: s_andn2_b32 exec_lo, exec_lo, s{{[0-9]+}}		; GFX1032: s_andn2_b32 exec_lo, exec_lo, s{{[0-9]+}}
; GFX1064: s_or_b64 s[{{[0-9:]+}}], vcc, s[{{[0-9:]+}}]		; GFX1064: s_or_b64 s[{{[0-9:]+}}], vcc, s[{{[0-9:]+}}]
; GFX1064: s_andn2_b64 exec, exec, s[{{[0-9:]+}}]		; GFX1064: s_andn2_b64 exec, exec, s[{{[0-9:]+}}]
; GCN: s_cbranch_execz		; GCN: s_cbranch_execz
; GCN: BB{{.*}}:		; GCN: BB{{.*}}:
; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, vcc_lo		; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, vcc_lo
; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], vcc{{$}}		; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], vcc{{$}}
; GCN: s_cbranch_execz		; GCN: s_cbranch_execz
; GCN: BB{{.*}}:		; GCN: ; %bb.{{[0-9]+}}:
; GCN: BB{{.*}}:		; GCN: BB{{.*}}:
; GFX1032: s_xor_b32 s{{[0-9]+}}, exec_lo, s{{[0-9]+}}		; GFX1032: s_xor_b32 s{{[0-9]+}}, exec_lo, s{{[0-9]+}}
; GFX1064: s_xor_b64 s[{{[0-9:]+}}], exec, s[{{[0-9:]+}}]		; GFX1064: s_xor_b64 s[{{[0-9:]+}}], exec, s[{{[0-9:]+}}]
; GCN: ; mask branch BB		; GCN: ; %bb.{{[0-9]+}}:
; GCN: BB{{.*}}:		; GCN: ; %bb.{{[0-9]+}}:
; GCN: BB{{.*}}:
; GFX1032: s_or_b32 exec_lo, exec_lo, s{{[0-9]+}}		; GFX1032: s_or_b32 exec_lo, exec_lo, s{{[0-9]+}}
; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, s{{[0-9]+}}		; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, s{{[0-9]+}}
; GFX1064: s_or_b64 exec, exec, s[{{[0-9:]+}}]		; GFX1064: s_or_b64 exec, exec, s[{{[0-9:]+}}]
; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], s[{{[0-9:]+}}]{{$}}		; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], s[{{[0-9:]+}}]{{$}}
; GCN: ; mask branch BB		; GCN: s_cbranch_execz BB
; GCN: BB{{.*}}:		; GCN: ; %bb.{{[0-9]+}}:
; GCN: BB{{.*}}:		; GCN: BB{{.*}}:
; GCN: s_endpgm		; GCN: s_endpgm
define amdgpu_kernel void @test_loop_with_if(i32 addrspace(1)* %arg) #0 {		define amdgpu_kernel void @test_loop_with_if(i32 addrspace(1)* %arg) #0 {
bb:		bb:
%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()		%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
br label %bb2		br label %bb2

bb1:		bb1:
(24 lines not shown; context: bb13:)
%tmp15 = add nsw i32 %tmp14, 1		%tmp15 = add nsw i32 %tmp14, 1
%tmp16 = icmp slt i32 %tmp14, 255		%tmp16 = icmp slt i32 %tmp14, 255
br i1 %tmp16, label %bb2, label %bb1		br i1 %tmp16, label %bb2, label %bb1
}		}

; GCN-LABEL: {{^}}test_loop_with_if_else_break:		; GCN-LABEL: {{^}}test_loop_with_if_else_break:
; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, vcc_lo		; GFX1032: s_and_saveexec_b32 s{{[0-9]+}}, vcc_lo
; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], vcc{{$}}		; GFX1064: s_and_saveexec_b64 s[{{[0-9:]+}}], vcc{{$}}
; GCN: ; mask branch
; GCN: s_cbranch_execz		; GCN: s_cbranch_execz
; GCN: BB{{.*}}:		; GCN: ; %bb.{{[0-9]+}}: ; %.preheader
; GCN: BB{{.*}}:		; GCN: BB{{.*}}:
; GFX1032: s_andn2_b32 s{{[0-9]+}}, s{{[0-9]+}}, exec_lo		; GFX1032: s_andn2_b32 s{{[0-9]+}}, s{{[0-9]+}}, exec_lo
; GFX1064: s_andn2_b64 s[{{[0-9:]+}}], s[{{[0-9:]+}}], exec		; GFX1064: s_andn2_b64 s[{{[0-9:]+}}], s[{{[0-9:]+}}], exec
; GFX1032: s_or_b32 s{{[0-9]+}}, vcc_lo, s{{[0-9]+}}		; GFX1032: s_or_b32 s{{[0-9]+}}, vcc_lo, s{{[0-9]+}}
; GFX1032: s_or_b32 s{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}		; GFX1032: s_or_b32 s{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}
; GFX1064: s_or_b64 s[{{[0-9:]+}}], vcc, s[{{[0-9:]+}}]		; GFX1064: s_or_b64 s[{{[0-9:]+}}], vcc, s[{{[0-9:]+}}]
; GFX1064: s_or_b64 s[{{[0-9:]+}}], s[{{[0-9:]+}}], s[{{[0-9:]+}}]		; GFX1064: s_or_b64 s[{{[0-9:]+}}], s[{{[0-9:]+}}], s[{{[0-9:]+}}]
; GCN: s_cbranch_execz		; GCN: s_cbranch_execz

test/CodeGen/AMDGPU/wqm.ll

	;CHECK: %IF			;CHECK: %IF
	;CHECK: image_sample			;CHECK: image_sample
	;CHECK: image_sample			;CHECK: image_sample
	;CHECK: %Flow			;CHECK: %Flow
	;CHECK-NEXT: s_or_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]],			;CHECK-NEXT: s_or_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]],
	;CHECK-NEXT: s_and_b64 exec, exec, [[ORIG]]			;CHECK-NEXT: s_and_b64 exec, exec, [[ORIG]]
	;CHECK-NEXT: s_and_b64 [[SAVED]], exec, [[SAVED]]			;CHECK-NEXT: s_and_b64 [[SAVED]], exec, [[SAVED]]
	;CHECK-NEXT: s_xor_b64 exec, exec, [[SAVED]]			;CHECK-NEXT: s_xor_b64 exec, exec, [[SAVED]]
	;CHECK-NEXT: mask branch [[END_BB:BB[0-9]+_[0-9]+]]			;CHECK-NEXT: s_cbranch_execz [[END_BB:BB[0-9]+_[0-9]+]]
	;CHECK-NEXT: s_cbranch_execz [[END_BB]]			;CHECK-NEXT: ; %bb.{{[0-9]+}}: ; %ELSE
	;CHECK-NEXT: BB{{[0-9]+_[0-9]+}}: ; %ELSE
	;CHECK: store_dword			;CHECK: store_dword
	;CHECK: [[END_BB]]: ; %END			;CHECK: [[END_BB]]: ; %END
	;CHECK: s_or_b64 exec, exec,			;CHECK: s_or_b64 exec, exec,
	;CHECK: v_mov_b32_e32 v0			;CHECK: v_mov_b32_e32 v0
	;CHECK: ; return			;CHECK: ; return
	define amdgpu_ps float @test_control_flow_1(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, i32 %c, i32 %z, float %data) {			define amdgpu_ps float @test_control_flow_1(<8 x i32> inreg %rsrc, <4 x i32> inreg %sampler, i32 %c, i32 %z, float %data) {
	main_body:			main_body:
	%cmp = icmp eq i32 %z, 0			%cmp = icmp eq i32 %z, 0

Diff 221984
