This is an archive of the discontinued LLVM Phabricator instance.

Differential D31161

[AMDGPU] New Waitcnt Insertion Pass
ClosedPublic

Authored by kanarayan on Mar 20 2017, 4:49 PM.

Details

Reviewers

t-tye
arsenm
tstellar
rampitec
kzhuravl

Summary

This pass implements the algorithm deployed by our internal compiler for inserting waitcnt instructions. The pass performs cross basic-block analysis and tracks individual registers, and provides predicted performance improvements over the current implementation.

Further improvements are forthcoming, including relaxing overly conservative assumptions about LDS access, integrating the memory model pass, and adding more targeted tests for corner cases.
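To make the idea of per-register tracking concrete, here is a minimal sketch of a score bracket for a single counter. It is not code from the patch: `CounterModel`, `issue`, `waitNeededFor`, and `applyWait` are hypothetical names, and the model assumes that events on the counter complete in issue order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified model of one wait counter (for example vmcnt).
// The names are illustrative and are not taken from the patch.
struct CounterModel {
  int32_t LB = 0; // score of the newest event known to have completed
  int32_t UB = 0; // score of the most recently issued event
  std::map<int, int32_t> RegScore; // register index -> score of pending write

  // Issue an asynchronous operation that will write register Reg.
  void issue(int Reg) { RegScore[Reg] = ++UB; }

  // Return the count to wait for before Reg may be used, or -1 if no wait
  // is needed. Assumes events on this counter complete in issue order.
  int32_t waitNeededFor(int Reg) const {
    auto It = RegScore.find(Reg);
    if (It == RegScore.end() || It->second <= LB)
      return -1; // nothing outstanding for this register
    return UB - It->second; // younger events that may stay in flight
  }

  // Record that a wait with the given count has been emitted.
  void applyWait(int32_t Count) {
    if (UB - Count > LB)
      LB = UB - Count;
  }
};
```

With three loads in flight to registers 1, 2 and 3, using register 1 needs a wait count of 2 (two younger loads may remain outstanding), whereas a pass that does not track individual registers would have to wait for a count of 0.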

Diff Detail

Event Timeline

kanarayan created this revision.Mar 20 2017, 4:49 PM

Herald added subscribers: tpr, dstuttard, tony-tye and 8 others.Mar 20 2017, 4:49 PM

t-tye added a subscriber: t-tye.Mar 22 2017, 6:40 PM

tony-tye removed a subscriber: tony-tye.Mar 22 2017, 6:47 PM

kzhuravl added reviewers: tstellar, rampitec, arsenm, t-tye, kzhuravl.Mar 23 2017, 11:14 AM

@rampitec wrote:
Please add test with s_barrier.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
75	@rampitec wrote: Should not we get these numbers from TD files for target which we already have?
1018	@rampitec wrote: For a barrier it will always insert strongest: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) regardless of what was an argument of the barrier. This also seems to completely ignore atomic fences inserted around the barrier from the library, which shall be a real source of wait argument. Note, that semantics of needWaitcntBeforeBarrier() is not that we always need to insert wait with barrier, but that we may need to insert it. Also note that existing pass does not seem to do it for a barrier.
1614	@rampitec wrote: Need to check for XNACK support.

kzhuravl added inline comments.Mar 23 2017, 11:23 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1018	Atomics are currently handled in a separate pass, which we determined to be conservatively correct. We plan to integrate atomics into waitcnt insertion but later and in a separate patch. The needWaitcntBeforeBarrier tells you whether you need a waitcnt before the barrier or not. For >=GFX9, waitcnt is automatically inserted before the barrier, so we do not need to generate it, and needWaitcntBeforeBarrier returns false for >=GFX9. Also note that existing pass does not seem to do it for a barrier. https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AMDGPU/SIInsertWaits.cpp#L633

In D31161#709066, @rampitec wrote:

Please add test with s_barrier.

https://github.com/llvm-mirror/llvm/blob/master/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll

In D31161#709081, @kzhuravl wrote:

In D31161#709066, @rampitec wrote:

Please add test with s_barrier.

https://github.com/llvm-mirror/llvm/blob/master/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll

How about waitcnt operands in that test?

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1018	Barrier itself does not need fence. OpenCL barrier needs fence, but these are generated in the library.

This addresses two issues:

SQ_MAX_PGM_VGPRS and the other constants map LLVM registers onto a data structure internal to this algorithm; for the register mapping used by the algorithm, see the comments before the enum. The constants are maximum values across all targets. Ideally the pass would use dynamically sized arrays that fit the particular target. As an interim step, I have changed these constants to enum values and asserted, in the main entry to this pass, that no target has a larger register file. (In response to comments from Tony and Konstantin) I have also updated the getRegInterval call. Please note that the previous version was identical to the one used by the old pass, except for the adjustments needed for the mapping used by this algorithm. In a next step, I will remove the assert and the associated code.
Barriers no longer force a zero waitcnt. For GFX9 and above, a barrier needs no additional waitcnt. For earlier targets, waitcnts are added only if needed. The following tests already test barrier:
LLVM :: CodeGen/AMDGPU/addrspacecast.ll
LLVM :: CodeGen/AMDGPU/array-ptr-calc-i32.ll
LLVM :: CodeGen/AMDGPU/ds-negative-offset-addressing-mode-loop.ll
LLVM :: CodeGen/AMDGPU/ds_read2.ll
LLVM :: CodeGen/AMDGPU/indirect-private-64.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll
LLVM :: CodeGen/AMDGPU/local-memory.amdgcn.ll
LLVM :: CodeGen/AMDGPU/merge-stores.ll
LLVM :: CodeGen/AMDGPU/schedule-vs-if-nested-loop-failure.ll
LLVM :: CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll
LLVM :: CodeGen/AMDGPU/store-barrier.ll
LLVM :: CodeGen/AMDGPU/wait.ll
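The register mapping mentioned in the first item can be sketched as follows. The enum values are the ones in the RegisterMapping enum of the patch; the helper functions `vgprSlot`, `ldsSlot`, and `sgprSlot` are hypothetical and exist only to show how a register number could be turned into a slot of the scoring tables.

```cpp
#include <cassert>

// Values copied from the RegisterMapping enum in SIInsertWaitcnts.cpp.
enum RegisterMapping {
  SQ_MAX_PGM_VGPRS = 256, // maximum programmable VGPRs across all targets
  SQ_MAX_PGM_SGPRS = 256, // maximum programmable SGPRs across all targets
  NUM_EXTRA_VGPRS = 1,    // one reserved VGPR-like slot
  EXTRA_VGPR_LDS = 0,     // index of the LDS placeholder among the extras
  NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // where SGPRs start
};

// Hypothetical helpers (not in the patch) that compute scoreboard slots.
int vgprSlot(int N) {
  assert(N >= 0 && N < SQ_MAX_PGM_VGPRS);
  return N; // real VGPRs occupy slots 0 .. SQ_MAX_PGM_VGPRS-1
}

int ldsSlot() {
  return SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS; // extra VGPR-like slot
}

int sgprSlot(int N) {
  assert(N >= 0 && N < SQ_MAX_PGM_SGPRS);
  return NUM_ALL_VGPRS + N; // SGPRs follow all VGPR-like slots
}
```

Under this layout v5 maps to slot 5, the LDS placeholder to slot 256, and s0 to slot 257, which is why the pass asserts that no target exceeds the maxima baked into the enum.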

There is also a CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll

This patch does not yet address the XNACK changes in https://reviews.llvm.org/D30302

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1006	Not needed anymore.

rampitec added inline comments.Mar 24 2017, 3:08 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
819	I believe there is no SI_RETURN anymore, it is renamed.

Rename SI_RETURN to SI_RETURN_TO_EPILOG to reflect recent change.
Condition the nop insertion to break soft clauses on isXNACKEnabled(). Add a note to also condition this code on hasSoftClauses when that code is put back.
Fix tests added since last patch to reflect the new pass.
Remove experimental code.

Fix the last patch that inadvertently deleted "amdgpu_kernel" added recently.

Your pass does not take into account encoding changes for gfx9.
New pass name is confusing.
Make a note somewhere that the new pass is on by default, and you can switch it off by such and such option.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1070–1072	Use helper functions from AMDGPUBaseInfo.h (they also have logic for gfx9 in it)
1099–1100	Use helper functions from AMDGPUBaseInfo.h (they also have logic for gfx9 in it)

This revision now requires changes to proceed.Mar 28 2017, 10:13 AM

Use the architecture neutral interfaces to decode wait counts.
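As a rough illustration of why hard-coded masks break on gfx9, the sketch below packs and unpacks the three counters of an s_waitcnt immediate. The bit positions are stated from memory and should be treated as an assumption: vmcnt in bits [3:0], expcnt in [6:4], lgkmcnt in [11:8], with gfx9 adding two high vmcnt bits in [15:14]. The names `Waitcnt`, `encodeWaitcnt`, and `decodeWaitcnt` here are local to the sketch; the real pass should use the helpers in AMDGPUBaseInfo.h, as the reviewer asks.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative container for the three counters of one s_waitcnt.
struct Waitcnt {
  unsigned Vmcnt;
  unsigned Expcnt;
  unsigned Lgkmcnt;
};

// Assumed layout: vmcnt [3:0], expcnt [6:4], lgkmcnt [11:8];
// on gfx9 the upper two vmcnt bits live in [15:14].
unsigned encodeWaitcnt(const Waitcnt &W, bool IsGfx9) {
  unsigned Enc = (W.Vmcnt & 0xFu) | ((W.Expcnt & 0x7u) << 4) |
                 ((W.Lgkmcnt & 0xFu) << 8);
  if (IsGfx9)
    Enc |= ((W.Vmcnt >> 4) & 0x3u) << 14;
  return Enc;
}

Waitcnt decodeWaitcnt(unsigned Enc, bool IsGfx9) {
  Waitcnt W;
  W.Vmcnt = Enc & 0xFu;
  if (IsGfx9)
    W.Vmcnt |= ((Enc >> 14) & 0x3u) << 4; // splice in the high vmcnt bits
  W.Expcnt = (Enc >> 4) & 0x7u;
  W.Lgkmcnt = (Enc >> 8) & 0xFu;
  return W;
}
```

A decoder that only looks at bits [3:0] would read a gfx9 vmcnt of 40 as 8, which is the kind of error the architecture-neutral interfaces avoid.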

In response to the other comments, please note:

The old pass will be removed shortly. I agree the pass names are close enough to cause some confusion in the interim, but the names are apt.
Please refer to the comment here below (from AMDGPUTargetMachine.cpp) that indicates the new pass is the default and with -enable-si-insert-waitcnts=0 one can revert to the old pass.

// Option to enable new waitcnt insertion pass.
static cl::opt<bool> EnableSIInsertWaitcntsPass(
  "enable-si-insert-waitcnts",
  cl::desc("Use new waitcnt insertion pass"),
  cl::init(true));

arsenm added inline comments.Mar 28 2017, 2:39 PM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
3	Extra blank line
89–92	This should be replaced with an iterator range or simply removed
511–516	Can't you just get the ::data operand and check if it is a use or def? Same for the other places checking getAtomicNoRetOp
761–763	C++ style comments
780	nullptr
786	Linemapping->LineMapping
787	MI.isDebugValue()
930	I'm concerned by relying on the memoperands since it's possible they were dropped. The uses LDS memory check could at least be factored into a predicate function
969–972	Ditto
1141–1147	This should check the gds operand, not the mem operands. This also needs MIR tests since we don't emit GDS operations now
1194	getNamedOperand
1609–1610	Braces around the block and addImm on next line
1688	Use BuildMI rather than the low level instruction creation APIs

In D31161#709509, @rampitec wrote:

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

these tests must go downstream however (separate change). we do not have fence implementation upstreamed yet.

In D31161#713546, @kzhuravl wrote:

In D31161#709509, @rampitec wrote:

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

these tests must go downstream however (separate change). we do not have fence implementation upstreamed yet.

It can be in the internal repo, but we need these tests to be sure we are producing right fences for OpenCL barriers.

In D31161#713550, @rampitec wrote:

In D31161#713546, @kzhuravl wrote:

In D31161#709509, @rampitec wrote:

Test with barrier surrounded with fences is needed. All relevant combinations of fences needs to be checked and pattern shall check that only needed counters are used, wait is produced, and that is only one wait. Please refer to OpenCL barrier() implementation for the code to check.

these tests must go downstream however (separate change). we do not have fence implementation upstreamed yet.

It can be in the internal repo, but we need these tests to be sure we are producing right fences for OpenCL barriers.

that is exactly what i meant by my previous comment (tests should go to amd-common branch).

In response to review comments from arsenm, the following updates/responses:

The following updates to the code:

Removed Extra blank line.
Removed macro defining iterator on enum and replace it with inline code.
Replace with C++ style comments.
Use nullptr.
Replace Linemapping with LineMapping.
Use MI.isDebugValue()
Use getNamedOperand
Braces around the block (but clang-format likes addImm on same line).

The following will be addressed in a subsequent put-back.

LDS access check & GDS tests: The code can be simplified further as suggested. The generated code is overly conservative but correct.
Uses of getAtomicNoRetOp: The code is looking for atomic operations that do not return. The atomic operations that do return are covered by the code that looks for stores. Again, this is conservatively correct as written, but needs to be fixed.

CreateMachineInstr - BuildMI was unusable in this context. Besides, X86 and some other code written for our target also use CreateMachineInstr.

The update includes the following after rebasing:

Re-organize the getRegInterval as follows:

Support the registers covered under isSGPRReg that are allocatable.
Assert that the register numbers are within bounds.

Check the GDS bit and avoid the loop on the memory operands. For DS operations, additionally check they are either load or a store.

Fix the LGKM bit for ds_bpermute, ds_permute operations that do not access DS.

Add comments on TODO items to the code. The following are some of the other items:

We currently list all !VGPR registers under isSGPRReg, which causes confusion.
There are waitcnt flags/bits which do not appear to be consistently set (example: ds_swizzle). I need to verify the flag bits are set properly and then could use the bits in the new pass.

clang-format

t-tye added inline comments.Apr 5 2017, 7:57 PM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
424	Could M0, EXEC, etc. also potentially be a source or dest for SMEM and VMEM load/store? Perhaps add an assert to ensure they never happen.

arsenm added inline comments.Apr 5 2017, 8:00 PM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
424	Those can't even be encoded for VMEM. M0 isn't in an allocatable class, and exec is reserved so neither one will appear there either. They are also disallowed by the operand constraints so will be a verifier error

According to the ISA shown for a barrier with global and local memory fences, this pass produces incorrect code. Please fix.

This revision now requires changes to proceed.Apr 7 2017, 6:48 PM

When applied to current trunk, this code can get into an infinite loop. Please see the sample LLVM IR at https://paste.debian.net/926533/

When compiled with llc -march=amdgcn -mcpu=bonaire, the new pass gets into an infinite loop. When compiled for Tonga, no infinite loop is encountered.

Fix the issue related to barrier.

Fix the looping issue found with the GL input. The bug only triggers for some targets under some conditions. This is due to an earlier fix for a VCCZ bug; that code needs to go away once the scheduler is fixed to handle this scenario. I need to add a lit test.

This update includes all the fixes so far, but keeps the old pass the default. We would like to get broader test coverage under the option first, and then add the tests and turn the new pass on by default.

LGTM

rampitec accepted this revision.Apr 11 2017, 9:49 PM

rampitec resigned from this revision.Apr 24 2017, 11:37 AM

This revision now requires changes to proceed.Apr 24 2017, 11:37 AM

kzhuravl resigned from this revision.Apr 24 2017, 12:33 PM

This revision is now accepted and ready to land.Apr 24 2017, 12:33 PM

arsenm closed this revision.Jan 27 2020, 12:10 PM

arsenm added inline comments.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
569–580	This can be deleted

Herald added subscribers: kerbowa, jfb, jvesely.Jan 27 2020, 12:10 PM

foad mentioned this in D69661: [AMDGPU] Fix vccz after v_readlane/v_readfirstlane to vcc_lo/hi.Jan 28 2020, 2:54 AM

Revision Contents

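Path                                                                  Size
lib/Target/AMDGPU/AMDGPU.h                                            4 lines
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp                             12 lines
lib/Target/AMDGPU/CMakeLists.txt                                      1 line
lib/Target/AMDGPU/DSInstructions.td                                   2 lines
lib/Target/AMDGPU/SIInsertWaitcnts.cpp                                1874 lines
test/CodeGen/AMDGPU/basic-branch.ll                                   2 lines
test/CodeGen/AMDGPU/branch-condition-and.ll                           3 lines
test/CodeGen/AMDGPU/branch-relaxation.ll                              2 lines
test/CodeGen/AMDGPU/control-flow-fastregalloc.ll                      11 lines
test/CodeGen/AMDGPU/indirect-addressing-si.ll                         9 lines
test/CodeGen/AMDGPU/infinite-loop.ll                                  2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.format.ll                2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.ll                       2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.ds.bpermute.ll                        3 lines
test/CodeGen/AMDGPU/llvm.amdgcn.ds.permute.ll                         2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.ds.swizzle.ll                         1 line
test/CodeGen/AMDGPU/llvm.amdgcn.image.ll                              2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.inv.ll                       2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.inv.vol.ll                   2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.wb.ll                        2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.wb.vol.ll                    2 lines
test/CodeGen/AMDGPU/llvm.amdgcn.s.waitcnt.ll                          4 lines
test/CodeGen/AMDGPU/multi-divergent-exit-region.ll                    1 line
test/CodeGen/AMDGPU/ret_jump.ll                                       2 lines
test/CodeGen/AMDGPU/si-lower-control-flow-unreachable-block.ll        3 lines
test/CodeGen/AMDGPU/smrd-vccz-bug.ll                                  2 lines
test/CodeGen/AMDGPU/spill-m0.ll                                       2 lines
test/CodeGen/AMDGPU/valu-i1.ll                                        8 lines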

Diff 94310

lib/Target/AMDGPU/AMDGPU.h

[... 41 lines not shown ...]
  FunctionPass *createSILowerI1CopiesPass();
  FunctionPass *createSIShrinkInstructionsPass();
  FunctionPass *createSILoadStoreOptimizerPass(TargetMachine &tm);
  FunctionPass *createSIWholeQuadModePass();
  FunctionPass *createSIFixControlFlowLiveIntervalsPass();
  FunctionPass *createSIFixSGPRCopiesPass();
  FunctionPass *createSIDebuggerInsertNopsPass();
  FunctionPass *createSIInsertWaitsPass();
+ FunctionPass *createSIInsertWaitcntsPass();
  FunctionPass *createAMDGPUCodeGenPreparePass(const GCNTargetMachine *TM = nullptr);

  ModulePass *createAMDGPUAnnotateKernelFeaturesPass(const TargetMachine *TM = nullptr);
  void initializeAMDGPUAnnotateKernelFeaturesPass(PassRegistry &);
  extern char &AMDGPUAnnotateKernelFeaturesID;

  ModulePass *createAMDGPULowerIntrinsicsPass();
  void initializeAMDGPULowerIntrinsicsPass(PassRegistry &);
[... 61 lines not shown ...]
  extern char &SIAnnotateControlFlowPassID;

  void initializeSIDebuggerInsertNopsPass(PassRegistry&);
  extern char &SIDebuggerInsertNopsID;

  void initializeSIInsertWaitsPass(PassRegistry&);
  extern char &SIInsertWaitsID;

+ void initializeSIInsertWaitcntsPass(PassRegistry&);
+ extern char &SIInsertWaitcntsID;

  void initializeAMDGPUUnifyDivergentExitNodesPass(PassRegistry&);
  extern char &AMDGPUUnifyDivergentExitNodesID;

  ImmutablePass *createAMDGPUAAWrapperPass();
  void initializeAMDGPUAAWrapperPassPass(PassRegistry&);

  Target &getTheAMDGPUTarget();
  Target &getTheGCNTarget();
[... 69 lines not shown ...]

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[... 106 lines not shown ...] static cl::opt<bool> EnableSDWAPeephole(
    cl::desc("Enable SDWA peepholer"),
    cl::init(false));

  // Enable address space based alias analysis
  static cl::opt<bool> EnableAMDGPUAliasAnalysis("enable-amdgpu-aa", cl::Hidden,
    cl::desc("Enable AMDGPU Alias Analysis"),
    cl::init(true));

+ // Option to enable new waitcnt insertion pass.
+ static cl::opt<bool> EnableSIInsertWaitcntsPass(
+   "enable-si-insert-waitcnts",
+   cl::desc("Use new waitcnt insertion pass"),
+   cl::init(true));

  extern "C" void LLVMInitializeAMDGPUTarget() {
    // Register the target
    RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());
    RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());

    PassRegistry *PR = PassRegistry::getPassRegistry();
    initializeSILowerI1CopiesPass(*PR);
    initializeSIFixSGPRCopiesPass(*PR);
    initializeSIFixVGPRCopiesPass(*PR);
    initializeSIFoldOperandsPass(*PR);
    initializeSIPeepholeSDWAPass(*PR);
    initializeSIShrinkInstructionsPass(*PR);
    initializeSIFixControlFlowLiveIntervalsPass(*PR);
    initializeSILoadStoreOptimizerPass(*PR);
    initializeAMDGPUAnnotateKernelFeaturesPass(*PR);
    initializeAMDGPUAnnotateUniformValuesPass(*PR);
    initializeAMDGPULowerIntrinsicsPass(*PR);
    initializeAMDGPUPromoteAllocaPass(*PR);
    initializeAMDGPUCodeGenPreparePass(*PR);
    initializeAMDGPUUnifyMetadataPass(*PR);
    initializeSIAnnotateControlFlowPass(*PR);
    initializeSIInsertWaitsPass(*PR);
+   initializeSIInsertWaitcntsPass(*PR);
    initializeSIWholeQuadModePass(*PR);
    initializeSILowerControlFlowPass(*PR);
    initializeSIInsertSkipsPass(*PR);
    initializeSIDebuggerInsertNopsPass(*PR);
    initializeSIOptimizeExecMaskingPass(*PR);
    initializeAMDGPUUnifyDivergentExitNodesPass(*PR);
    initializeAMDGPUAAWrapperPassPass(*PR);
  }
[... 660 lines not shown ...] void GCNPassConfig::addPreEmitPass() {
    // are multiple scheduling regions in a basic block, the regions are scheduled
    // bottom up, so when we begin to schedule a region we don't know what
    // instructions were emitted directly before it.
    //
    // Here we add a stand-alone hazard recognizer pass which can handle all
    // cases.
    addPass(&PostRAHazardRecognizerID);

+   if (EnableSIInsertWaitcntsPass)
+     addPass(createSIInsertWaitcntsPass());
+   else
      addPass(createSIInsertWaitsPass());
    addPass(createSIShrinkInstructionsPass());
    addPass(&SIInsertSkipsPassID);
    addPass(createSIDebuggerInsertNopsPass());
    addPass(&BranchRelaxationPassID);
  }

  TargetPassConfig *GCNTargetMachine::createPassConfig(PassManagerBase &PM) {
    return new GCNPassConfig(this, PM);
  }

lib/Target/AMDGPU/CMakeLists.txt

[... 76 lines not shown ...] add_llvm_target(AMDGPUCodeGen
  SIDebuggerInsertNops.cpp
  SIFixControlFlowLiveIntervals.cpp
  SIFixSGPRCopies.cpp
  SIFixVGPRCopies.cpp
  SIFoldOperands.cpp
  SIFrameLowering.cpp
  SIInsertSkips.cpp
  SIInsertWaits.cpp
+ SIInsertWaitcnts.cpp
  SIInstrInfo.cpp
  SIISelLowering.cpp
  SILoadStoreOptimizer.cpp
  SILowerControlFlow.cpp
  SILowerI1Copies.cpp
  SIMachineFunctionInfo.cpp
  SIMachineScheduler.cpp
  SIOptimizeExecMasking.cpp
[... 17 lines not shown ...]

lib/Target/AMDGPU/DSInstructions.td

[... 239 lines not shown ...]
  class DS_1A1D_PERMUTE <string opName, SDPatternOperator node = null_frag>
  : DS_Pseudo<opName,
  (outs VGPR_32:$vdst),
  (ins VGPR_32:$addr, VGPR_32:$data0, offset:$offset),
  "$vdst, $addr, $data0$offset",
  [(set i32:$vdst,
  (node (DS1Addr1Offset i32:$addr, i16:$offset), i32:$data0))] > {

+ let LGKM_CNT = 0;

  let mayLoad = 0;
  let mayStore = 0;
  let isConvergent = 1;

  let has_data1 = 0;
  let has_gds = 0;
  }

[... 687 lines not shown ...]

lib/Target/AMDGPU/SIInsertWaitcnts.cpp

This file was added.

				//===-- SIInsertWaitcnts.cpp - Insert Wait Instructions --------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// \brief Insert wait instructions for memory reads and writes.
				///
				/// Memory reads and writes are issued asynchronously, so we need to insert
				/// S_WAITCNT instructions when we want to access any of their results or
				/// overwrite any register that's used asynchronously.
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUSubtarget.h"
				#include "SIDefines.h"
				#include "SIInstrInfo.h"
				#include "SIMachineFunctionInfo.h"
				#include "Utils/AMDGPUBaseInfo.h"
				#include "llvm/ADT/PostOrderIterator.h"
				#include "llvm/CodeGen/MachineFunction.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"

				#define DEBUG_TYPE "si-insert-waitcnts"

				using namespace llvm;

				namespace {

				// Class of object that encapsulates latest instruction counter score
				// associated with the operand. Used for determining whether
				// s_waitcnt instruction needs to be emitted.

				#define CNT_MASK(t) (1u << (t))

				enum InstCounterType { VM_CNT = 0, LGKM_CNT, EXP_CNT, NUM_INST_CNTS };

				typedef std::pair<signed, signed> RegInterval;

				struct {
				int32_t VmcntMax;
				int32_t ExpcntMax;
				int32_t LgkmcntMax;
				int32_t NumVGPRsMax;
				int32_t NumSGPRsMax;
				} HardwareLimits;

				struct {
				unsigned VGPR0;
				unsigned VGPRL;
				unsigned SGPR0;
				unsigned SGPRL;
				} RegisterEncoding;

				enum WaitEventType {
				VMEM_ACCESS, // vector-memory read & write
				LDS_ACCESS, // lds read & write
				GDS_ACCESS, // gds read & write
				SQ_MESSAGE, // send message
				SMEM_ACCESS, // scalar-memory read & write
				EXP_GPR_LOCK, // export holding on its data src
				GDS_GPR_LOCK, // GDS holding on its data and addr src
				EXP_POS_ACCESS, // write to export position
				EXP_PARAM_ACCESS, // write to export parameter
				VMW_GPR_LOCK, // vector-memory write holding on its data src
				NUM_WAIT_EVENTS,
				};

				// The mapping is:
				// 0 .. SQ_MAX_PGM_VGPRS-1 real VGPRs
				// SQ_MAX_PGM_VGPRS .. NUM_ALL_VGPRS-1 extra VGPR-like slots
				// NUM_ALL_VGPRS .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs
				// We reserve a fixed number of VGPR slots in the scoring tables for
				// special tokens like SCMEM_LDS (needed for buffer load to LDS).
				enum RegisterMapping {
				SQ_MAX_PGM_VGPRS = 256, // Maximum programmable VGPRs across all targets.
				SQ_MAX_PGM_SGPRS = 256, // Maximum programmable SGPRs across all targets.
				NUM_EXTRA_VGPRS = 1, // A reserved slot for DS.
				EXTRA_VGPR_LDS = 0, // This is a placeholder the Shader algorithm uses.
				NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // Where SGPR starts.
				};

				#define ForAllWaitEventType(w) \
				for (enum WaitEventType w = (enum WaitEventType)0; \
				(w) < (enum WaitEventType)NUM_WAIT_EVENTS; \
				(w) = (enum WaitEventType)((w) + 1))

				// This is a per-basic-block object that maintains current score brackets
				// of each wait-counter, and a per-register scoreboard for each wait-counter.
				// We also maintain the latest score for every event type that can change the
				// waitcnt in order to know if there are multiple types of events within
				// the brackets. When multiple types of event happen in the bracket,
				// wait-count may get decreased out of order, therefore we need to put in
				// "s_waitcnt 0" before use.
				class BlockWaitcntBrackets {
				public:
				static int32_t getWaitCountMax(InstCounterType T) {
				switch (T) {
				case VM_CNT:
				return HardwareLimits.VmcntMax;
				case LGKM_CNT:
				return HardwareLimits.LgkmcntMax;
				case EXP_CNT:
				return HardwareLimits.ExpcntMax;
				default:
				break;
				}
				return 0;
				};

				void setScoreLB(InstCounterType T, int32_t Val) {
				assert(T < NUM_INST_CNTS);
				if (T >= NUM_INST_CNTS)
				return;
				ScoreLBs[T] = Val;
				};

				void setScoreUB(InstCounterType T, int32_t Val) {
				assert(T < NUM_INST_CNTS);
				if (T >= NUM_INST_CNTS)
				return;
				ScoreUBs[T] = Val;
				if (T == EXP_CNT) {
				int32_t UB = (int)(ScoreUBs[T] - getWaitCountMax(EXP_CNT));
				if (ScoreLBs[T] < UB)
				ScoreLBs[T] = UB;
				}
				};

				int32_t getScoreLB(InstCounterType T) {
				assert(T < NUM_INST_CNTS);
				if (T >= NUM_INST_CNTS)
				return 0;
				return ScoreLBs[T];
				};

				int32_t getScoreUB(InstCounterType T) {
				assert(T < NUM_INST_CNTS);
				if (T >= NUM_INST_CNTS)
				return 0;
				return ScoreUBs[T];
				};

				// Mapping from event to counter.
				InstCounterType eventCounter(WaitEventType E) {
				switch (E) {
				case VMEM_ACCESS:
				return VM_CNT;
				case LDS_ACCESS:
				case GDS_ACCESS:
				case SQ_MESSAGE:
				case SMEM_ACCESS:
				return LGKM_CNT;
				case EXP_GPR_LOCK:
				case GDS_GPR_LOCK:
				case VMW_GPR_LOCK:
				case EXP_POS_ACCESS:
				case EXP_PARAM_ACCESS:
				return EXP_CNT;
				default:
				llvm_unreachable("unhandled event type");
				}
				return NUM_INST_CNTS;
				}

				void setRegScore(int GprNo, InstCounterType T, int32_t Val) {
				if (GprNo < NUM_ALL_VGPRS) {
				if (GprNo > VgprUB) {
				VgprUB = GprNo;
				}
				VgprScores[T][GprNo] = Val;
				} else {
				assert(T == LGKM_CNT);
				if (GprNo - NUM_ALL_VGPRS > SgprUB) {
				SgprUB = GprNo - NUM_ALL_VGPRS;
				}
				SgprScores[GprNo - NUM_ALL_VGPRS] = Val;
				}
				}

				int32_t getRegScore(int GprNo, InstCounterType T) {
				if (GprNo < NUM_ALL_VGPRS) {
				return VgprScores[T][GprNo];
				}
				return SgprScores[GprNo - NUM_ALL_VGPRS];
				}

				void clear() {
				memset(ScoreLBs, 0, sizeof(ScoreLBs));
				memset(ScoreUBs, 0, sizeof(ScoreUBs));
				memset(EventUBs, 0, sizeof(EventUBs));
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				memset(VgprScores[T], 0, sizeof(VgprScores[T]));
				}
				memset(SgprScores, 0, sizeof(SgprScores));
				}

				RegInterval getRegInterval(const MachineInstr *MI, const SIInstrInfo *TII,
				const MachineRegisterInfo *MRI,
				const SIRegisterInfo *TRI, unsigned OpNo,
				bool Def) const;

				void setExpScore(const MachineInstr *MI, const SIInstrInfo *TII,
				const SIRegisterInfo *TRI, const MachineRegisterInfo *MRI,
				unsigned OpNo, int32_t Val);

				void setWaitAtBeginning() { WaitAtBeginning = true; }
				void clearWaitAtBeginning() { WaitAtBeginning = false; }
				bool getWaitAtBeginning() const { return WaitAtBeginning; }
				void setEventUB(enum WaitEventType W, int32_t Val) { EventUBs[W] = Val; }
				int32_t getMaxVGPR() const { return VgprUB; }
				int32_t getMaxSGPR() const { return SgprUB; }
				int32_t getEventUB(enum WaitEventType W) const {
				assert(W < NUM_WAIT_EVENTS);
				return EventUBs[W];
				}
				bool counterOutOfOrder(InstCounterType T);
				unsigned int updateByWait(InstCounterType T, int ScoreToWait);
				void updateByEvent(const SIInstrInfo *TII, const SIRegisterInfo *TRI,
				const MachineRegisterInfo *MRI, WaitEventType E,
				MachineInstr &MI);

				BlockWaitcntBrackets()
				: WaitAtBeginning(false), RevisitLoop(false), ValidLoop(false),
				MixedExpTypes(false), LoopRegion(NULL), PostOrder(0), Waitcnt(NULL),
				VgprUB(0), SgprUB(0) {
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				memset(VgprScores[T], 0, sizeof(VgprScores[T]));
				}
				}
				~BlockWaitcntBrackets(){};

				bool hasPendingSMEM() const {
				return (EventUBs[SMEM_ACCESS] > ScoreLBs[LGKM_CNT] &&
				EventUBs[SMEM_ACCESS] <= ScoreUBs[LGKM_CNT]);
				}

				bool hasPendingFlat() const {
				return ((LastFlat[LGKM_CNT] > ScoreLBs[LGKM_CNT] &&
				LastFlat[LGKM_CNT] <= ScoreUBs[LGKM_CNT]) ||
				(LastFlat[VM_CNT] > ScoreLBs[VM_CNT] &&
				LastFlat[VM_CNT] <= ScoreUBs[VM_CNT]));
				}

				void setPendingFlat() {
				LastFlat[VM_CNT] = ScoreUBs[VM_CNT];
				LastFlat[LGKM_CNT] = ScoreUBs[LGKM_CNT];
				}

				int pendingFlat(InstCounterType Ct) const { return LastFlat[Ct]; }

				void setLastFlat(InstCounterType Ct, int Val) { LastFlat[Ct] = Val; }

				bool getRevisitLoop() const { return RevisitLoop; }
				void setRevisitLoop(bool RevisitLoopIn) { RevisitLoop = RevisitLoopIn; }

				void setPostOrder(int32_t PostOrderIn) { PostOrder = PostOrderIn; }
				int32_t getPostOrder() const { return PostOrder; }

				void setWaitcnt(MachineInstr *WaitcntIn) { Waitcnt = WaitcntIn; }
				void clearWaitcnt() { Waitcnt = NULL; }
				MachineInstr *getWaitcnt() const { return Waitcnt; }

				bool mixedExpTypes() const { return MixedExpTypes; }
				void setMixedExpTypes(bool MixedExpTypesIn) {
				MixedExpTypes = MixedExpTypesIn;
				}

				void print(raw_ostream &);
				void dump() { print(dbgs()); }

				private:
				bool WaitAtBeginning;
				bool RevisitLoop;
				bool ValidLoop;
				bool MixedExpTypes;
				MachineLoop *LoopRegion;
				int32_t PostOrder;
				MachineInstr *Waitcnt;
				int32_t ScoreLBs[NUM_INST_CNTS] = {0};
				int32_t ScoreUBs[NUM_INST_CNTS] = {0};
				int32_t EventUBs[NUM_WAIT_EVENTS] = {0};
				// Remember the last flat memory operation.
				int32_t LastFlat[NUM_INST_CNTS] = {0};
				// wait_cnt scores for every vgpr.
				// Keep track of the VgprUB and SgprUB to make merge at join efficient.
				int32_t VgprUB;
				int32_t SgprUB;
				int32_t VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS];
				// Wait cnt scores for every sgpr, only lgkmcnt is relevant.
				int32_t SgprScores[SQ_MAX_PGM_SGPRS] = {0};
				};
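
The score-bracket idea described in the class comment can be sketched in isolation. This is a minimal illustration, not the pass's code: `ScoreBracket`, `issue`, `isPending`, `waitValueFor`, and `waited` are hypothetical names, and the sketch assumes events on a single counter complete in issue order, so waiting until at most `UB - Score` events remain outstanding is enough for the event with that score.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of one wait counter's score bracket.
// Every issued event gets a monotonically increasing score. Events with
// a score <= LB are known to be complete; UB is the newest issued score.
struct ScoreBracket {
  int32_t LB = 0;
  int32_t UB = 0;

  // Issue a new event on this counter and return its score.
  int32_t issue() { return ++UB; }

  // True if the event with this score may still be outstanding.
  bool isPending(int32_t Score) const { return Score > LB && Score <= UB; }

  // Counter value to wait for so that the event with this score is done,
  // assuming in-order completion (an assumption of this sketch).
  int32_t waitValueFor(int32_t Score) const {
    assert(isPending(Score) && "score must lie inside the bracket");
    return UB - Score;
  }

  // Record that a wait covering this score has been emitted.
  void waited(int32_t Score) {
    if (Score > LB)
      LB = Score;
  }
};
```

With three events issued, a use of the first event's result would wait for a counter value of 2, leaving the two younger events in flight.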

				// This is a per-loop-region object that records waitcnt status at the end of
				// loop footer from the previous iteration. We also maintain an iteration
				// count to track the number of times the loop has been visited. When it
				// doesn't converge naturally, we force convergence by inserting s_waitcnt 0
				// at the end of the loop footer.
				class LoopWaitcntData {
				public:
				void incIterCnt() { IterCnt++; }
				void resetIterCnt() { IterCnt = 0; }
				int32_t getIterCnt() { return IterCnt; }

				LoopWaitcntData() : LfWaitcnt(NULL), IterCnt(0) {}
				~LoopWaitcntData(){};

				void setWaitcnt(MachineInstr *WaitcntIn) { LfWaitcnt = WaitcntIn; }
				MachineInstr *getWaitcnt() const { return LfWaitcnt; }

				void print() {
				DEBUG(dbgs() << " iteration " << IterCnt << '\n';);
				return;
				}

				private:
				// s_waitcnt added at the end of loop footer to stabilize wait scores
				// at the end of the loop footer.
				MachineInstr *LfWaitcnt;
				// Number of iterations the loop has been visited, not including the initial
				// walk over.
				int32_t IterCnt;
				};

				class SIInsertWaitcnts : public MachineFunctionPass {

				private:
				const SISubtarget *ST;
				const SIInstrInfo *TII;
				const SIRegisterInfo *TRI;
				const MachineRegisterInfo *MRI;
				const MachineLoopInfo *MLI;
				AMDGPU::IsaInfo::IsaVersion IV;
				AMDGPUAS AMDGPUASI;

				DenseSet<MachineBasicBlock *> BlockVisitedSet;
				DenseSet<MachineInstr *> CompilerGeneratedWaitcntSet;

				DenseMap<MachineBasicBlock *, std::unique_ptr<BlockWaitcntBrackets>>
				BlockWaitcntBracketsMap;

				DenseSet<MachineBasicBlock *> BlockWaitcntProcessedSet;

				DenseMap<MachineLoop *, std::unique_ptr<LoopWaitcntData>> LoopWaitcntDataMap;

				std::vector<std::unique_ptr<BlockWaitcntBrackets>> KillWaitBrackets;

				public:
				static char ID;

				SIInsertWaitcnts()
				: MachineFunctionPass(ID), ST(nullptr), TII(nullptr), TRI(nullptr),
				MRI(nullptr), MLI(nullptr) {}

				bool runOnMachineFunction(MachineFunction &MF) override;

				StringRef getPassName() const override {
				return "SI insert wait instructions";
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.setPreservesCFG();
				AU.addRequired<MachineLoopInfo>();
				MachineFunctionPass::getAnalysisUsage(AU);
				}

				void addKillWaitBracket(BlockWaitcntBrackets *Bracket) {
				// The waitcnt information is copied because it changes as the block is
				// traversed.
				KillWaitBrackets.push_back(make_unique<BlockWaitcntBrackets>(*Bracket));
				}

				MachineInstr *generateSWaitCntInstBefore(MachineInstr &MI,
				BlockWaitcntBrackets *ScoreBrackets);
				void updateEventWaitCntAfter(MachineInstr &Inst,
				BlockWaitcntBrackets *ScoreBrackets);
				void mergeInputScoreBrackets(MachineBasicBlock &Block);
				MachineBasicBlock *loopBottom(const MachineLoop *Loop);
				void insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block);
				void insertWaitcntBeforeCF(MachineBasicBlock &Block, MachineInstr *Inst);
				};

				} // End anonymous namespace.

				RegInterval BlockWaitcntBrackets::getRegInterval(const MachineInstr *MI,
				const SIInstrInfo *TII,
				const MachineRegisterInfo *MRI,
				const SIRegisterInfo *TRI,
				unsigned OpNo,
				bool Def) const {
				const MachineOperand &Op = MI->getOperand(OpNo);
				if (!Op.isReg() || !TRI->isInAllocatableClass(Op.getReg()) ||
				(Def && !Op.isDef()))
				return { -1, -1 };

				// A use via a PW operand does not need a waitcnt.
				// A partial write is not a WAW.
				assert(!Op.getSubReg() || !Op.isUndef());

				RegInterval Result;
				const MachineRegisterInfo &MRIA = *MRI;

				unsigned Reg = TRI->getEncodingValue(Op.getReg());

				if (TRI->isVGPR(MRIA, Op.getReg())) {
				assert(Reg >= RegisterEncoding.VGPR0 && Reg <= RegisterEncoding.VGPRL);
				Result.first = Reg - RegisterEncoding.VGPR0;
				assert(Result.first >= 0 && Result.first < SQ_MAX_PGM_VGPRS);
				}
				else if (TRI->isSGPRReg(MRIA, Op.getReg())) {
				assert(Reg >= RegisterEncoding.SGPR0 && Reg < SQ_MAX_PGM_SGPRS);
				Result.first = Reg - RegisterEncoding.SGPR0 + NUM_ALL_VGPRS;
				assert(Result.first >= NUM_ALL_VGPRS &&
				Result.first < SQ_MAX_PGM_SGPRS + NUM_ALL_VGPRS);
				}
				// TODO: Handle TTMP
				t-tye (inline comment, Not Done): Could M0, EXEC, etc. also potentially be a source or dest for SMEM and VMEM load/store? Perhaps add an assert to ensure they never happen.
				arsenm (inline comment, Not Done): Those can't even be encoded for VMEM. M0 isn't in an allocatable class, and exec is reserved so neither one will appear there either. They are also disallowed by the operand constraints so will be a verifier error.
				// else if (TRI->isTTMP(MRIA, Reg.getReg())) ...
				else
				return { -1, -1 };

				const MachineInstr &MIA = *MI;
				const TargetRegisterClass *RC = TII->getOpRegClass(MIA, OpNo);
				unsigned Size = RC->getSize();
				Result.second = Result.first + (Size / 4);

				return Result;
				}
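
The index mapping that `getRegInterval` performs can be illustrated on its own: VGPR slots come first, SGPR slots are offset past them, and an operand covers one slot per 32 bits. The sketch below is hypothetical. `makeRegInterval` and the value of `kNumAllVgprs` are assumptions made for illustration, not the constants used by the pass.

```cpp
#include <cassert>
#include <utility>

// Hypothetical slot count for the VGPR part of the scoreboard; the real
// value is defined elsewhere in the pass and may differ.
constexpr int kNumAllVgprs = 257;

// Half-open interval [first, second) of scoreboard slots.
using RegInterval = std::pair<int, int>;

// EncIndex is the zero-based index of the operand's first register within
// its register file. SizeInBytes is the register class size; each 32-bit
// register occupies one slot.
RegInterval makeRegInterval(bool IsVgpr, int EncIndex, int SizeInBytes) {
  assert(EncIndex >= 0 && SizeInBytes > 0 && SizeInBytes % 4 == 0);
  int First = IsVgpr ? EncIndex : EncIndex + kNumAllVgprs;
  return {First, First + SizeInBytes / 4};
}
```

Under these assumptions a 64-bit VGPR pair starting at v4 maps to slots [4, 6), and a 128-bit SGPR tuple starting at s10 maps to slots [267, 271).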

				void BlockWaitcntBrackets::setExpScore(const MachineInstr *MI,
				const SIInstrInfo *TII,
				const SIRegisterInfo *TRI,
				const MachineRegisterInfo *MRI,
				unsigned OpNo, int32_t Val) {
				RegInterval Interval = getRegInterval(MI, TII, MRI, TRI, OpNo, false);
				DEBUG({
				const MachineOperand &Opnd = MI->getOperand(OpNo);
				assert(TRI->isVGPR(*MRI, Opnd.getReg()));
				});
				for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
				setRegScore(RegNo, EXP_CNT, Val);
				}
				}

				void BlockWaitcntBrackets::updateByEvent(const SIInstrInfo *TII,
				const SIRegisterInfo *TRI,
				const MachineRegisterInfo *MRI,
				WaitEventType E, MachineInstr &Inst) {
				const MachineRegisterInfo &MRIA = *MRI;
				InstCounterType T = eventCounter(E);
				int32_t CurrScore = getScoreUB(T) + 1;
				// EventUB and ScoreUB need to be updated regardless of whether this event
				// changes the score of a register. Examples include vm_cnt for a
				// buffer store and lgkm_cnt for a send-message.
				EventUBs[E] = CurrScore;
				setScoreUB(T, CurrScore);

				if (T == EXP_CNT) {
				// Check for mixed export types. If they are mixed, then a waitcnt exp(0)
				// is required.
				if (!MixedExpTypes) {
				MixedExpTypes = counterOutOfOrder(EXP_CNT);
				}

				// Put score on the source vgprs. If this is a store, just use those
				// specific register(s).
				if (TII->isDS(Inst) && (Inst.mayStore() || Inst.mayLoad())) {
				// All GDS operations must protect their address register (same as
				// export.)
				if (Inst.getOpcode() != AMDGPU::DS_APPEND &&
				Inst.getOpcode() != AMDGPU::DS_CONSUME) {
				setExpScore(
				&Inst, TII, TRI, MRI,
				AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::addr),
				CurrScore);
				}
				if (Inst.mayStore()) {
				setExpScore(
				&Inst, TII, TRI, MRI,
				AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::data0),
				CurrScore);
				if (AMDGPU::getNamedOperandIdx(Inst.getOpcode(),
				AMDGPU::OpName::data1) != -1) {
				setExpScore(&Inst, TII, TRI, MRI,
				AMDGPU::getNamedOperandIdx(Inst.getOpcode(),
				AMDGPU::OpName::data1),
				CurrScore);
				}
				} else if (AMDGPU::getAtomicNoRetOp(Inst.getOpcode()) != -1 &&
				Inst.getOpcode() != AMDGPU::DS_GWS_INIT &&
				Inst.getOpcode() != AMDGPU::DS_GWS_SEMA_V &&
				Inst.getOpcode() != AMDGPU::DS_GWS_SEMA_BR &&
				Inst.getOpcode() != AMDGPU::DS_GWS_SEMA_P &&
				Inst.getOpcode() != AMDGPU::DS_GWS_BARRIER &&
				Inst.getOpcode() != AMDGPU::DS_APPEND &&
				Inst.getOpcode() != AMDGPU::DS_CONSUME &&
				Inst.getOpcode() != AMDGPU::DS_ORDERED_COUNT) {
				for (unsigned I = 0, E = Inst.getNumOperands(); I != E; ++I) {
				const MachineOperand &Op = Inst.getOperand(I);
				if (Op.isReg() && !Op.isDef() && TRI->isVGPR(MRIA, Op.getReg())) {
				setExpScore(&Inst, TII, TRI, MRI, I, CurrScore);
				}
				}
				}
				} else if (TII->isFLAT(Inst)) {
				if (Inst.mayStore()) {
				setExpScore(
				&Inst, TII, TRI, MRI,
				AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::data),
				arsenm (inline comment, Not Done): Can't you just get the ::data operand and check if it is a use or def? Same for the other places checking getAtomicNoRetOp.
				CurrScore);
				} else if (AMDGPU::getAtomicNoRetOp(Inst.getOpcode()) != -1) {
				setExpScore(
				&Inst, TII, TRI, MRI,
				AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::data),
				CurrScore);
				}
				} else if (TII->isMIMG(Inst)) {
				if (Inst.mayStore()) {
				setExpScore(&Inst, TII, TRI, MRI, 0, CurrScore);
				} else if (AMDGPU::getAtomicNoRetOp(Inst.getOpcode()) != -1) {
				setExpScore(
				&Inst, TII, TRI, MRI,
				AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::data),
				CurrScore);
				}
				} else if (TII->isMTBUF(Inst)) {
				if (Inst.mayStore()) {
				setExpScore(&Inst, TII, TRI, MRI, 0, CurrScore);
				}
				} else if (TII->isMUBUF(Inst)) {
				if (Inst.mayStore()) {
				setExpScore(&Inst, TII, TRI, MRI, 0, CurrScore);
				} else if (AMDGPU::getAtomicNoRetOp(Inst.getOpcode()) != -1) {
				setExpScore(
				&Inst, TII, TRI, MRI,
				AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::data),
				CurrScore);
				}
				} else {
				if (TII->isEXP(Inst)) {
				// For export the destination registers are really temps that
				// can be used as the actual source after export patching, so
				// we need to treat them like sources and set the EXP_CNT
				// score.
				for (unsigned I = 0, E = Inst.getNumOperands(); I != E; ++I) {
				MachineOperand &DefMO = Inst.getOperand(I);
				if (DefMO.isReg() && DefMO.isDef() &&
				TRI->isVGPR(MRIA, DefMO.getReg())) {
				setRegScore(TRI->getEncodingValue(DefMO.getReg()), EXP_CNT,
				CurrScore);
				}
				}
				}
				for (unsigned I = 0, E = Inst.getNumOperands(); I != E; ++I) {
				MachineOperand &MO = Inst.getOperand(I);
				if (MO.isReg() && !MO.isDef() && TRI->isVGPR(MRIA, MO.getReg())) {
				setExpScore(&Inst, TII, TRI, MRI, I, CurrScore);
				}
				}
				}
				#if 0 // TODO: check if this is handled by MUBUF code above.
				} else if (Inst.getOpcode() == AMDGPU::BUFFER_STORE_DWORD ||
				Inst.getOpcode() == AMDGPU::BUFFER_STORE_DWORDX2 ||
				Inst.getOpcode() == AMDGPU::BUFFER_STORE_DWORDX4) {
				MachineOperand *MO = TII->getNamedOperand(Inst, AMDGPU::OpName::data);
				unsigned OpNo;//TODO: find the OpNo for this operand;
				RegInterval Interval = getRegInterval(&Inst, TII, MRI, TRI, OpNo, false);
				for (signed RegNo = Interval.first; RegNo < Interval.second;
				++RegNo) {
				setRegScore(RegNo + NUM_ALL_VGPRS, t, CurrScore);
				}
				#endif
				} else {
				arsenm (inline comment, Not Done): This can be deleted.
				// Match the score to the destination registers.
				for (unsigned I = 0, E = Inst.getNumOperands(); I != E; ++I) {
				RegInterval Interval = getRegInterval(&Inst, TII, MRI, TRI, I, true);
				if (T == VM_CNT && Interval.first >= NUM_ALL_VGPRS)
				continue;
				for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
				setRegScore(RegNo, T, CurrScore);
				}
				}
				if (TII->isDS(Inst) && Inst.mayStore()) {
				setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
				}
				}
				}

				void BlockWaitcntBrackets::print(raw_ostream &OS) {
				OS << '\n';
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				int LB = getScoreLB(T);
				int UB = getScoreUB(T);

				switch (T) {
				case VM_CNT:
				OS << " VM_CNT(" << UB - LB << "): ";
				break;
				case LGKM_CNT:
				OS << " LGKM_CNT(" << UB - LB << "): ";
				break;
				case EXP_CNT:
				OS << " EXP_CNT(" << UB - LB << "): ";
				break;
				default:
				OS << " UNKNOWN(" << UB - LB << "): ";
				break;
				}

				if (LB < UB) {
				// Print vgpr scores.
				for (int J = 0; J <= getMaxVGPR(); J++) {
				int RegScore = getRegScore(J, T);
				if (RegScore <= LB)
				continue;
				int RelScore = RegScore - LB - 1;
				if (J < SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS) {
				OS << RelScore << ":v" << J << " ";
				} else {
				OS << RelScore << ":ds ";
				}
				}
				// Also need to print sgpr scores for lgkm_cnt.
				if (T == LGKM_CNT) {
				for (int J = 0; J <= getMaxSGPR(); J++) {
				int RegScore = getRegScore(J + NUM_ALL_VGPRS, LGKM_CNT);
				if (RegScore <= LB)
				continue;
				int RelScore = RegScore - LB - 1;
				OS << RelScore << ":s" << J << " ";
				}
				}
				}
				OS << '\n';
				}
				OS << '\n';
				return;
				}

				unsigned int BlockWaitcntBrackets::updateByWait(InstCounterType T,
				int ScoreToWait) {
				unsigned int NeedWait = 0;
				if (ScoreToWait == -1) {
				// The score to wait is unknown. This implies that it was not encountered
				// during the path of the CFG walk done during the current traversal but
				// may be seen on a different path. Emit an s_waitcnt with a
				// conservative value of 0 for the counter.
				NeedWait = CNT_MASK(T);
				setScoreLB(T, getScoreUB(T));
				return NeedWait;
				}

				// If the score of src_operand falls within the bracket, we need an
				// s_waitcnt instruction.
				const int32_t LB = getScoreLB(T);
				const int32_t UB = getScoreUB(T);
				if ((UB >= ScoreToWait) && (ScoreToWait > LB)) {
				if (T == VM_CNT && hasPendingFlat()) {
				// If there is a pending FLAT operation, and this is a VM waitcnt,
				// then we need to force a waitcnt 0 for VM.
				NeedWait = CNT_MASK(T);
				setScoreLB(T, getScoreUB(T));
				} else if (counterOutOfOrder(T)) {
				// Counter can get decremented out-of-order when there
				// are multiple event types in the bracket. Also emit an s_waitcnt
				// with a conservative value of 0 for the counter.
				NeedWait = CNT_MASK(T);
				setScoreLB(T, getScoreUB(T));
				} else {
				NeedWait = CNT_MASK(T);
				setScoreLB(T, ScoreToWait);
				}
				}

				return NeedWait;
				}
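
The decision made by `updateByWait` reduces to three outcomes: no wait, a wait that retires everything up to the requested score, or a conservative wait for zero. The following is a minimal sketch under assumptions. `WaitKind`, `Bracket`, and `decideWait` are hypothetical names, and the single `ForceZero` flag stands in for both the pending-FLAT check and `counterOutOfOrder`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: what kind of wait, if any, is required.
enum class WaitKind { None, Exact, Zero };

// Hypothetical stand-in for one counter's lower and upper bound scores.
struct Bracket {
  int32_t LB = 0;
  int32_t UB = 0;
};

// Decide whether a wait is needed for ScoreToWait and advance the lower
// bound accordingly. ForceZero models the cases where the counter cannot
// be trusted to decrement in order.
WaitKind decideWait(Bracket &B, int32_t ScoreToWait, bool ForceZero) {
  assert(B.LB <= B.UB && "bracket bounds must be ordered");
  if (ScoreToWait == -1) {
    // Unknown score: be conservative and retire everything.
    B.LB = B.UB;
    return WaitKind::Zero;
  }
  // Scores outside (LB, UB] are already complete or were never issued.
  if (ScoreToWait <= B.LB || ScoreToWait > B.UB)
    return WaitKind::None;
  if (ForceZero) {
    B.LB = B.UB;
    return WaitKind::Zero;
  }
  // In-order case: retire only events up to the requested score.
  B.LB = ScoreToWait;
  return WaitKind::Exact;
}
```

Returning a three-way result instead of a counter mask is a simplification chosen for readability; the function above returns a mask so that requests for several counters can be OR-ed together.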

				// Where there are multiple types of event in the bracket of a counter,
				// the decrement may go out of order.
				bool BlockWaitcntBrackets::counterOutOfOrder(InstCounterType T) {
				switch (T) {
				case VM_CNT:
				return false;
				case LGKM_CNT: {
				if (EventUBs[SMEM_ACCESS] > ScoreLBs[LGKM_CNT] &&
				EventUBs[SMEM_ACCESS] <= ScoreUBs[LGKM_CNT]) {
				// Scalar memory read always can go out of order.
				return true;
				}
				int NumEventTypes = 0;
				if (EventUBs[LDS_ACCESS] > ScoreLBs[LGKM_CNT] &&
				EventUBs[LDS_ACCESS] <= ScoreUBs[LGKM_CNT]) {
				NumEventTypes++;
				}
				if (EventUBs[GDS_ACCESS] > ScoreLBs[LGKM_CNT] &&
				EventUBs[GDS_ACCESS] <= ScoreUBs[LGKM_CNT]) {
				NumEventTypes++;
				}
				if (EventUBs[SQ_MESSAGE] > ScoreLBs[LGKM_CNT] &&
				EventUBs[SQ_MESSAGE] <= ScoreUBs[LGKM_CNT]) {
				NumEventTypes++;
				}
				if (NumEventTypes <= 1) {
				return false;
				}
				break;
				}
				case EXP_CNT: {
				// If there has been a mixture of export types, then a waitcnt exp(0) is
				// required.
				if (MixedExpTypes)
				return true;
				int NumEventTypes = 0;
				if (EventUBs[EXP_GPR_LOCK] > ScoreLBs[EXP_CNT] &&
				EventUBs[EXP_GPR_LOCK] <= ScoreUBs[EXP_CNT]) {
				NumEventTypes++;
				}
				if (EventUBs[GDS_GPR_LOCK] > ScoreLBs[EXP_CNT] &&
				EventUBs[GDS_GPR_LOCK] <= ScoreUBs[EXP_CNT]) {
				NumEventTypes++;
				}
				if (EventUBs[VMW_GPR_LOCK] > ScoreLBs[EXP_CNT] &&
				EventUBs[VMW_GPR_LOCK] <= ScoreUBs[EXP_CNT]) {
				NumEventTypes++;
				}
				if (EventUBs[EXP_PARAM_ACCESS] > ScoreLBs[EXP_CNT] &&
				EventUBs[EXP_PARAM_ACCESS] <= ScoreUBs[EXP_CNT]) {
				NumEventTypes++;
				}

				if (EventUBs[EXP_POS_ACCESS] > ScoreLBs[EXP_CNT] &&
				EventUBs[EXP_POS_ACCESS] <= ScoreUBs[EXP_CNT]) {
				NumEventTypes++;
				}

				if (NumEventTypes <= 1) {
				return false;
				}
				break;
				}
				default:
				break;
				}
				return true;
				}
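
The repeated range checks in `counterOutOfOrder` all ask the same question: how many distinct event types still have their latest score inside the bracket? A compact sketch of that test follows. `inBracket` and `mixedEventsPending` are hypothetical helpers, and the sketch leaves out the special cases above (SMEM always counting as out of order, and the `MixedExpTypes` flag).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True if an event type's latest score is still inside (LB, UB].
bool inBracket(int32_t EventUB, int32_t LB, int32_t UB) {
  return EventUB > LB && EventUB <= UB;
}

// True if more than one event type is pending on the counter, in which
// case the counter may decrement out of order.
bool mixedEventsPending(const std::vector<int32_t> &EventUBs, int32_t LB,
                        int32_t UB) {
  assert(LB <= UB && "bracket bounds must be ordered");
  int NumEventTypes = 0;
  for (int32_t EventUB : EventUBs)
    if (inBracket(EventUB, LB, UB))
      ++NumEventTypes;
  return NumEventTypes > 1;
}
```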

				INITIALIZE_PASS_BEGIN(SIInsertWaitcnts, DEBUG_TYPE, "SI Insert Waitcnts", false,
				false)
				INITIALIZE_PASS_END(SIInsertWaitcnts, DEBUG_TYPE, "SI Insert Waitcnts", false,
				false)

				char SIInsertWaitcnts::ID = 0;

				char &llvm::SIInsertWaitcntsID = SIInsertWaitcnts::ID;

				arsenm (inline comment, Done): C++ style comments
				FunctionPass *llvm::createSIInsertWaitcntsPass() {
				return new SIInsertWaitcnts();
				}

				static bool readsVCCZ(const MachineInstr &MI) {
				unsigned Opc = MI.getOpcode();
				return (Opc == AMDGPU::S_CBRANCH_VCCNZ || Opc == AMDGPU::S_CBRANCH_VCCZ) &&
				!MI.getOperand(1).isUndef();
				}

				/// \brief Generate s_waitcnt instruction to be placed before cur_Inst.
				/// Instructions of a given type are returned in order,
				/// but instructions of different types can complete out of order.
				/// We rely on this in-order completion
				/// and simply assign a score to the memory access instructions.
				/// We keep track of the active "score bracket" to determine
				/// if an access of a memory read requires an s_waitcnt
				arsenm (inline comment, Done): nullptr
				/// and if so what the value of each counter is.
				/// The "score bracket" is bound by the lower bound and upper bound
				/// scores (_score_LB and _score_ub respectively).
				MachineInstr *SIInsertWaitcnts::generateSWaitCntInstBefore(
				MachineInstr &MI, BlockWaitcntBrackets *ScoreBrackets) {
				// To emit, or not to emit - that's the question!
				arsenm (inline comment, Done): Linemapping->LineMapping
				// Start with an assumption that there is no need to emit.
				arsenm (inline comment, Done): MI.isDebugValue()
				unsigned int EmitSwaitcnt = 0;
				// s_waitcnt instruction to return; default is NULL.
				MachineInstr *SWaitInst = nullptr;
				// No need to wait before phi. If a phi-move exists, then the wait should
				// have been inserted before the move. If a phi-move does not exist, then
				// wait should be inserted before the real use. The same is true for
				// sc-merge. It is not a coincidence that all these cases correspond to the
				// instructions that are skipped in the assembling loop.
				bool NeedLineMapping = false; // TODO: Check on this.
				if (MI.isDebugValue() &&
				// TODO: any other opcode?
				!NeedLineMapping) {
				return SWaitInst;
				}

				// See if an s_waitcnt is forced at block entry, or is needed at
				// program end.
				if (ScoreBrackets->getWaitAtBeginning()) {
				// Note that we have already cleared the state, so we don't need to update
				// it.
				ScoreBrackets->clearWaitAtBeginning();
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				EmitSwaitcnt |= CNT_MASK(T);
				ScoreBrackets->setScoreLB(T, ScoreBrackets->getScoreUB(T));
				}
				}

				// See if this instruction has a forced S_WAITCNT VM.
				// TODO: Handle other cases of NeedsWaitcntVmBefore()
				else if (MI.getOpcode() == AMDGPU::BUFFER_WBINVL1 ||
				MI.getOpcode() == AMDGPU::BUFFER_WBINVL1_SC ||
				rampitec (inline comment, Done): I believe there is no SI_RETURN anymore, it is renamed.
				MI.getOpcode() == AMDGPU::BUFFER_WBINVL1_VOL) {
				EmitSwaitcnt |=
				ScoreBrackets->updateByWait(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));
				}

				// All waits must be resolved at call return.
				// NOTE: this could be improved with knowledge of all call sites or
				// with knowledge of the called routines.
				if (MI.getOpcode() == AMDGPU::RETURN ||
				MI.getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG) {
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				if (ScoreBrackets->getScoreUB(T) > ScoreBrackets->getScoreLB(T)) {
				ScoreBrackets->setScoreLB(T, ScoreBrackets->getScoreUB(T));
				EmitSwaitcnt |= CNT_MASK(T);
				}
				}
				}
				// Resolve vm waits before gs-done.
				else if ((MI.getOpcode() == AMDGPU::S_SENDMSG ||
				MI.getOpcode() == AMDGPU::S_SENDMSGHALT) &&
				((MI.getOperand(0).getImm() & AMDGPU::SendMsg::ID_MASK_) ==
				AMDGPU::SendMsg::ID_GS_DONE)) {
				if (ScoreBrackets->getScoreUB(VM_CNT) > ScoreBrackets->getScoreLB(VM_CNT)) {
				ScoreBrackets->setScoreLB(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));
				EmitSwaitcnt |= CNT_MASK(VM_CNT);
				}
				}
				#if 0 // TODO: the following blocks of logic when we have fence.
				else if (MI.getOpcode() == SC_FENCE) {
				const unsigned int group_size =
				context->shader_info->GetMaxThreadGroupSize();
				// group_size == 0 means thread group size is unknown at compile time
				const bool group_is_multi_wave =
				(group_size == 0 || group_size > target_info->GetWaveFrontSize());
				const bool fence_is_global = !((SCInstInternalMisc*)Inst)->IsGroupFence();

				for (unsigned int i = 0; i < Inst->NumSrcOperands(); i++) {
				SCRegType src_type = Inst->GetSrcType(i);
				switch (src_type) {
				case SCMEM_LDS:
				if (group_is_multi_wave ||
				context->OptFlagIsOn(OPT_R1100_LDSMEM_FENCE_CHICKEN_BIT)) {
				EmitSwaitcnt |= ScoreBrackets->updateByWait(LGKM_CNT,
				ScoreBrackets->getScoreUB(LGKM_CNT));
				// LDS may have to wait for VM_CNT after buffer load to LDS
				if (target_info->HasBufferLoadToLDS()) {
				EmitSwaitcnt |= ScoreBrackets->updateByWait(VM_CNT,
				ScoreBrackets->getScoreUB(VM_CNT));
				}
				}
				break;

				case SCMEM_GDS:
				if (group_is_multi_wave || fence_is_global) {
				EmitSwaitcnt |= ScoreBrackets->updateByWait(EXP_CNT,
				ScoreBrackets->getScoreUB(EXP_CNT));
				EmitSwaitcnt |= ScoreBrackets->updateByWait(LGKM_CNT,
				ScoreBrackets->getScoreUB(LGKM_CNT));
				}
				break;

				case SCMEM_UAV:
				case SCMEM_TFBUF:
				case SCMEM_RING:
				case SCMEM_SCATTER:
				if (group_is_multi_wave || fence_is_global) {
				EmitSwaitcnt |= ScoreBrackets->updateByWait(EXP_CNT,
				ScoreBrackets->getScoreUB(EXP_CNT));
				EmitSwaitcnt |= ScoreBrackets->updateByWait(VM_CNT,
				ScoreBrackets->getScoreUB(VM_CNT));
				}
				break;

				case SCMEM_SCRATCH:
				default:
				break;
				}
				}
				}
				#endif

				// Export & GDS instructions do not read the EXEC mask until after the export
				// is granted (which can occur well after the instruction is issued).
				// The shader program must flush all EXP operations on the export-count
				// before overwriting the EXEC mask.
				else {
				if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
				// Export and GDS are tracked individually, either may trigger a waitcnt
				// for EXEC.
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				EXP_CNT, ScoreBrackets->getEventUB(EXP_GPR_LOCK));
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				EXP_CNT, ScoreBrackets->getEventUB(EXP_PARAM_ACCESS));
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				EXP_CNT, ScoreBrackets->getEventUB(EXP_POS_ACCESS));
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				EXP_CNT, ScoreBrackets->getEventUB(GDS_GPR_LOCK));
				}

				#if 0 // TODO: the following code to handle CALL.
				// The argument passing for CALLs should suffice for VM_CNT and LGKM_CNT.
				// However, there is a problem with EXP_CNT, because the call cannot
				// easily tell if a register is used in the function, and if it did, then
				// the referring instruction would have to have an S_WAITCNT, which is
				// dependent on all call sites. So Instead, force S_WAITCNT for EXP_CNTs
				// before the call.
				if (MI.getOpcode() == SC_CALL) {
				if (ScoreBrackets->getScoreUB(EXP_CNT) >
				ScoreBrackets->getScoreLB(EXP_CNT)) {
				ScoreBrackets->setScoreLB(EXP_CNT, ScoreBrackets->getScoreUB(EXP_CNT));
				arsenm (inline comment, Not Done): I'm concerned by relying on the memoperands since it's possible they were dropped. The uses LDS memory check could at least be factored into a predicate function.
				EmitSwaitcnt |= CNT_MASK(EXP_CNT);
				}
				}
				#endif

				// Look at the source operands of every instruction to see if
				// any of them results from a previous memory operation that affects
				// its current usage. If so, an s_waitcnt instruction needs to be
				// emitted.
				// If the source operand was defined by a load, add the s_waitcnt
				// instruction.
				for (const MachineMemOperand *Memop : MI.memoperands()) {
				unsigned AS = Memop->getAddrSpace();
				if (AS != AMDGPUASI.LOCAL_ADDRESS)
				continue;
				unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
				// VM_CNT is only relevant to vgpr or LDS.
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));
				}
				for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {
				const MachineOperand &Op = MI.getOperand(I);
				const MachineRegisterInfo &MRIA = *MRI;
				RegInterval Interval =
				ScoreBrackets->getRegInterval(&MI, TII, MRI, TRI, I, false);
				for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
				if (TRI->isVGPR(MRIA, Op.getReg())) {
				// VM_CNT is only relevant to vgpr or LDS.
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));
				}
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				LGKM_CNT, ScoreBrackets->getRegScore(RegNo, LGKM_CNT));
				}
				}
				// End of for loop that looks at all source operands to decide vm_wait_cnt
				// and lgk_wait_cnt.

				// Two cases are handled for destination operands:
				// 1) If the destination operand was defined by a load, add the s_waitcnt
				// instruction to guarantee the right WAW order.
				// 2) If a destination operand that was used by a recent export/store ins,
				arsenm (inline comment, Not Done): Ditto
				// add s_waitcnt on exp_cnt to guarantee the WAR order.
				if (MI.mayStore()) {
				for (const MachineMemOperand *Memop : MI.memoperands()) {
				unsigned AS = Memop->getAddrSpace();
				if (AS != AMDGPUASI.LOCAL_ADDRESS)
				continue;
				unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				EXP_CNT, ScoreBrackets->getRegScore(RegNo, EXP_CNT));
				}
				}
				for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {
				MachineOperand &Def = MI.getOperand(I);
				const MachineRegisterInfo &MRIA = *MRI;
				RegInterval Interval =
				ScoreBrackets->getRegInterval(&MI, TII, MRI, TRI, I, true);
				for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
				if (TRI->isVGPR(MRIA, Def.getReg())) {
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				EXP_CNT, ScoreBrackets->getRegScore(RegNo, EXP_CNT));
				}
				EmitSwaitcnt |= ScoreBrackets->updateByWait(
				LGKM_CNT, ScoreBrackets->getRegScore(RegNo, LGKM_CNT));
				}
				} // End of for loop that looks at all dest operands.
				}

				// TODO: Tie force zero to a compiler triage option.
				bool ForceZero = false;

				rampitec (inline comment, Not Done): Not needed anymore.
				if (MI.getOpcode() == AMDGPU::S_BARRIER && ST->needWaitcntBeforeBarrier()) {
				EmitSwaitcnt = true;
				}

				// TODO: Remove this work-around, enable the assert for Bug 457939
				// after fixing the scheduler. Also, the Shader Compiler code is
				// independent of target.
				if (readsVCCZ(MI) && ST->getGeneration() <= SISubtarget::SEA_ISLANDS) {
				if (ScoreBrackets->getScoreLB(LGKM_CNT) <
				ScoreBrackets->getScoreUB(LGKM_CNT) &&
				ScoreBrackets->hasPendingSMEM()) {
				// Wait on everything, not just LGKM. vccz reads usually come from
				[Inline comment, Done] kzhuravl: @rampitec wrote: For a barrier it will always insert strongest: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) regardless of what was an argument of the barrier. This also seems to completely ignore atomic fences inserted around the barrier from the library, which shall be a real source of wait argument. Note, that semantics of needWaitcntBeforeBarrier() is not that we always need to insert wait with barrier, but that we may need to insert it. Also note that existing pass does not seem to do it for a barrier.
				[Inline comment, Done] rampitec: For a barrier it will always insert strongest: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) regardless of what was an argument of the barrier. This also seems to completely ignore atomic fences inserted around the barrier from the library, which shall be a real source of wait argument. Note, that semantics of needWaitcntBeforeBarrier() is not that we always need to insert wait with barrier, but that we may need to insert it. Also note that existing pass does not seem to do it for a barrier.
				[Inline comment, Done] kzhuravl: Atomics are currently handled in a separate pass, which we determined to be conservatively correct. We plan to integrate atomics into waitcnt insertion but later and in a separate patch. The needWaitcntBeforeBarrier tells you whether you need a waitcnt before the barrier or not. For >=GFX9, waitcnt is automatically inserted before the barrier, so we do not need to generate it, and needWaitcntBeforeBarrier returns false for >=GFX9. Also note that existing pass does not seem to do it for a barrier. https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AMDGPU/SIInsertWaits.cpp#L633
				[Inline comment, Not Done] rampitec: Barrier itself does not need fence. OpenCL barrier needs fence, but these are generated in the library.
				// terminators, and we always wait on everything at the end of the
				// block, so if we only wait on LGKM here, we might end up with
				// another s_waitcnt inserted right after this if there are non-LGKM
				// instructions still outstanding.
				ForceZero = true;
				EmitSwaitcnt = true;
				}
				}

				// Does this operand processing indicate s_wait counter update?
				if (EmitSwaitcnt) {
				int CntVal[NUM_INST_CNTS];

				bool UseDefaultWaitcntStrategy = true;
				if (ForceZero) {
				// Force all waitcnts to 0.
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				ScoreBrackets->setScoreLB(T, ScoreBrackets->getScoreUB(T));
				}
				CntVal[VM_CNT] = 0;
				CntVal[EXP_CNT] = 0;
				CntVal[LGKM_CNT] = 0;
				UseDefaultWaitcntStrategy = false;
				}

				if (UseDefaultWaitcntStrategy) {
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				if (EmitSwaitcnt & CNT_MASK(T)) {
				int Delta =
				ScoreBrackets->getScoreUB(T) - ScoreBrackets->getScoreLB(T);
				int MaxDelta = ScoreBrackets->getWaitCountMax(T);
				if (Delta >= MaxDelta) {
				Delta = -1;
				if (T != EXP_CNT) {
				ScoreBrackets->setScoreLB(
				T, ScoreBrackets->getScoreUB(T) - MaxDelta);
				}
				EmitSwaitcnt &= ~CNT_MASK(T);
				}
				CntVal[T] = Delta;
				} else {
				// If we are not waiting for a particular counter then encode
				// it as -1 which means "don't care."
				CntVal[T] = -1;
				}
				}
				}

				// If we are not waiting on any counter we can skip the wait altogether.
				if (EmitSwaitcnt != 0) {
				MachineInstr *OldWaitcnt = ScoreBrackets->getWaitcnt();
				int Imm = (!OldWaitcnt) ? 0 : OldWaitcnt->getOperand(0).getImm();
				[Inline comment, Done] kzhuravl: Use helper functions from AMDGPUBaseInfo.h (they also have logic for gfx9 in it)
				if (!OldWaitcnt || (AMDGPU::decodeVmcnt(IV, Imm) !=
				(CntVal[VM_CNT] & AMDGPU::getVmcntBitMask(IV))) ||
				(AMDGPU::decodeExpcnt(IV, Imm) !=
				(CntVal[EXP_CNT] & AMDGPU::getExpcntBitMask(IV))) ||
				(AMDGPU::decodeLgkmcnt(IV, Imm) !=
				(CntVal[LGKM_CNT] & AMDGPU::getLgkmcntBitMask(IV)))) {
				MachineLoop *ContainingLoop = MLI->getLoopFor(MI.getParent());
				if (ContainingLoop) {
				MachineBasicBlock *TBB = ContainingLoop->getTopBlock();
				BlockWaitcntBrackets *ScoreBracket =
				BlockWaitcntBracketsMap[TBB].get();
				if (!ScoreBracket) {
				assert(BlockVisitedSet.find(TBB) == BlockVisitedSet.end());
				BlockWaitcntBracketsMap[TBB] = make_unique<BlockWaitcntBrackets>();
				ScoreBracket = BlockWaitcntBracketsMap[TBB].get();
				}
				ScoreBracket->setRevisitLoop(true);
				DEBUG(dbgs() << "set-revisit: block"
				<< ContainingLoop->getTopBlock()->getNumber() << '\n';);
				}
				}

				// Update an existing waitcount, or make a new one.
				MachineFunction &MF = *MI.getParent()->getParent();
				if (OldWaitcnt && OldWaitcnt->getOpcode() != AMDGPU::S_WAITCNT) {
				SWaitInst = OldWaitcnt;
				} else {
				SWaitInst = MF.CreateMachineInstr(TII->get(AMDGPU::S_WAITCNT),
				[Inline comment, Done] kzhuravl: Use helper functions from AMDGPUBaseInfo.h (they also have logic for gfx9 in it)
				MI.getDebugLoc());
				CompilerGeneratedWaitcntSet.insert(SWaitInst);
				}

				const MachineOperand &Op =
				MachineOperand::CreateImm(AMDGPU::encodeWaitcnt(
				IV, CntVal[VM_CNT], CntVal[EXP_CNT], CntVal[LGKM_CNT]));
				SWaitInst->addOperand(MF, Op);

				if (CntVal[EXP_CNT] == 0) {
				ScoreBrackets->setMixedExpTypes(false);
				}
				}
				}

				return SWaitInst;
				}
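
The default-strategy arithmetic above can be read in isolation as follows. This is a minimal standalone sketch, not the pass itself: `Bracket` and `pendingWaitValue` are hypothetical names, and it assumes (as the surrounding code suggests) that the lower bound has already been advanced to the score being waited for, so that `UB - LB` is the number of events that may remain outstanding. The sketch leaves out the lower-bound adjustment the pass performs for non-EXP counters when the count does not fit.

```cpp
#include <cassert>

// Hypothetical stand-in for one counter's score bracket: events with a score
// in (LB, UB] are treated as still outstanding.
struct Bracket {
  int LB; // score of the most recent event known to have completed
  int UB; // score of the most recent event issued
};

// Returns the value to encode for this counter, or -1 for "don't care".
// If the number of events allowed to remain outstanding does not fit below
// the counter's maximum (MaxDelta), the counter is dropped from the wait.
int pendingWaitValue(const Bracket &B, int MaxDelta) {
  assert(B.LB <= B.UB && "lower bound must not pass the upper bound");
  int Delta = B.UB - B.LB;
  if (Delta >= MaxDelta)
    return -1; // not expressible in the counter field
  return Delta;
}
```

For example, a bracket with `LB = 3` and `UB = 5` and a maximum of 15 yields a wait value of 2, while a bracket spanning 20 events yields -1.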

				void SIInsertWaitcnts::insertWaitcntBeforeCF(MachineBasicBlock &MBB,
				MachineInstr *Waitcnt) {
				if (MBB.empty()) {
				MBB.push_back(Waitcnt);
				return;
				}

				MachineBasicBlock::iterator It = MBB.end();
				MachineInstr *MI = &*(--It);
				if (MI->isBranch()) {
				MBB.insert(It, Waitcnt);
				} else {
				MBB.push_back(Waitcnt);
				}

				return;
				}

				void SIInsertWaitcnts::updateEventWaitCntAfter(
				MachineInstr &Inst, BlockWaitcntBrackets *ScoreBrackets) {
				// Now look at the instruction opcode. If it is a memory access
				// instruction, update the upper-bound of the appropriate counter's
				// bracket and the destination operand scores.
				// TODO: Use the (TSFlags & SIInstrFlags::LGKM_CNT) property everywhere.
				if (TII->isDS(Inst) && (Inst.mayLoad() || Inst.mayStore())) {
				if (TII->getNamedOperand(Inst, AMDGPU::OpName::gds)->getImm() != 0) {
				ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_ACCESS, Inst);
				ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_GPR_LOCK, Inst);
				} else {
				[Inline comment, Done] arsenm: This should check the gds operand, not the mem operands. This also needs MIR tests since we don't emit GDS operations now
				ScoreBrackets->updateByEvent(TII, TRI, MRI, LDS_ACCESS, Inst);
				}
				} else if (TII->isFLAT(Inst)) {
				assert(Inst.mayLoad() || Inst.mayStore());
				ScoreBrackets->updateByEvent(TII, TRI, MRI, VMEM_ACCESS, Inst);
				ScoreBrackets->updateByEvent(TII, TRI, MRI, LDS_ACCESS, Inst);

				// This is a flat memory operation. Check to see if it has memory
				// tokens for both LDS and Memory, and if so mark it as a flat.
				bool FoundLDSMem = false;
				for (const MachineMemOperand *Memop : Inst.memoperands()) {
				unsigned AS = Memop->getAddrSpace();
				if (AS == AMDGPUASI.LOCAL_ADDRESS || AS == AMDGPUASI.FLAT_ADDRESS)
				FoundLDSMem = true;
				}

				// This is a flat memory operation, so note it - it will require
				// that both the VM and LGKM be flushed to zero if it is pending when
				// a VM or LGKM dependency occurs.
				if (FoundLDSMem) {
				ScoreBrackets->setPendingFlat();
				}
				} else if (SIInstrInfo::isVMEM(Inst) &&
				// TODO: get a better carve out.
				Inst.getOpcode() != AMDGPU::BUFFER_WBINVL1 &&
				Inst.getOpcode() != AMDGPU::BUFFER_WBINVL1_SC &&
				Inst.getOpcode() != AMDGPU::BUFFER_WBINVL1_VOL) {
				ScoreBrackets->updateByEvent(TII, TRI, MRI, VMEM_ACCESS, Inst);
				if ( // TODO: assumed yes -- target_info->MemWriteNeedsExpWait() &&
				(Inst.mayStore() || AMDGPU::getAtomicNoRetOp(Inst.getOpcode()))) {
				ScoreBrackets->updateByEvent(TII, TRI, MRI, VMW_GPR_LOCK, Inst);
				}
				} else if (TII->isSMRD(Inst)) {
				ScoreBrackets->updateByEvent(TII, TRI, MRI, SMEM_ACCESS, Inst);
				} else {
				switch (Inst.getOpcode()) {
				case AMDGPU::S_SENDMSG:
				case AMDGPU::S_SENDMSGHALT:
				ScoreBrackets->updateByEvent(TII, TRI, MRI, SQ_MESSAGE, Inst);
				break;
				case AMDGPU::EXP:
				case AMDGPU::EXP_DONE: {
				int Imm = TII->getNamedOperand(Inst, AMDGPU::OpName::tgt)->getImm();
				if (Imm >= 32 && Imm <= 63)
				ScoreBrackets->updateByEvent(TII, TRI, MRI, EXP_PARAM_ACCESS, Inst);
				else if (Imm >= 12 && Imm <= 15)
				ScoreBrackets->updateByEvent(TII, TRI, MRI, EXP_POS_ACCESS, Inst);
				[Inline comment, Done] arsenm: getNamedOperand
				else
				ScoreBrackets->updateByEvent(TII, TRI, MRI, EXP_GPR_LOCK, Inst);
				break;
				}
				case AMDGPU::S_MEMTIME:
				case AMDGPU::S_MEMREALTIME:
				ScoreBrackets->updateByEvent(TII, TRI, MRI, SMEM_ACCESS, Inst);
				break;
				default:
				break;
				}
				}
				}
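
The export-target ranges used in the `EXP` / `EXP_DONE` case above can be isolated as a small classifier. This is only an illustrative sketch: `ExpEvent` and `classifyExportTarget` are hypothetical names that do not appear in the patch, and the ranges are copied from the code above rather than from any hardware document. The non-negative check is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical labels for the three export-related wait events that the
// switch above distinguishes.
enum class ExpEvent { ParamAccess, PosAccess, GprLock };

// Maps an export target immediate to the event that would be recorded.
// Ranges follow the code above: 32..63 -> parameter, 12..15 -> position,
// anything else -> GPR lock.
ExpEvent classifyExportTarget(int Tgt) {
  assert(Tgt >= 0 && "sketch assumes a non-negative target immediate");
  if (Tgt >= 32 && Tgt <= 63)
    return ExpEvent::ParamAccess;
  if (Tgt >= 12 && Tgt <= 15)
    return ExpEvent::PosAccess;
  return ExpEvent::GprLock;
}
```

Under these ranges, target 0 falls through to the GPR-lock event, target 12 is a position export, and target 32 is a parameter export.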

				void SIInsertWaitcnts::mergeInputScoreBrackets(MachineBasicBlock &Block) {
				BlockWaitcntBrackets *ScoreBrackets = BlockWaitcntBracketsMap[&Block].get();
				int32_t MaxPending[NUM_INST_CNTS] = {0};
				int32_t MaxFlat[NUM_INST_CNTS] = {0};
				bool MixedExpTypes = false;

				// Clear the score bracket state.
				ScoreBrackets->clear();

				// Compute the number of pending elements on block entry.

				// IMPORTANT NOTE: If iterative handling of loops is added, the code will
				// need to handle single BBs with backedges to themselves. This means that
				// they will need to retain and not clear their initial state.

				// See if there are any uninitialized predecessors. If so, emit an
				// s_waitcnt 0 at the beginning of the block.
				for (MachineBasicBlock *pred : Block.predecessors()) {
				BlockWaitcntBrackets *PredScoreBrackets =
				BlockWaitcntBracketsMap[pred].get();
				bool Visited = BlockVisitedSet.find(pred) != BlockVisitedSet.end();
				if (!Visited || PredScoreBrackets->getWaitAtBeginning()) {
				break;
				}
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				int span =
				PredScoreBrackets->getScoreUB(T) - PredScoreBrackets->getScoreLB(T);
				MaxPending[T] = std::max(MaxPending[T], span);
				span =
				PredScoreBrackets->pendingFlat(T) - PredScoreBrackets->getScoreLB(T);
				MaxFlat[T] = std::max(MaxFlat[T], span);
				}

				MixedExpTypes |= PredScoreBrackets->mixedExpTypes();
				}

				// TODO: Is SC Block->IsMainExit() same as Block.succ_empty()?
				// Also handle kills for exit block.
				if (Block.succ_empty() && !KillWaitBrackets.empty()) {
				for (unsigned int I = 0; I < KillWaitBrackets.size(); I++) {
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				int Span = KillWaitBrackets[I]->getScoreUB(T) -
				KillWaitBrackets[I]->getScoreLB(T);
				MaxPending[T] = std::max(MaxPending[T], Span);
				Span = KillWaitBrackets[I]->pendingFlat(T) -
				KillWaitBrackets[I]->getScoreLB(T);
				MaxFlat[T] = std::max(MaxFlat[T], Span);
				}

				MixedExpTypes |= KillWaitBrackets[I]->mixedExpTypes();
				}
				}

				// Special handling for GDS_GPR_LOCK and EXP_GPR_LOCK.
				for (MachineBasicBlock *Pred : Block.predecessors()) {
				BlockWaitcntBrackets *PredScoreBrackets =
				BlockWaitcntBracketsMap[Pred].get();
				bool Visited = BlockVisitedSet.find(Pred) != BlockVisitedSet.end();
				if (!Visited || PredScoreBrackets->getWaitAtBeginning()) {
				break;
				}

				int GDSSpan = PredScoreBrackets->getEventUB(GDS_GPR_LOCK) -
				PredScoreBrackets->getScoreLB(EXP_CNT);
				MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], GDSSpan);
				int EXPSpan = PredScoreBrackets->getEventUB(EXP_GPR_LOCK) -
				PredScoreBrackets->getScoreLB(EXP_CNT);
				MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], EXPSpan);
				}

				// TODO: Is SC Block->IsMainExit() same as Block.succ_empty()?
				if (Block.succ_empty() && !KillWaitBrackets.empty()) {
				for (unsigned int I = 0; I < KillWaitBrackets.size(); I++) {
				int GDSSpan = KillWaitBrackets[I]->getEventUB(GDS_GPR_LOCK) -
				KillWaitBrackets[I]->getScoreLB(EXP_CNT);
				MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], GDSSpan);
				int EXPSpan = KillWaitBrackets[I]->getEventUB(EXP_GPR_LOCK) -
				KillWaitBrackets[I]->getScoreLB(EXP_CNT);
				MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], EXPSpan);
				}
				}

				#if 0
				// LC does not (unlike) add a waitcnt at beginning. Leaving it as marker.
				// TODO: how does LC distinguish between function entry and main entry?
				// If this is the entry to a function, force a wait.
				MachineBasicBlock &Entry = Block.getParent()->front();
				if (Entry.getNumber() == Block.getNumber()) {
				ScoreBrackets->setWaitAtBeginning();
				return;
				}
				#endif

				// Now set the current Block's brackets to the largest ending bracket.
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				ScoreBrackets->setScoreUB(T, MaxPending[T]);
				ScoreBrackets->setScoreLB(T, 0);
				ScoreBrackets->setLastFlat(T, MaxFlat[T]);
				}

				ScoreBrackets->setMixedExpTypes(MixedExpTypes);

				// Set the register scoreboard.
				for (MachineBasicBlock *Pred : Block.predecessors()) {
				if (BlockVisitedSet.find(Pred) == BlockVisitedSet.end()) {
				break;
				}

				BlockWaitcntBrackets *PredScoreBrackets =
				BlockWaitcntBracketsMap[Pred].get();

				// Now merge the gpr_reg_score information
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				int PredLB = PredScoreBrackets->getScoreLB(T);
				int PredUB = PredScoreBrackets->getScoreUB(T);
				if (PredLB < PredUB) {
				int PredScale = MaxPending[T] - PredUB;
				// Merge vgpr scores.
				for (int J = 0; J <= PredScoreBrackets->getMaxVGPR(); J++) {
				int PredRegScore = PredScoreBrackets->getRegScore(J, T);
				if (PredRegScore <= PredLB)
				continue;
				int NewRegScore = PredScale + PredRegScore;
				ScoreBrackets->setRegScore(
				J, T, std::max(ScoreBrackets->getRegScore(J, T), NewRegScore));
				}
				// Also need to merge sgpr scores for lgkm_cnt.
				if (T == LGKM_CNT) {
				for (int J = 0; J <= PredScoreBrackets->getMaxSGPR(); J++) {
				int PredRegScore =
				PredScoreBrackets->getRegScore(J + NUM_ALL_VGPRS, LGKM_CNT);
				if (PredRegScore <= PredLB)
				continue;
				int NewRegScore = PredScale + PredRegScore;
				ScoreBrackets->setRegScore(
				J + NUM_ALL_VGPRS, LGKM_CNT,
				std::max(
				ScoreBrackets->getRegScore(J + NUM_ALL_VGPRS, LGKM_CNT),
				NewRegScore));
				}
				}
				}
				}

				// Also merge the WaitEvent information.
				ForAllWaitEventType(W) {
				enum InstCounterType T = PredScoreBrackets->eventCounter(W);
				int PredEventUB = PredScoreBrackets->getEventUB(W);
				if (PredEventUB > PredScoreBrackets->getScoreLB(T)) {
				int NewEventUB =
				MaxPending[T] + PredEventUB - PredScoreBrackets->getScoreUB(T);
				if (NewEventUB > 0) {
				ScoreBrackets->setEventUB(
				W, std::max(ScoreBrackets->getEventUB(W), NewEventUB));
				}
				}
				}
				}

				// TODO: Is SC Block->IsMainExit() same as Block.succ_empty()?
				// Set the register scoreboard.
				if (Block.succ_empty() && !KillWaitBrackets.empty()) {
				for (unsigned int I = 0; I < KillWaitBrackets.size(); I++) {
				// Now merge the gpr_reg_score information.
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				int PredLB = KillWaitBrackets[I]->getScoreLB(T);
				int PredUB = KillWaitBrackets[I]->getScoreUB(T);
				if (PredLB < PredUB) {
				int PredScale = MaxPending[T] - PredUB;
				// Merge vgpr scores.
				for (int J = 0; J <= KillWaitBrackets[I]->getMaxVGPR(); J++) {
				int PredRegScore = KillWaitBrackets[I]->getRegScore(J, T);
				if (PredRegScore <= PredLB)
				continue;
				int NewRegScore = PredScale + PredRegScore;
				ScoreBrackets->setRegScore(
				J, T, std::max(ScoreBrackets->getRegScore(J, T), NewRegScore));
				}
				// Also need to merge sgpr scores for lgkm_cnt.
				if (T == LGKM_CNT) {
				for (int J = 0; J <= KillWaitBrackets[I]->getMaxSGPR(); J++) {
				int PredRegScore =
				KillWaitBrackets[I]->getRegScore(J + NUM_ALL_VGPRS, LGKM_CNT);
				if (PredRegScore <= PredLB)
				continue;
				int NewRegScore = PredScale + PredRegScore;
				ScoreBrackets->setRegScore(
				J + NUM_ALL_VGPRS, LGKM_CNT,
				std::max(
				ScoreBrackets->getRegScore(J + NUM_ALL_VGPRS, LGKM_CNT),
				NewRegScore));
				}
				}
				}
				}

				// Also merge the WaitEvent information.
				ForAllWaitEventType(W) {
				enum InstCounterType T = KillWaitBrackets[I]->eventCounter(W);
				int PredEventUB = KillWaitBrackets[I]->getEventUB(W);
				if (PredEventUB > KillWaitBrackets[I]->getScoreLB(T)) {
				int NewEventUB =
				MaxPending[T] + PredEventUB - KillWaitBrackets[I]->getScoreUB(T);
				if (NewEventUB > 0) {
				ScoreBrackets->setEventUB(
				W, std::max(ScoreBrackets->getEventUB(W), NewEventUB));
				}
				}
				}
				}
				}

				// Special case handling of GDS_GPR_LOCK and EXP_GPR_LOCK. Merge this for the
				// sequencing predecessors, because changes to EXEC require waitcnts due to
				// the delayed nature of these operations.
				for (MachineBasicBlock *Pred : Block.predecessors()) {
				if (BlockVisitedSet.find(Pred) == BlockVisitedSet.end()) {
				break;
				}

				BlockWaitcntBrackets *PredScoreBrackets =
				BlockWaitcntBracketsMap[Pred].get();

				int pred_gds_ub = PredScoreBrackets->getEventUB(GDS_GPR_LOCK);
				if (pred_gds_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {
				int new_gds_ub = MaxPending[EXP_CNT] + pred_gds_ub -
				PredScoreBrackets->getScoreUB(EXP_CNT);
				if (new_gds_ub > 0) {
				ScoreBrackets->setEventUB(
				GDS_GPR_LOCK,
				std::max(ScoreBrackets->getEventUB(GDS_GPR_LOCK), new_gds_ub));
				}
				}
				int pred_exp_ub = PredScoreBrackets->getEventUB(EXP_GPR_LOCK);
				if (pred_exp_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {
				int new_exp_ub = MaxPending[EXP_CNT] + pred_exp_ub -
				PredScoreBrackets->getScoreUB(EXP_CNT);
				if (new_exp_ub > 0) {
				ScoreBrackets->setEventUB(
				EXP_GPR_LOCK,
				std::max(ScoreBrackets->getEventUB(EXP_GPR_LOCK), new_exp_ub));
				}
				}
				}
				}
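
The register-score merge in mergeInputScoreBrackets rebases each predecessor's scores so that its upper bound lines up with the merged upper bound, then takes the per-register maximum. The following standalone sketch shows that arithmetic for a single counter. `PredState` and `mergeRegScores` are hypothetical names, the flat vector stands in for the pass's register scoreboard, and the kill brackets, event upper bounds and pending-flat state are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of one predecessor's state for a single counter.
struct PredState {
  int LB;                    // lower bound of the predecessor's bracket
  int UB;                    // upper bound of the predecessor's bracket
  std::vector<int> RegScore; // per-register score in the predecessor
};

// Merges predecessor register scores into a successor whose bracket is
// [0, MaxPending], where MaxPending is the widest predecessor span.
std::vector<int> mergeRegScores(const std::vector<PredState> &Preds,
                                int NumRegs) {
  int MaxPending = 0;
  for (const PredState &P : Preds)
    MaxPending = std::max(MaxPending, P.UB - P.LB);

  std::vector<int> Merged(NumRegs, 0);
  for (const PredState &P : Preds) {
    assert(static_cast<int>(P.RegScore.size()) == NumRegs);
    if (P.LB >= P.UB)
      continue; // nothing pending in this predecessor
    // Shift so that the predecessor's UB maps onto MaxPending.
    int Scale = MaxPending - P.UB;
    for (int R = 0; R < NumRegs; ++R) {
      if (P.RegScore[R] <= P.LB)
        continue; // already complete in the predecessor
      Merged[R] = std::max(Merged[R], Scale + P.RegScore[R]);
    }
  }
  return Merged;
}
```

With one predecessor at bracket (10, 13] and scores {11, 13, 5}, and another at (0, 1] with scores {0, 1, 1}, the merged bracket spans 3 events and the merged scores are {1, 3, 3}: the most recent event of each predecessor lands on the merged upper bound.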

				/// Return the "bottom" block of a loop. This differs from
				/// MachineLoop::getBottomBlock in that it works even if the loop is
				/// discontiguous.
				MachineBasicBlock *SIInsertWaitcnts::loopBottom(const MachineLoop *Loop) {
				MachineBasicBlock *Bottom = Loop->getHeader();
				for (MachineBasicBlock *MBB : Loop->blocks())
				if (MBB->getNumber() > Bottom->getNumber())
				Bottom = MBB;
				return Bottom;
				}

				// Generate s_waitcnt instructions where needed.
				void SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
				MachineBasicBlock &Block) {
				// Initialize the state information.
				mergeInputScoreBrackets(Block);

				BlockWaitcntBrackets *ScoreBrackets = BlockWaitcntBracketsMap[&Block].get();

				DEBUG({
				dbgs() << "Block" << Block.getNumber();
				ScoreBrackets->dump();
				});

				bool InsertNOP = false;

				// Walk over the instructions.
				for (MachineBasicBlock::iterator Iter = Block.begin(), E = Block.end();
				Iter != E;) {
				MachineInstr &Inst = *Iter;
				// Remove any previously existing waitcnts.
				if (Inst.getOpcode() == AMDGPU::S_WAITCNT) {
				// TODO: Register the old waitcnt and optimize the following waitcnts.
				// Leaving the previously existing waitcnts is conservatively correct.
				if (CompilerGeneratedWaitcntSet.find(&Inst) ==
				CompilerGeneratedWaitcntSet.end())
				++Iter;
				else {
				ScoreBrackets->setWaitcnt(&Inst);
				++Iter;
				Inst.removeFromParent();
				}
				continue;
				}

				// Check to see if this is an S_BARRIER, and if an implicit S_WAITCNT 0
				// occurs before the instruction. Doing it here prevents any additional
				// S_WAITCNTs from being emitted if the instruction was marked as
				// requiring a WAITCNT beforehand.
				if (Inst.getOpcode() == AMDGPU::S_BARRIER &&
				ST->needWaitcntBeforeBarrier()) {
				ScoreBrackets->updateByWait(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));
				ScoreBrackets->updateByWait(EXP_CNT, ScoreBrackets->getScoreUB(EXP_CNT));
				ScoreBrackets->updateByWait(LGKM_CNT,
				ScoreBrackets->getScoreUB(LGKM_CNT));
				}

				// Kill instructions generate a conditional branch to the endmain block.
				// Merge the current waitcnt state into the endmain block information.
				// TODO: Are there other flavors of KILL instruction?
				if (Inst.getOpcode() == AMDGPU::KILL) {
				addKillWaitBracket(ScoreBrackets);
				}

				bool VCCZBugWorkAround = false;
				if (readsVCCZ(Inst)) {
				if (ScoreBrackets->getScoreLB(LGKM_CNT) <
				ScoreBrackets->getScoreUB(LGKM_CNT) &&
				ScoreBrackets->hasPendingSMEM()) {
				#if 0
				// TODO: Enable this assert and fix the scheduler.
				// Shader Compiler assert is also independent of target.
				// If you hit this, it most likely means that a S_LOAD_DWORDX was issued
				// between a def of vcc and the consumer of vccz/vccnz. This is not
				// expected to happen in practice because the 'wave_cf' phase (which is
				// where all of the uses of vccz get generated) runs after
				// the 'pre_ra_scheduler' phase.
				assert(0 && !"SMRD instruction could retire during the live range of VCCZ, "
				"therefore Hardware Bug 457939 could be triggered" );
				#endif
				if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS)
				VCCZBugWorkAround = true;
				}
				}

				// Generate an s_waitcnt instruction to be placed before
				// cur_Inst, if needed.
				MachineInstr *SWaitInst = generateSWaitCntInstBefore(Inst, ScoreBrackets);

				if (SWaitInst) {
				Block.insert(Inst, SWaitInst);
				if (ScoreBrackets->getWaitcnt() != SWaitInst) {
				DEBUG(dbgs() << "insertWaitcntInBlock\n"
				<< "Old Instr: " << Inst << '\n'
				<< "New Instr: " << *SWaitInst << '\n';);
				}
				}

				updateEventWaitCntAfter(Inst, ScoreBrackets);

				#if 0 // TODO: implement resource type check controlled by options with ub = LB.
				// If this instruction generates a S_SETVSKIP because it is an
				// indexed resource, and we are on Tahiti, then it will also force
				// an S_WAITCNT vmcnt(0)
				if (RequireCheckResourceType(Inst, context)) {
				// Force the score to as if an S_WAITCNT vmcnt(0) is emitted.
				ScoreBrackets->setScoreLB(VM_CNT,
				ScoreBrackets->getScoreUB(VM_CNT));
				}
				#endif

				ScoreBrackets->clearWaitcnt();

				if (SWaitInst) {
				DEBUG({ SWaitInst->print(dbgs() << '\n'); });
				}
				DEBUG({
				Inst.print(dbgs());
				ScoreBrackets->dump();
				});

				// Check to see if this is a GWS instruction. If so, and if this is CI or
				// VI, then the generated code sequence will include an S_WAITCNT 0.
				// TODO: Are these the only GWS instructions?
				if (Inst.getOpcode() == AMDGPU::DS_GWS_INIT ||
				Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_V ||
				Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_BR ||
				Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_P ||
				Inst.getOpcode() == AMDGPU::DS_GWS_BARRIER) {
				// TODO: && context->target_info->GwsRequiresMemViolTest() ) {
				ScoreBrackets->updateByWait(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));
				ScoreBrackets->updateByWait(EXP_CNT, ScoreBrackets->getScoreUB(EXP_CNT));
				ScoreBrackets->updateByWait(LGKM_CNT,
				ScoreBrackets->getScoreUB(LGKM_CNT));
				}

				// TODO: Remove this work-around after fixing the scheduler and enable the
				// assert above.
				if (VCCZBugWorkAround) {
				// Restore the vccz bit. Any time a value is written to vcc, the vcc
				// bit is updated, so we can restore the bit by reading the value of
				// vcc and then writing it back to the register.
				BuildMI(Block, Inst, Inst.getDebugLoc(), TII->get(AMDGPU::S_MOV_B64),
				AMDGPU::VCC)
				.addReg(AMDGPU::VCC);
				}

				if (ST->getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) {

				// This avoids a s_nop after a waitcnt has just been inserted.
				if (!SWaitInst && InsertNOP) {
				[Inline comment, Done] arsenm: Braces around the block and addImm on next line
				BuildMI(Block, Inst, DebugLoc(), TII->get(AMDGPU::S_NOP))
				.addImm(0);
				}
				InsertNOP = false;
				[Inline comment, Done] kzhuravl: @rampitec wrote: Need to check for XNACK support.
				[Inline comment, Done] rampitec: Need to check for XNACK support.

				// Any occurrence of consecutive VMEM or SMEM instructions forms a VMEM
				// or SMEM clause, respectively.
				//
				// The temporary workaround is to break the clauses with S_NOP.
				//
				// The proper solution would be to allocate registers such that all source
				// and destination registers don't overlap, e.g. this is illegal:
				// r0 = load r2
				// r2 = load r0
				bool IsSMEM = false;
				bool IsVMEM = false;
				if (TII->isSMRD(Inst))
				IsSMEM = true;
				else if (TII->usesVM_CNT(Inst))
				IsVMEM = true;

				++Iter;
				if (Iter == E)
				break;

				MachineInstr &Next = *Iter;

				// TODO: How about consecutive SMEM instructions?
				// The comment above says to break the clause, but the code does not.
				// if ((TII->isSMRD(next) && isSMEM) ||
				if (!IsSMEM && TII->usesVM_CNT(Next) && IsVMEM &&
				// TODO: Enable this check when hasSoftClause is upstreamed.
				// ST->hasSoftClauses() &&
				ST->isXNACKEnabled()) {
				// Insert a NOP to break the clause.
				InsertNOP = true;
				continue;
				}

				// There must be "S_NOP 0" between an instruction writing M0 and
				// S_SENDMSG.
				if ((Next.getOpcode() == AMDGPU::S_SENDMSG ||
				Next.getOpcode() == AMDGPU::S_SENDMSGHALT) &&
				Inst.definesRegister(AMDGPU::M0))
				InsertNOP = true;

				continue;
				}

				++Iter;
				}

				// Check if we need to force convergence at loop footer.
				MachineLoop *ContainingLoop = MLI->getLoopFor(&Block);
				if (ContainingLoop && loopBottom(ContainingLoop) == &Block) {
				LoopWaitcntData *WaitcntData = LoopWaitcntDataMap[ContainingLoop].get();
				WaitcntData->print();
				DEBUG(dbgs() << '\n';);

				// The iterative waitcnt insertion algorithm aims for optimal waitcnt
				// placement and doesn't always guarantee convergence for a loop. Each
				// loop should take at most 2 iterations for it to converge naturally.
				// When this max is reached and result doesn't converge, we force
				// convergence by inserting a s_waitcnt at the end of loop footer.
				if (WaitcntData->getIterCnt() > 2) {
				// To ensure convergence, need to make wait events at loop footer be no
				// more than those from the previous iteration.
				// As a simplification, instead of tracking individual scores and
				// generating the precise wait count, just wait on 0.
				bool HasPending = false;
				MachineInstr *SWaitInst = WaitcntData->getWaitcnt();
				for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
				T = (enum InstCounterType)(T + 1)) {
				if (ScoreBrackets->getScoreUB(T) > ScoreBrackets->getScoreLB(T)) {
				ScoreBrackets->setScoreLB(T, ScoreBrackets->getScoreUB(T));
				HasPending = true;
				}
				}
				[Inline comment, Not Done] arsenm: Use BuildMI rather than the low level instruction creation APIs

				if (HasPending) {
				if (!SWaitInst) {
				SWaitInst = Block.getParent()->CreateMachineInstr(
				TII->get(AMDGPU::S_WAITCNT), DebugLoc());
				CompilerGeneratedWaitcntSet.insert(SWaitInst);
				const MachineOperand &Op = MachineOperand::CreateImm(0);
				SWaitInst->addOperand(MF, Op);
				#if 0 // TODO: Format the debug output
				OutputTransformBanner("insertWaitcntInBlock",0,"Create:",context);
				OutputTransformAdd(SWaitInst, context);
				#endif
				}
				#if 0 // TODO: ??
				_DEV( REPORTED_STATS->force_waitcnt_converge = 1; )
				#endif
				}

				if (SWaitInst) {
				DEBUG({
				SWaitInst->print(dbgs());
				dbgs() << "\nAdjusted score board:";
				ScoreBrackets->dump();
				});

				// Add this waitcnt to the block. It is either newly created, or it was
				// created in a previous iteration and is added back because block
				// traversal always removes the waitcnt.
				insertWaitcntBeforeCF(Block, SWaitInst);
				WaitcntData->setWaitcnt(SWaitInst);
				}
				}
				}
				}

				bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
				ST = &MF.getSubtarget<SISubtarget>();
				TII = ST->getInstrInfo();
				TRI = &TII->getRegisterInfo();
				MRI = &MF.getRegInfo();
				MLI = &getAnalysis<MachineLoopInfo>();
				IV = AMDGPU::IsaInfo::getIsaVersion(ST->getFeatureBits());
				AMDGPUASI = ST->getAMDGPUAS();

				HardwareLimits.VmcntMax = AMDGPU::getVmcntBitMask(IV);
				HardwareLimits.ExpcntMax = AMDGPU::getExpcntBitMask(IV);
				HardwareLimits.LgkmcntMax = AMDGPU::getLgkmcntBitMask(IV);

				HardwareLimits.NumVGPRsMax = ST->getAddressableNumVGPRs();
				HardwareLimits.NumSGPRsMax = ST->getAddressableNumSGPRs();
				assert(HardwareLimits.NumVGPRsMax <= SQ_MAX_PGM_VGPRS);
				assert(HardwareLimits.NumSGPRsMax <= SQ_MAX_PGM_SGPRS);

				RegisterEncoding.VGPR0 = TRI->getEncodingValue(AMDGPU::VGPR0);
				RegisterEncoding.VGPRL = RegisterEncoding.VGPR0 + HardwareLimits.NumVGPRsMax - 1;
				RegisterEncoding.SGPR0 = TRI->getEncodingValue(AMDGPU::SGPR0);
				RegisterEncoding.SGPRL = RegisterEncoding.SGPR0 + HardwareLimits.NumSGPRsMax - 1;

				// Walk over the blocks in reverse post-order, inserting
				// s_waitcnt where needed.
				ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
				bool Modified = false;
				for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator
				I = RPOT.begin(),
				E = RPOT.end(), J = RPOT.begin();
				I != E;) {
				MachineBasicBlock &MBB = **I;

				BlockVisitedSet.insert(&MBB);

				BlockWaitcntBrackets *ScoreBrackets = BlockWaitcntBracketsMap[&MBB].get();
				if (!ScoreBrackets) {
				BlockWaitcntBracketsMap[&MBB] = make_unique<BlockWaitcntBrackets>();
				ScoreBrackets = BlockWaitcntBracketsMap[&MBB].get();
				}
				ScoreBrackets->setPostOrder(MBB.getNumber());
				MachineLoop *ContainingLoop = MLI->getLoopFor(&MBB);
				if (ContainingLoop && LoopWaitcntDataMap[ContainingLoop] == nullptr)
				LoopWaitcntDataMap[ContainingLoop] = make_unique<LoopWaitcntData>();

				// If we are walking into the block from before the loop, then guarantee
				// at least 1 re-walk over the loop to propagate the information, even if
				// no S_WAITCNT instructions were generated.
				if (ContainingLoop && ContainingLoop->getTopBlock() == &MBB && J < I &&
				(BlockWaitcntProcessedSet.find(&MBB) ==
				BlockWaitcntProcessedSet.end())) {
				BlockWaitcntBracketsMap[&MBB]->setRevisitLoop(true);
				DEBUG(dbgs() << "set-revisit: block"
				<< ContainingLoop->getTopBlock()->getNumber() << '\n';);
				}

				// Walk over the instructions.
				insertWaitcntInBlock(MF, MBB);

				// Flag that waitcnts have been processed at least once.
				BlockWaitcntProcessedSet.insert(&MBB);

				// See if we want to revisit the loop.
				if (ContainingLoop && loopBottom(ContainingLoop) == &MBB) {
				MachineBasicBlock *EntryBB = ContainingLoop->getTopBlock();
				BlockWaitcntBrackets *EntrySB = BlockWaitcntBracketsMap[EntryBB].get();
				if (EntrySB && EntrySB->getRevisitLoop()) {
				EntrySB->setRevisitLoop(false);
				J = I;
				int32_t PostOrder = EntrySB->getPostOrder();
				// TODO: Avoid this loop. Find another way to set I.
				for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator
				X = RPOT.begin(),
				Y = RPOT.end();
				X != Y; ++X) {
				MachineBasicBlock &MBBX = **X;
				if (MBBX.getNumber() == PostOrder) {
				I = X;
				break;
				}
				}
				LoopWaitcntData *WaitcntData = LoopWaitcntDataMap[ContainingLoop].get();
				WaitcntData->incIterCnt();
				DEBUG(dbgs() << "revisit: block" << EntryBB->getNumber() << '\n';);
				continue;
				} else {
				LoopWaitcntData *WaitcntData = LoopWaitcntDataMap[ContainingLoop].get();
				// Loop converged, reset iteration count. If this loop gets revisited,
				// it must be from an outer loop, the counter will restart, this will
				// ensure we don't force convergence on such revisits.
				WaitcntData->resetIterCnt();
				}
				}

				J = I;
				++I;
				}

				  SmallVector<MachineBasicBlock *, 4> EndPgmBlocks;

				  bool HaveScalarStores = false;

				  for (MachineFunction::iterator BI = MF.begin(), BE = MF.end(); BI != BE;
				       ++BI) {

				    MachineBasicBlock &MBB = *BI;

				    for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end(); I != E;
				         ++I) {

				      if (!HaveScalarStores && TII->isScalarStore(*I))
				        HaveScalarStores = true;

				      if (I->getOpcode() == AMDGPU::S_ENDPGM ||
				          I->getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG)
				        EndPgmBlocks.push_back(&MBB);
				    }
				  }

				  if (HaveScalarStores) {
				    // If scalar writes are used, the cache must be flushed or else the next
				    // wave to reuse the same scratch memory can be clobbered.
				    //
				    // Insert s_dcache_wb at wave termination points if there were any scalar
				    // stores, and only if the cache hasn't already been flushed. This could be
				    // improved by looking across blocks for flushes in postdominating blocks
				    // from the stores but an explicitly requested flush is probably very rare.
				    for (MachineBasicBlock *MBB : EndPgmBlocks) {
				      bool SeenDCacheWB = false;

				      for (MachineBasicBlock::iterator I = MBB->begin(), E = MBB->end(); I != E;
				           ++I) {

				        if (I->getOpcode() == AMDGPU::S_DCACHE_WB)
				          SeenDCacheWB = true;
				        else if (TII->isScalarStore(*I))
				          SeenDCacheWB = false;

				        // FIXME: It would be better to insert this before a waitcnt if any.
				        if ((I->getOpcode() == AMDGPU::S_ENDPGM ||
				             I->getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG) &&
				            !SeenDCacheWB) {
				          Modified = true;
				          BuildMI(*MBB, I, I->getDebugLoc(), TII->get(AMDGPU::S_DCACHE_WB));
				        }
				      }
				    }
				  }

				  return Modified;
				}
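
Editor's note: the cache-flush scan at the end of the function above can be modelled without any LLVM types. The following is a minimal sketch, not the patch's code: `Op`, `Block`, and `flushBeforeEnd` are hypothetical names invented for illustration, and a block is assumed to be a plain vector of opcodes. It mirrors the two steps in the source: first detect whether any scalar store exists, then insert a write-back before each end-of-program instruction unless a write-back already follows the last scalar store in that block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for machine opcodes; these are not the LLVM enums.
enum class Op { ScalarStore, DCacheWB, EndPgm, Other };

using Block = std::vector<Op>;

// Inserts a DCacheWB before every EndPgm that is not already covered by a
// DCacheWB seen earlier in the same block with no scalar store in between.
// Does nothing when the function contains no scalar stores at all.
// Returns true if any instruction was inserted.
bool flushBeforeEnd(std::vector<Block> &Blocks) {
  bool HaveScalarStores = false;
  for (const Block &B : Blocks)
    for (Op O : B)
      if (O == Op::ScalarStore)
        HaveScalarStores = true;
  if (!HaveScalarStores)
    return false;

  bool Modified = false;
  for (Block &B : Blocks) {
    bool SeenWB = false;
    for (std::size_t I = 0; I < B.size(); ++I) {
      if (B[I] == Op::DCacheWB) {
        SeenWB = true;
      } else if (B[I] == Op::ScalarStore) {
        SeenWB = false; // a later store invalidates the earlier flush
      } else if (B[I] == Op::EndPgm && !SeenWB) {
        B.insert(B.begin() + static_cast<std::ptrdiff_t>(I), Op::DCacheWB);
        ++I; // the EndPgm moved one slot to the right; step onto it
        Modified = true;
      }
    }
  }
  return Modified;
}
```

As in the source, the check is per block only, so a flush in a different block than the end-of-program instruction does not suppress the insertion.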

test/CodeGen/AMDGPU/basic-branch.ll


	; GCN-LABEL: {{^}}test_brcc_i1:			; GCN-LABEL: {{^}}test_brcc_i1:
	; GCN: buffer_load_ubyte			; GCN: buffer_load_ubyte
	; GCN: v_and_b32_e32 v{{[0-9]+}}, 1,			; GCN: v_and_b32_e32 v{{[0-9]+}}, 1,
	; GCN: v_cmp_eq_u32_e32 vcc,			; GCN: v_cmp_eq_u32_e32 vcc,
	; GCN: s_cbranch_vccnz [[END:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_vccnz [[END:BB[0-9]+_[0-9]+]]

	; GCN: buffer_store_dword			; GCN: buffer_store_dword
	; GCNOPT-NEXT: s_waitcnt vmcnt(0) expcnt(0)
	; TODO: This waitcnt can be eliminated

	; GCN: {{^}}[[END]]:			; GCN: {{^}}[[END]]:
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @test_brcc_i1(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in, i1 %val) #0 {			define amdgpu_kernel void @test_brcc_i1(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in, i1 %val) #0 {
	%cmp0 = icmp ne i1 %val, 0			%cmp0 = icmp ne i1 %val, 0
	br i1 %cmp0, label %store, label %end			br i1 %cmp0, label %store, label %end

	store:			store:
	store i32 222, i32 addrspace(1)* %out			store i32 222, i32 addrspace(1)* %out
	ret void			ret void

	end:			end:
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

test/CodeGen/AMDGPU/branch-condition-and.ll

	; GCN-DAG: v_cmp_lt_f32_e32 vcc,			; GCN-DAG: v_cmp_lt_f32_e32 vcc,
	; GCN: s_and_b64 [[AND:s\[[0-9]+:[0-9]+\]]], vcc, [[OTHERCC]]			; GCN: s_and_b64 [[AND:s\[[0-9]+:[0-9]+\]]], vcc, [[OTHERCC]]
	; GCN: s_and_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]], [[AND]]			; GCN: s_and_saveexec_b64 [[SAVED:s\[[0-9]+:[0-9]+\]]], [[AND]]
	; GCN: s_xor_b64 {{s\[[0-9]+:[0-9]+\]}}, exec, [[SAVED]]			; GCN: s_xor_b64 {{s\[[0-9]+:[0-9]+\]}}, exec, [[SAVED]]
	; GCN: ; mask branch [[BB5:BB[0-9]+_[0-9]+]]			; GCN: ; mask branch [[BB5:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: BB{{[0-9]+_[0-9]+}}: ; %bb4			; GCN-NEXT: BB{{[0-9]+_[0-9]+}}: ; %bb4
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: s_waitcnt

	; GCN-NEXT: [[BB5]]			; GCN: [[BB5]]
	; GCN: s_or_b64 exec, exec			; GCN: s_or_b64 exec, exec
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	; GCN-NEXT: .Lfunc_end			; GCN-NEXT: .Lfunc_end
	define amdgpu_ps void @ham(float %arg, float %arg1) #0 {			define amdgpu_ps void @ham(float %arg, float %arg1) #0 {
	bb:			bb:
	%tmp = fcmp ogt float %arg, 0.000000e+00			%tmp = fcmp ogt float %arg, 0.000000e+00
	%tmp2 = fcmp ogt float %arg1, 0.000000e+00			%tmp2 = fcmp ogt float %arg1, 0.000000e+00
	%tmp3 = and i1 %tmp, %tmp2			%tmp3 = and i1 %tmp, %tmp2

test/CodeGen/AMDGPU/branch-relaxation.ll

	; GCN-NEXT: s_getpc_b64 vcc			; GCN-NEXT: s_getpc_b64 vcc
	; GCN-NEXT: s_add_u32 vcc_lo, vcc_lo, [[BB3:BB[0-9]_[0-9]+]]-([[LONG_JUMP0]]+4)			; GCN-NEXT: s_add_u32 vcc_lo, vcc_lo, [[BB3:BB[0-9]_[0-9]+]]-([[LONG_JUMP0]]+4)
	; GCN-NEXT: s_addc_u32 vcc_hi, vcc_hi, 0{{$}}			; GCN-NEXT: s_addc_u32 vcc_hi, vcc_hi, 0{{$}}
	; GCN-NEXT: s_setpc_b64 vcc			; GCN-NEXT: s_setpc_b64 vcc

	; GCN-NEXT: [[BB2]]: ; %bb2			; GCN-NEXT: [[BB2]]: ; %bb2
	; GCN: v_mov_b32_e32 [[BB2_K:v[0-9]+]], 17			; GCN: v_mov_b32_e32 [[BB2_K:v[0-9]+]], 17
	; GCN: buffer_store_dword [[BB2_K]]			; GCN: buffer_store_dword [[BB2_K]]
	; GCN: s_waitcnt vmcnt(0)

	; GCN-NEXT: [[LONG_JUMP1:BB[0-9]+_[0-9]+]]: ; %bb2			; GCN-NEXT: [[LONG_JUMP1:BB[0-9]+_[0-9]+]]: ; %bb2
	; GCN-NEXT: s_getpc_b64 vcc			; GCN-NEXT: s_getpc_b64 vcc
	; GCN-NEXT: s_add_u32 vcc_lo, vcc_lo, [[BB4:BB[0-9]_[0-9]+]]-([[LONG_JUMP1]]+4)			; GCN-NEXT: s_add_u32 vcc_lo, vcc_lo, [[BB4:BB[0-9]_[0-9]+]]-([[LONG_JUMP1]]+4)
	; GCN-NEXT: s_addc_u32 vcc_hi, vcc_hi, 0{{$}}			; GCN-NEXT: s_addc_u32 vcc_hi, vcc_hi, 0{{$}}
	; GCN-NEXT: s_setpc_b64 vcc			; GCN-NEXT: s_setpc_b64 vcc

	; GCN: [[BB3]]: ; %bb3			; GCN: [[BB3]]: ; %bb3

	; GCN-NEXT: [[IF]]: ; %if			; GCN-NEXT: [[IF]]: ; %if
	; GCN: buffer_store_dword			; GCN: buffer_store_dword
	; GCN: s_cmp_lg_u32			; GCN: s_cmp_lg_u32
	; GCN: s_cbranch_scc1 [[ENDIF]]			; GCN: s_cbranch_scc1 [[ENDIF]]

	; GCN-NEXT: ; BB#2: ; %if_uniform			; GCN-NEXT: ; BB#2: ; %if_uniform
	; GCN: buffer_store_dword			; GCN: buffer_store_dword
	; GCN: s_waitcnt vmcnt(0)

	; GCN-NEXT: [[ENDIF]]: ; %endif			; GCN-NEXT: [[ENDIF]]: ; %endif
	; GCN-NEXT: s_or_b64 exec, exec, [[MASK]]			; GCN-NEXT: s_or_b64 exec, exec, [[MASK]]
	; GCN-NEXT: s_sleep 5			; GCN-NEXT: s_sleep 5
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	define amdgpu_kernel void @uniform_inside_divergent(i32 addrspace(1)* %out, i32 %cond) #0 {			define amdgpu_kernel void @uniform_inside_divergent(i32 addrspace(1)* %out, i32 %cond) #0 {
	entry:			entry:
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()

test/CodeGen/AMDGPU/control-flow-fastregalloc.ll

	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:8 ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:8 ; 4-byte Folded Spill

	; Spill load			; Spill load
	; GCN: buffer_store_dword [[LOAD0]], off, s[0:3], s7 offset:[[LOAD0_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[LOAD0]], off, s[0:3], s7 offset:[[LOAD0_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}			; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}

	; GCN: s_waitcnt vmcnt(0) expcnt(0)
	; GCN: mask branch [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN: mask branch [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: {{^}}BB{{[0-9]+}}_1: ; %if			; GCN: {{^}}BB{{[0-9]+}}_1: ; %if
	; GCN: s_mov_b32 m0, -1			; GCN: s_mov_b32 m0, -1
	; GCN: ds_read_b32 [[LOAD1:v[0-9]+]]			; GCN: ds_read_b32 [[LOAD1:v[0-9]+]]
				; GCN: s_waitcnt lgkmcnt(0)
	; GCN: buffer_load_dword [[RELOAD_LOAD0:v[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword [[RELOAD_LOAD0:v[0-9]+]], off, s[0:3], s7 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: s_waitcnt vmcnt(0)

	; Spill val register			; Spill val register
	; GCN: v_add_i32_e32 [[VAL:v[0-9]+]], vcc, [[LOAD1]], [[RELOAD_LOAD0]]			; GCN: v_add_i32_e32 [[VAL:v[0-9]+]], vcc, [[LOAD1]], [[RELOAD_LOAD0]]
	; GCN: buffer_store_dword [[VAL]], off, s[0:3], s7 offset:[[VAL_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[VAL]], off, s[0:3], s7 offset:[[VAL_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN: s_waitcnt vmcnt(0)

	; VMEM: [[ENDIF]]:			; VMEM: [[ENDIF]]:
	; Reload and restore exec mask			; Reload and restore exec mask
				; VGPR: s_waitcnt lgkmcnt(0)
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]



	; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload			; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload
	; VMEM: s_waitcnt vmcnt(0)			; VMEM: s_waitcnt vmcnt(0)
	; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC_LO]]			; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC_LO]]

	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:20 ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:20 ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:24 ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:24 ; 4-byte Folded Spill

	; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}			; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}

	; GCN: s_waitcnt vmcnt(0) expcnt(0)
	; GCN-NEXT: ; mask branch [[END:BB[0-9]+_[0-9]+]]			; GCN-NEXT: ; mask branch [[END:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: s_cbranch_execz [[END]]			; GCN-NEXT: s_cbranch_execz [[END]]


	; GCN: [[LOOP:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP:BB[0-9]+_[0-9]+]]:
	; GCN: buffer_load_dword v[[VAL_LOOP_RELOAD:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[VAL_LOOP_RELOAD:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload
	; GCN: v_subrev_i32_e32 [[VAL_LOOP:v[0-9]+]], vcc, v{{[0-9]+}}, v[[VAL_LOOP_RELOAD]]			; GCN: v_subrev_i32_e32 [[VAL_LOOP:v[0-9]+]], vcc, v{{[0-9]+}}, v[[VAL_LOOP_RELOAD]]
	; GCN: v_cmp_ne_u32_e32 vcc,			; GCN: v_cmp_ne_u32_e32 vcc,
	; GCN: s_and_b64 vcc, exec, vcc			; GCN: s_and_b64 vcc, exec, vcc
	; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], s7 offset:[[VAL_SUB_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], s7 offset:[[VAL_SUB_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN: s_waitcnt vmcnt(0) expcnt(0)
	; GCN-NEXT: s_cbranch_vccnz [[LOOP]]			; GCN-NEXT: s_cbranch_vccnz [[LOOP]]


	; GCN: [[END]]:			; GCN: [[END]]:
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]

	; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:20 ; 4-byte Folded Reload			; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:20 ; 4-byte Folded Reload
	; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]

	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: s_mov_b64 exec, [[CMP0]]			; GCN: s_mov_b64 exec, [[CMP0]]
	; GCN: s_waitcnt vmcnt(0) expcnt(0)

	; FIXME: It makes no sense to put this skip here			; FIXME: It makes no sense to put this skip here
	; GCN-NEXT: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]			; GCN-NEXT: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]
	; GCN: s_cbranch_execz [[FLOW]]			; GCN: s_cbranch_execz [[FLOW]]
	; GCN-NEXT: s_branch [[ELSE:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_branch [[ELSE:BB[0-9]+_[0-9]+]]

	; GCN: [[FLOW]]: ; %Flow			; GCN: [[FLOW]]: ; %Flow
	; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]

	; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_LO:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_LO]]			; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_LO:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_LO]]
	; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_LO]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_HI:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_HI]]			; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_HI:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_HI]]
	; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_HI]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: buffer_store_dword [[FLOW_VAL]], off, s[0:3], s7 offset:[[RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[FLOW_VAL]], off, s[0:3], s7 offset:[[RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN: s_xor_b64 exec, exec, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO]]:[[FLOW_S_RELOAD_SAVEEXEC_HI]]{{\]}}			; GCN: s_xor_b64 exec, exec, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO]]:[[FLOW_S_RELOAD_SAVEEXEC_HI]]{{\]}}
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0)
	; GCN-NEXT: ; mask branch [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN-NEXT: ; mask branch [[ENDIF:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: s_cbranch_execz [[ENDIF]]			; GCN-NEXT: s_cbranch_execz [[ENDIF]]


	; GCN: BB{{[0-9]+}}_2: ; %if			; GCN: BB{{[0-9]+}}_2: ; %if
	; GCN: ds_read_b32			; GCN: ds_read_b32
	; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload
	; GCN: v_add_i32_e32 [[ADD:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]			; GCN: v_add_i32_e32 [[ADD:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]
	; GCN: buffer_store_dword [[ADD]], off, s[0:3], s7 offset:[[RESULT_OFFSET]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[ADD]], off, s[0:3], s7 offset:[[RESULT_OFFSET]] ; 4-byte Folded Spill
	; GCN: s_waitcnt vmcnt(0) expcnt(0)
	; GCN-NEXT: s_branch [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_branch [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: [[ELSE]]: ; %else			; GCN: [[ELSE]]: ; %else
	; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], s7 offset:4 ; 4-byte Folded Reload
	; GCN: v_subrev_i32_e32 [[SUB:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]			; GCN: v_subrev_i32_e32 [[SUB:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]
	; GCN: buffer_store_dword [[ADD]], off, s[0:3], s7 offset:[[FLOW_RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[ADD]], off, s[0:3], s7 offset:[[FLOW_RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN: s_waitcnt vmcnt(0) expcnt(0)
	; GCN-NEXT: s_branch [[FLOW]]			; GCN-NEXT: s_branch [[FLOW]]

	; GCN: [[ENDIF]]:			; GCN: [[ENDIF]]:
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_HI_LANE]]


	; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_LO_OFFSET]] ; 4-byte Folded Reload			; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], s7 offset:[[FLOW_SAVEEXEC_LO_OFFSET]] ; 4-byte Folded Reload

test/CodeGen/AMDGPU/indirect-addressing-si.ll

	}			}

	; GCN-LABEL: {{^}}extract_neg_offset_vgpr:			; GCN-LABEL: {{^}}extract_neg_offset_vgpr:
	; The offset depends on the register that holds the first element of the vector.			; The offset depends on the register that holds the first element of the vector.

	; FIXME: The waitcnt for the argument load can go after the loop			; FIXME: The waitcnt for the argument load can go after the loop
	; IDXMODE: s_set_gpr_idx_on 0, src0			; IDXMODE: s_set_gpr_idx_on 0, src0
	; GCN: s_mov_b64 s{{\[[0-9]+:[0-9]+\]}}, exec			; GCN: s_mov_b64 s{{\[[0-9]+:[0-9]+\]}}, exec
	; GCN: s_waitcnt lgkmcnt(0)			; GCN: [[LOOPBB:BB[0-9]+_[0-9]+]]:

	; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]], v{{[0-9]+}}			; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]], v{{[0-9]+}}

	; MOVREL: s_add_i32 m0, [[READLANE]], 0xfffffe0			; MOVREL: s_add_i32 m0, [[READLANE]], 0xfffffe0
	; MOVREL: s_and_saveexec_b64 vcc, vcc			; MOVREL: s_and_saveexec_b64 vcc, vcc
	; MOVREL: v_movrels_b32_e32 [[RESULT:v[0-9]+]], v1			; MOVREL: v_movrels_b32_e32 [[RESULT:v[0-9]+]], v1

	; IDXMODE: s_addk_i32 [[ADD_IDX:s[0-9]+]], 0xfe00			; IDXMODE: s_addk_i32 [[ADD_IDX:s[0-9]+]], 0xfe00
	; IDXMODE: s_set_gpr_idx_idx [[ADD_IDX]]			; IDXMODE: s_set_gpr_idx_idx [[ADD_IDX]]
	; The offset depends on the register that holds the first element of the vector.			; The offset depends on the register that holds the first element of the vector.

	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT0:v[0-9]+]], 1{{$}}			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT0:v[0-9]+]], 1{{$}}
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], 2{{$}}			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], 2{{$}}
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT2:v[0-9]+]], 3{{$}}			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT2:v[0-9]+]], 3{{$}}
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 4{{$}}			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 4{{$}}

	; GCN: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec			; GCN: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec
	; GCN: s_waitcnt lgkmcnt(0)

	; GCN: [[LOOPBB:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOPBB:BB[0-9]+_[0-9]+]]:
	; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]]			; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]]

	; MOVREL: s_add_i32 m0, [[READLANE]], 0xfffffe00			; MOVREL: s_add_i32 m0, [[READLANE]], 0xfffffe00
	; MOVREL: s_and_saveexec_b64 vcc, vcc			; MOVREL: s_and_saveexec_b64 vcc, vcc
	; MOVREL: v_movreld_b32_e32 [[VEC_ELT0]], 5			; MOVREL: v_movreld_b32_e32 [[VEC_ELT0]], 5

	; IDXMODE: s_addk_i32 [[ADD_IDX:s[0-9]+]], 0xfe00{{$}}			; IDXMODE: s_addk_i32 [[ADD_IDX:s[0-9]+]], 0xfe00{{$}}
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], 2{{$}}			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], 2{{$}}
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT2:v[0-9]+]], 3{{$}}			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT2:v[0-9]+]], 3{{$}}
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 4{{$}}			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 4{{$}}
	; GCN-DAG: v_mov_b32_e32 [[VAL:v[0-9]+]], 0x1f4{{$}}			; GCN-DAG: v_mov_b32_e32 [[VAL:v[0-9]+]], 0x1f4{{$}}

	; IDXMODE: s_set_gpr_idx_on 0, dst			; IDXMODE: s_set_gpr_idx_on 0, dst

	; GCN: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec			; GCN: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec
	; GCN: s_waitcnt lgkmcnt(0)

	; The offset depends on the register that holds the first element of the vector.			; The offset depends on the register that holds the first element of the vector.
	; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]]			; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]]

	; MOVREL: s_add_i32 m0, [[READLANE]], -16			; MOVREL: s_add_i32 m0, [[READLANE]], -16
	; MOVREL: v_movreld_b32_e32 [[VEC_ELT0]], [[VAL]]			; MOVREL: v_movreld_b32_e32 [[VEC_ELT0]], [[VAL]]

	; IDXMODE: s_add_i32 [[ADD_IDX:s[0-9]+]], [[READLANE]], -16			; IDXMODE: s_add_i32 [[ADD_IDX:s[0-9]+]], [[READLANE]], -16
	; GCN-DAG: s_mov_b32 [[S_ELT1:s[0-9]+]], 9			; GCN-DAG: s_mov_b32 [[S_ELT1:s[0-9]+]], 9
	; GCN-DAG: s_mov_b32 [[S_ELT0:s[0-9]+]], 7			; GCN-DAG: s_mov_b32 [[S_ELT0:s[0-9]+]], 7
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT0:v[0-9]+]], [[S_ELT0]]			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT0:v[0-9]+]], [[S_ELT0]]
	; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], [[S_ELT1]]			; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], [[S_ELT1]]

	; IDXMODE: s_set_gpr_idx_on 0, src0			; IDXMODE: s_set_gpr_idx_on 0, src0

	; GCN: s_mov_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], exec			; GCN: s_mov_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], exec
	; GCN: s_waitcnt vmcnt(0)

	; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:
				; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]			; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]			; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]

	; MOVREL: s_mov_b32 m0, [[READLANE]]			; MOVREL: s_mov_b32 m0, [[READLANE]]
	; MOVREL: s_and_saveexec_b64 vcc, vcc			; MOVREL: s_and_saveexec_b64 vcc, vcc
	; MOVREL: v_movrels_b32_e32 [[MOVREL0:v[0-9]+]], [[VEC_ELT0]]			; MOVREL: v_movrels_b32_e32 [[MOVREL0:v[0-9]+]], [[VEC_ELT0]]

	; IDXMODE: s_set_gpr_idx_idx [[READLANE]]			; IDXMODE: s_set_gpr_idx_idx [[READLANE]]
	; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT3:[0-9]+]], s[[S_ELT3]]			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT3:[0-9]+]], s[[S_ELT3]]
	; GCN: v_mov_b32_e32 v[[VEC_ELT2:[0-9]+]], s{{[0-9]+}}			; GCN: v_mov_b32_e32 v[[VEC_ELT2:[0-9]+]], s{{[0-9]+}}
	; GCN: v_mov_b32_e32 v[[VEC_ELT1:[0-9]+]], s{{[0-9]+}}			; GCN: v_mov_b32_e32 v[[VEC_ELT1:[0-9]+]], s{{[0-9]+}}
	; GCN: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]			; GCN: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]

	; IDXMODE: s_set_gpr_idx_on 0, dst			; IDXMODE: s_set_gpr_idx_on 0, dst

	; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:
				; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]			; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]			; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]

	; MOVREL: s_mov_b32 m0, [[READLANE]]			; MOVREL: s_mov_b32 m0, [[READLANE]]
	; MOVREL: s_and_saveexec_b64 vcc, vcc			; MOVREL: s_and_saveexec_b64 vcc, vcc
	; MOVREL-NEXT: v_movreld_b32_e32 v[[VEC_ELT0]], [[INS0]]			; MOVREL-NEXT: v_movreld_b32_e32 v[[VEC_ELT0]], [[INS0]]

	; IDXMODE: s_set_gpr_idx_idx [[READLANE]]			; IDXMODE: s_set_gpr_idx_idx [[READLANE]]

test/CodeGen/AMDGPU/infinite-loop.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s

	; SI-LABEL: {{^}}infinite_loop:			; SI-LABEL: {{^}}infinite_loop:
	; SI: v_mov_b32_e32 [[REG:v[0-9]+]], 0x3e7			; SI: v_mov_b32_e32 [[REG:v[0-9]+]], 0x3e7
	; SI: BB0_1:			; SI: BB0_1:
				; SI: s_waitcnt lgkmcnt(0)
	; SI: buffer_store_dword [[REG]]			; SI: buffer_store_dword [[REG]]
	; SI: s_waitcnt vmcnt(0) expcnt(0)
	; SI: s_branch BB0_1			; SI: s_branch BB0_1
	define amdgpu_kernel void @infinite_loop(i32 addrspace(1)* %out) {			define amdgpu_kernel void @infinite_loop(i32 addrspace(1)* %out) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	store i32 999, i32 addrspace(1)* %out, align 4			store i32 999, i32 addrspace(1)* %out, align 4
	br label %for.body			br label %for.body
	}			}

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.format.ll

main_body:		main_body:
call void @llvm.amdgcn.buffer.store.format.v4f32(<4 x float> %1, <4 x i32> %0, i32 %3, i32 %2, i1 0, i1 0)		call void @llvm.amdgcn.buffer.store.format.v4f32(<4 x float> %1, <4 x i32> %0, i32 %3, i32 %2, i1 0, i1 0)
ret void		ret void
}		}

; Ideally, the register allocator would avoid the wait here		; Ideally, the register allocator would avoid the wait here
;		;
;CHECK-LABEL: {{^}}buffer_store_wait:		;CHECK-LABEL: {{^}}buffer_store_wait:
;CHECK: buffer_store_format_xyzw v[0:3], v4, s[0:3], 0 idxen		;CHECK: buffer_store_format_xyzw v[0:3], v4, s[0:3], 0 idxen
;CHECK: s_waitcnt vmcnt(0) expcnt(0)		;CHECK: s_waitcnt expcnt(0)
;CHECK: buffer_load_format_xyzw v[0:3], v5, s[0:3], 0 idxen		;CHECK: buffer_load_format_xyzw v[0:3], v5, s[0:3], 0 idxen
;CHECK: s_waitcnt vmcnt(0)		;CHECK: s_waitcnt vmcnt(0)
;CHECK: buffer_store_format_xyzw v[0:3], v6, s[0:3], 0 idxen		;CHECK: buffer_store_format_xyzw v[0:3], v6, s[0:3], 0 idxen
define amdgpu_ps void @buffer_store_wait(<4 x i32> inreg, <4 x float>, i32, i32, i32) {		define amdgpu_ps void @buffer_store_wait(<4 x i32> inreg, <4 x float>, i32, i32, i32) {
main_body:		main_body:
call void @llvm.amdgcn.buffer.store.format.v4f32(<4 x float> %1, <4 x i32> %0, i32 %2, i32 0, i1 0, i1 0)		call void @llvm.amdgcn.buffer.store.format.v4f32(<4 x float> %1, <4 x i32> %0, i32 %2, i32 0, i1 0, i1 0)
%data = call <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32> %0, i32 %3, i32 0, i1 0, i1 0)		%data = call <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32> %0, i32 %3, i32 0, i1 0, i1 0)
call void @llvm.amdgcn.buffer.store.format.v4f32(<4 x float> %data, <4 x i32> %0, i32 %4, i32 0, i1 0, i1 0)		call void @llvm.amdgcn.buffer.store.format.v4f32(<4 x float> %data, <4 x i32> %0, i32 %4, i32 0, i1 0, i1 0)

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.ll

main_body:		main_body:
call void @llvm.amdgcn.buffer.store.v4f32(<4 x float> %1, <4 x i32> %0, i32 %3, i32 %2, i1 0, i1 0)		call void @llvm.amdgcn.buffer.store.v4f32(<4 x float> %1, <4 x i32> %0, i32 %3, i32 %2, i1 0, i1 0)
ret void		ret void
}		}

; Ideally, the register allocator would avoid the wait here		; Ideally, the register allocator would avoid the wait here
;		;
;CHECK-LABEL: {{^}}buffer_store_wait:		;CHECK-LABEL: {{^}}buffer_store_wait:
;CHECK: buffer_store_dwordx4 v[0:3], v4, s[0:3], 0 idxen		;CHECK: buffer_store_dwordx4 v[0:3], v4, s[0:3], 0 idxen
;CHECK: s_waitcnt vmcnt(0) expcnt(0)		;CHECK: s_waitcnt expcnt(0)
;CHECK: buffer_load_dwordx4 v[0:3], v5, s[0:3], 0 idxen		;CHECK: buffer_load_dwordx4 v[0:3], v5, s[0:3], 0 idxen
;CHECK: s_waitcnt vmcnt(0)		;CHECK: s_waitcnt vmcnt(0)
;CHECK: buffer_store_dwordx4 v[0:3], v6, s[0:3], 0 idxen		;CHECK: buffer_store_dwordx4 v[0:3], v6, s[0:3], 0 idxen
define amdgpu_ps void @buffer_store_wait(<4 x i32> inreg, <4 x float>, i32, i32, i32) {		define amdgpu_ps void @buffer_store_wait(<4 x i32> inreg, <4 x float>, i32, i32, i32) {
main_body:		main_body:
call void @llvm.amdgcn.buffer.store.v4f32(<4 x float> %1, <4 x i32> %0, i32 %2, i32 0, i1 0, i1 0)		call void @llvm.amdgcn.buffer.store.v4f32(<4 x float> %1, <4 x i32> %0, i32 %2, i32 0, i1 0, i1 0)
%data = call <4 x float> @llvm.amdgcn.buffer.load.v4f32(<4 x i32> %0, i32 %3, i32 0, i1 0, i1 0)		%data = call <4 x float> @llvm.amdgcn.buffer.load.v4f32(<4 x i32> %0, i32 %3, i32 0, i1 0, i1 0)
call void @llvm.amdgcn.buffer.store.v4f32(<4 x float> %data, <4 x i32> %0, i32 %4, i32 0, i1 0, i1 0)		call void @llvm.amdgcn.buffer.store.v4f32(<4 x float> %data, <4 x i32> %0, i32 %4, i32 0, i1 0, i1 0)

test/CodeGen/AMDGPU/llvm.amdgcn.ds.bpermute.ll

	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s

	declare i32 @llvm.amdgcn.ds.bpermute(i32, i32) #0			declare i32 @llvm.amdgcn.ds.bpermute(i32, i32) #0

	; FUNC-LABEL: {{^}}ds_bpermute:			; FUNC-LABEL: {{^}}ds_bpermute:
	; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}			; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
	; CHECK: s_waitcnt lgkmcnt
	define amdgpu_kernel void @ds_bpermute(i32 addrspace(1)* %out, i32 %index, i32 %src) nounwind {			define amdgpu_kernel void @ds_bpermute(i32 addrspace(1)* %out, i32 %index, i32 %src) nounwind {
	%bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 %index, i32 %src) #0			%bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 %index, i32 %src) #0
	store i32 %bpermute, i32 addrspace(1)* %out, align 4			store i32 %bpermute, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; CHECK-LABEL: {{^}}ds_bpermute_imm_offset:			; CHECK-LABEL: {{^}}ds_bpermute_imm_offset:
	; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:4			; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:4
	; CHECK: s_waitcnt lgkmcnt
	define amdgpu_kernel void @ds_bpermute_imm_offset(i32 addrspace(1)* %out, i32 %base_index, i32 %src) nounwind {			define amdgpu_kernel void @ds_bpermute_imm_offset(i32 addrspace(1)* %out, i32 %base_index, i32 %src) nounwind {
	%index = add i32 %base_index, 4			%index = add i32 %base_index, 4
	%bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 %index, i32 %src) #0			%bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 %index, i32 %src) #0
	store i32 %bpermute, i32 addrspace(1)* %out, align 4			store i32 %bpermute, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; CHECK-LABEL: {{^}}ds_bpermute_imm_index:			; CHECK-LABEL: {{^}}ds_bpermute_imm_index:
	; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:64			; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:64
	; CHECK: s_waitcnt lgkmcnt
	define amdgpu_kernel void @ds_bpermute_imm_index(i32 addrspace(1)* %out, i32 %base_index, i32 %src) nounwind {			define amdgpu_kernel void @ds_bpermute_imm_index(i32 addrspace(1)* %out, i32 %base_index, i32 %src) nounwind {
	%bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 64, i32 %src) #0			%bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 64, i32 %src) #0
	store i32 %bpermute, i32 addrspace(1)* %out, align 4			store i32 %bpermute, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	attributes #0 = { nounwind readnone convergent }			attributes #0 = { nounwind readnone convergent }

test/CodeGen/AMDGPU/llvm.amdgcn.ds.permute.ll

	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s

	declare i32 @llvm.amdgcn.ds.permute(i32, i32) #0			declare i32 @llvm.amdgcn.ds.permute(i32, i32) #0

	; CHECK-LABEL: {{^}}ds_permute:			; CHECK-LABEL: {{^}}ds_permute:
	; CHECK: ds_permute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}			; CHECK: ds_permute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
	; CHECK: s_waitcnt lgkmcnt
	define amdgpu_kernel void @ds_permute(i32 addrspace(1)* %out, i32 %index, i32 %src) nounwind {			define amdgpu_kernel void @ds_permute(i32 addrspace(1)* %out, i32 %index, i32 %src) nounwind {
	%bpermute = call i32 @llvm.amdgcn.ds.permute(i32 %index, i32 %src) #0			%bpermute = call i32 @llvm.amdgcn.ds.permute(i32 %index, i32 %src) #0
	store i32 %bpermute, i32 addrspace(1)* %out, align 4			store i32 %bpermute, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; CHECK-LABEL: {{^}}ds_permute_imm_offset:			; CHECK-LABEL: {{^}}ds_permute_imm_offset:
	; CHECK: ds_permute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:4			; CHECK: ds_permute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:4
	; CHECK: s_waitcnt lgkmcnt
	define amdgpu_kernel void @ds_permute_imm_offset(i32 addrspace(1)* %out, i32 %base_index, i32 %src) nounwind {			define amdgpu_kernel void @ds_permute_imm_offset(i32 addrspace(1)* %out, i32 %base_index, i32 %src) nounwind {
	%index = add i32 %base_index, 4			%index = add i32 %base_index, 4
	%bpermute = call i32 @llvm.amdgcn.ds.permute(i32 %index, i32 %src) #0			%bpermute = call i32 @llvm.amdgcn.ds.permute(i32 %index, i32 %src) #0
	store i32 %bpermute, i32 addrspace(1)* %out, align 4			store i32 %bpermute, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	attributes #0 = { nounwind readnone convergent }			attributes #0 = { nounwind readnone convergent }

test/CodeGen/AMDGPU/llvm.amdgcn.ds.swizzle.ll

	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=hawaii -verify-machineinstrs < %s | FileCheck %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=hawaii -verify-machineinstrs < %s | FileCheck %s
	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s

	declare i32 @llvm.amdgcn.ds.swizzle(i32, i32) #0			declare i32 @llvm.amdgcn.ds.swizzle(i32, i32) #0

	; FUNC-LABEL: {{^}}ds_swizzle:			; FUNC-LABEL: {{^}}ds_swizzle:
	; CHECK: ds_swizzle_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:100			; CHECK: ds_swizzle_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:100
	; CHECK: s_waitcnt lgkmcnt
	define amdgpu_kernel void @ds_swizzle(i32 addrspace(1)* %out, i32 %src) nounwind {			define amdgpu_kernel void @ds_swizzle(i32 addrspace(1)* %out, i32 %src) nounwind {
	%swizzle = call i32 @llvm.amdgcn.ds.swizzle(i32 %src, i32 100) #0			%swizzle = call i32 @llvm.amdgcn.ds.swizzle(i32 %src, i32 100) #0
	store i32 %swizzle, i32 addrspace(1)* %out, align 4			store i32 %swizzle, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	attributes #0 = { nounwind readnone convergent }			attributes #0 = { nounwind readnone convergent }

test/CodeGen/AMDGPU/llvm.amdgcn.image.ll

[124 lines hidden]	main_body:
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r0, float %r1, float %r2, float %r3, i1 true, i1 true) #0		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r0, float %r1, float %r2, float %r3, i1 true, i1 true) #0
ret void		ret void
}		}

; Ideally, the register allocator would avoid the wait here		; Ideally, the register allocator would avoid the wait here
;		;
; GCN-LABEL: {{^}}image_store_wait:		; GCN-LABEL: {{^}}image_store_wait:
; GCN: image_store v[0:3], v4, s[0:7] dmask:0xf unorm		; GCN: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; GCN: s_waitcnt vmcnt(0) expcnt(0)		; GCN: s_waitcnt expcnt(0)
; GCN: image_load v[0:3], v4, s[8:15] dmask:0xf unorm		; GCN: image_load v[0:3], v4, s[8:15] dmask:0xf unorm
; GCN: s_waitcnt vmcnt(0)		; GCN: s_waitcnt vmcnt(0)
; GCN: image_store v[0:3], v4, s[16:23] dmask:0xf unorm		; GCN: image_store v[0:3], v4, s[16:23] dmask:0xf unorm
define amdgpu_ps void @image_store_wait(<8 x i32> inreg %arg, <8 x i32> inreg %arg1, <8 x i32> inreg %arg2, <4 x float> %arg3, i32 %arg4) #0 {		define amdgpu_ps void @image_store_wait(<8 x i32> inreg %arg, <8 x i32> inreg %arg1, <8 x i32> inreg %arg2, <4 x float> %arg3, i32 %arg4) #0 {
main_body:		main_body:
call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %arg3, i32 %arg4, <8 x i32> %arg, i32 15, i1 false, i1 false, i1 false, i1 false)		call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %arg3, i32 %arg4, <8 x i32> %arg, i32 15, i1 false, i1 false, i1 false, i1 false)
%data = call <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32 %arg4, <8 x i32> %arg1, i32 15, i1 false, i1 false, i1 false, i1 false)		%data = call <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32 %arg4, <8 x i32> %arg1, i32 15, i1 false, i1 false, i1 false, i1 false)
call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %data, i32 %arg4, <8 x i32> %arg2, i32 15, i1 false, i1 false, i1 false, i1 false)		call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %data, i32 %arg4, <8 x i32> %arg2, i32 15, i1 false, i1 false, i1 false, i1 false)
[39 lines hidden]

test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.inv.ll

	[14 lines hidden]
	}			}

	; GCN-LABEL: {{^}}test_s_dcache_inv_insert_wait:			; GCN-LABEL: {{^}}test_s_dcache_inv_insert_wait:
	; GCN-NEXT: ; BB#0:			; GCN-NEXT: ; BB#0:
	; GCN: s_dcache_inv			; GCN: s_dcache_inv
	; GCN: s_waitcnt lgkmcnt(0) ; encoding			; GCN: s_waitcnt lgkmcnt(0) ; encoding
	define amdgpu_kernel void @test_s_dcache_inv_insert_wait() #0 {			define amdgpu_kernel void @test_s_dcache_inv_insert_wait() #0 {
	call void @llvm.amdgcn.s.dcache.inv()			call void @llvm.amdgcn.s.dcache.inv()
	call void @llvm.amdgcn.s.waitcnt(i32 0)			call void @llvm.amdgcn.s.waitcnt(i32 127)
	br label %end			br label %end

	end:			end:
	store volatile i32 3, i32 addrspace(1)* undef			store volatile i32 3, i32 addrspace(1)* undef
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.inv.vol.ll

	[14 lines hidden]
	}			}

	; GCN-LABEL: {{^}}test_s_dcache_inv_vol_insert_wait:			; GCN-LABEL: {{^}}test_s_dcache_inv_vol_insert_wait:
	; GCN-NEXT: ; BB#0:			; GCN-NEXT: ; BB#0:
	; GCN-NEXT: s_dcache_inv_vol			; GCN-NEXT: s_dcache_inv_vol
	; GCN: s_waitcnt lgkmcnt(0) ; encoding			; GCN: s_waitcnt lgkmcnt(0) ; encoding
	define amdgpu_kernel void @test_s_dcache_inv_vol_insert_wait() #0 {			define amdgpu_kernel void @test_s_dcache_inv_vol_insert_wait() #0 {
	call void @llvm.amdgcn.s.dcache.inv.vol()			call void @llvm.amdgcn.s.dcache.inv.vol()
	call void @llvm.amdgcn.s.waitcnt(i32 0)			call void @llvm.amdgcn.s.waitcnt(i32 127)
	br label %end			br label %end

	end:			end:
	store volatile i32 3, i32 addrspace(1)* undef			store volatile i32 3, i32 addrspace(1)* undef
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.wb.ll

	[12 lines hidden]
	}			}

	; VI-LABEL: {{^}}test_s_dcache_wb_insert_wait:			; VI-LABEL: {{^}}test_s_dcache_wb_insert_wait:
	; VI-NEXT: ; BB#0:			; VI-NEXT: ; BB#0:
	; VI-NEXT: s_dcache_wb			; VI-NEXT: s_dcache_wb
	; VI: s_waitcnt lgkmcnt(0) ; encoding			; VI: s_waitcnt lgkmcnt(0) ; encoding
	define amdgpu_kernel void @test_s_dcache_wb_insert_wait() #0 {			define amdgpu_kernel void @test_s_dcache_wb_insert_wait() #0 {
	call void @llvm.amdgcn.s.dcache.wb()			call void @llvm.amdgcn.s.dcache.wb()
	call void @llvm.amdgcn.s.waitcnt(i32 0)			call void @llvm.amdgcn.s.waitcnt(i32 127)
	br label %end			br label %end

	end:			end:
	store volatile i32 3, i32 addrspace(1)* undef			store volatile i32 3, i32 addrspace(1)* undef
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

test/CodeGen/AMDGPU/llvm.amdgcn.s.dcache.wb.vol.ll

	[12 lines hidden]
	}			}

	; VI-LABEL: {{^}}test_s_dcache_wb_vol_insert_wait:			; VI-LABEL: {{^}}test_s_dcache_wb_vol_insert_wait:
	; VI-NEXT: ; BB#0:			; VI-NEXT: ; BB#0:
	; VI-NEXT: s_dcache_wb_vol			; VI-NEXT: s_dcache_wb_vol
	; VI: s_waitcnt lgkmcnt(0) ; encoding			; VI: s_waitcnt lgkmcnt(0) ; encoding
	define amdgpu_kernel void @test_s_dcache_wb_vol_insert_wait() #0 {			define amdgpu_kernel void @test_s_dcache_wb_vol_insert_wait() #0 {
	call void @llvm.amdgcn.s.dcache.wb.vol()			call void @llvm.amdgcn.s.dcache.wb.vol()
	call void @llvm.amdgcn.s.waitcnt(i32 0)			call void @llvm.amdgcn.s.waitcnt(i32 127)
	br label %end			br label %end

	end:			end:
	store volatile i32 3, i32 addrspace(1)* undef			store volatile i32 3, i32 addrspace(1)* undef
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

test/CodeGen/AMDGPU/llvm.amdgcn.s.waitcnt.ll

[12 lines hidden]	define amdgpu_ps void @test1(<8 x i32> inreg %rsrc, <4 x float> %d0, <4 x float> %d1, i32 %c0, i32 %c1) {
ret void		ret void
}		}

; Test that the intrinsic is merged with automatically generated waits and		; Test that the intrinsic is merged with automatically generated waits and
; emitted as late as possible.		; emitted as late as possible.
;		;
; CHECK-LABEL: {{^}}test2:		; CHECK-LABEL: {{^}}test2:
; CHECK: image_load		; CHECK: image_load
; CHECK-NOT: s_waitcnt vmcnt(0){{$}}		; CHECK-NEXT: s_waitcnt
; CHECK: s_waitcnt		; CHECK: s_waitcnt vmcnt(0){{$}}
; CHECK-NEXT: image_store		; CHECK-NEXT: image_store
define amdgpu_ps void @test2(<8 x i32> inreg %rsrc, i32 %c) {		define amdgpu_ps void @test2(<8 x i32> inreg %rsrc, i32 %c) {
%t = call <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32 %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)		%t = call <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32 %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
call void @llvm.amdgcn.s.waitcnt(i32 3840) ; 0xf00		call void @llvm.amdgcn.s.waitcnt(i32 3840) ; 0xf00
%c.1 = mul i32 %c, 2		%c.1 = mul i32 %c, 2
call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %t, i32 %c.1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)		call void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float> %t, i32 %c.1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
ret void		ret void
}		}

declare void @llvm.amdgcn.s.waitcnt(i32) #0		declare void @llvm.amdgcn.s.waitcnt(i32) #0

declare <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #1		declare <4 x float> @llvm.amdgcn.image.load.v4f32.i32.v8i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #1
declare void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float>, i32, <8 x i32>, i32, i1, i1, i1, i1) #0		declare void @llvm.amdgcn.image.store.v4f32.i32.v8i32(<4 x float>, i32, <8 x i32>, i32, i1, i1, i1, i1) #0

attributes #0 = { nounwind }		attributes #0 = { nounwind }
attributes #1 = { nounwind readonly }		attributes #1 = { nounwind readonly }
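
The immediates passed to `llvm.amdgcn.s.waitcnt` in these tests (`3840`, i.e. `0xf00`, in `test2`; `0` and `127` in the `s.dcache.*` tests) are easier to read once split into counters. The sketch below is a minimal illustration, not code from this patch. It assumes the field layout used for the targets exercised here (vmcnt in bits 3:0, expcnt in bits 6:4, lgkmcnt in bits 11:8) and that a counter left at its maximum value means "do not wait". `Waitcnt`, `decodeWaitcnt`, `describe`, and `checkExamples` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed field layout (see lead-in): vmcnt = bits [3:0],
// expcnt = bits [6:4], lgkmcnt = bits [11:8].
struct Waitcnt {
  unsigned vmcnt;
  unsigned expcnt;
  unsigned lgkmcnt;
};

// Hypothetical helper, not part of LLVM: split an s_waitcnt immediate
// into its three counters.
constexpr Waitcnt decodeWaitcnt(uint16_t imm) {
  return Waitcnt{imm & 0xFu, (imm >> 4) & 0x7u, (imm >> 8) & 0xFu};
}

// Hypothetical helper: list only the counters that are actually waited on.
// A counter at its maximum (15, 7, 15) is treated as "no wait" and omitted.
std::string describe(uint16_t imm) {
  const Waitcnt w = decodeWaitcnt(imm);
  std::string out;
  if (w.vmcnt != 0xFu)
    out += "vmcnt(" + std::to_string(w.vmcnt) + ") ";
  if (w.expcnt != 0x7u)
    out += "expcnt(" + std::to_string(w.expcnt) + ") ";
  if (w.lgkmcnt != 0xFu)
    out += "lgkmcnt(" + std::to_string(w.lgkmcnt) + ") ";
  if (!out.empty())
    out.pop_back(); // drop the trailing space
  return out;
}

// The three immediates that appear in the tests above.
void checkExamples() {
  assert(describe(3840) == "vmcnt(0) expcnt(0)");          // 0xf00
  assert(describe(127) == "lgkmcnt(0)");                   // 0x07f
  assert(describe(0) == "vmcnt(0) expcnt(0) lgkmcnt(0)");  // 0x000
}
```

Under these assumptions, `127` requests a wait on lgkmcnt only, which is consistent with the `s_waitcnt lgkmcnt(0)` check lines in the `s.dcache.*` tests, while `0` requests a wait on all three counters.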

test/CodeGen/AMDGPU/multi-divergent-exit-region.ll

	[356 lines hidden]

	; GCN: v_mov_b32_e32 v0, 2.0			; GCN: v_mov_b32_e32 v0, 2.0
	; GCN: s_or_b64 exec, exec			; GCN: s_or_b64 exec, exec
	; GCN: s_and_b64 exec, exec			; GCN: s_and_b64 exec, exec
	; GCN: v_mov_b32_e32 v0, 1.0			; GCN: v_mov_b32_e32 v0, 1.0

	; GCN: {{^BB[0-9]+_[0-9]+}}: ; %UnifiedReturnBlock			; GCN: {{^BB[0-9]+_[0-9]+}}: ; %UnifiedReturnBlock
	; GCN-NEXT: s_or_b64 exec, exec			; GCN-NEXT: s_or_b64 exec, exec
				; GCN-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GCN-NEXT: ; return			; GCN-NEXT: ; return

	define amdgpu_ps float @uniform_branch_to_multi_divergent_region_exit_ret_ret_return_value(i32 inreg %sgpr, i32 %vgpr) #0 {			define amdgpu_ps float @uniform_branch_to_multi_divergent_region_exit_ret_ret_return_value(i32 inreg %sgpr, i32 %vgpr) #0 {
	entry:			entry:
	%uniform.cond = icmp slt i32 %sgpr, 2			%uniform.cond = icmp slt i32 %sgpr, 2
	br i1 %uniform.cond, label %LeafBlock, label %LeafBlock1			br i1 %uniform.cond, label %LeafBlock, label %LeafBlock1

	LeafBlock: ; preds = %entry			LeafBlock: ; preds = %entry
	[338 lines hidden]

test/CodeGen/AMDGPU/ret_jump.ll

	[59 lines hidden]

	; GCN: ; BB#{{[0-9]+}}: ; %else			; GCN: ; BB#{{[0-9]+}}: ; %else
	; GCN: s_and_saveexec_b64 [[SAVE_EXEC:s\[[0-9]+:[0-9]+\]]], vcc			; GCN: s_and_saveexec_b64 [[SAVE_EXEC:s\[[0-9]+:[0-9]+\]]], vcc
	; GCN-NEXT: s_xor_b64 [[XOR_EXEC:s\[[0-9]+:[0-9]+\]]], exec, [[SAVE_EXEC]]			; GCN-NEXT: s_xor_b64 [[XOR_EXEC:s\[[0-9]+:[0-9]+\]]], exec, [[SAVE_EXEC]]
	; GCN-NEXT: ; mask branch [[FLOW1:BB[0-9]+_[0-9]+]]			; GCN-NEXT: ; mask branch [[FLOW1:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: ; %unreachable.bb			; GCN-NEXT: ; %unreachable.bb
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: s_waitcnt
	; GCN: ; divergent unreachable			; GCN: ; divergent unreachable

	; GCN: ; %ret.bb			; GCN: ; %ret.bb
	; GCN: store_dword			; GCN: store_dword

	; GCN: ; %UnifiedReturnBlock			; GCN: ; %UnifiedReturnBlock
	; GCN-NEXT: s_or_b64 exec, exec			; GCN-NEXT: s_or_b64 exec, exec
				; GCN-NEXT: s_waitcnt
	; GCN-NEXT: ; return			; GCN-NEXT: ; return
	; GCN-NEXT: .Lfunc_end			; GCN-NEXT: .Lfunc_end
	define amdgpu_ps <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @uniform_br_nontrivial_ret_divergent_br_nontrivial_unreachable([9 x <16 x i8>] addrspace(2)* byval %arg, [17 x <16 x i8>] addrspace(2)* byval %arg1, [17 x <8 x i32>] addrspace(2)* byval %arg2, i32 addrspace(2)* byval %arg3, float inreg %arg4, i32 inreg %arg5, <2 x i32> %arg6, <2 x i32> %arg7, <2 x i32> %arg8, <3 x i32> %arg9, <2 x i32> %arg10, <2 x i32> %arg11, <2 x i32> %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, i32 inreg %arg18, i32 %arg19, float %arg20, i32 %arg21) #0 {			define amdgpu_ps <{ i32, i32, i32, i32, i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @uniform_br_nontrivial_ret_divergent_br_nontrivial_unreachable([9 x <16 x i8>] addrspace(2)* byval %arg, [17 x <16 x i8>] addrspace(2)* byval %arg1, [17 x <8 x i32>] addrspace(2)* byval %arg2, i32 addrspace(2)* byval %arg3, float inreg %arg4, i32 inreg %arg5, <2 x i32> %arg6, <2 x i32> %arg7, <2 x i32> %arg8, <3 x i32> %arg9, <2 x i32> %arg10, <2 x i32> %arg11, <2 x i32> %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, i32 inreg %arg18, i32 %arg19, float %arg20, i32 %arg21) #0 {
	main_body:			main_body:
	%i.i = extractelement <2 x i32> %arg7, i32 0			%i.i = extractelement <2 x i32> %arg7, i32 0
	%j.i = extractelement <2 x i32> %arg7, i32 1			%j.i = extractelement <2 x i32> %arg7, i32 1
	%i.f.i = bitcast i32 %i.i to float			%i.f.i = bitcast i32 %i.i to float
	%j.f.i = bitcast i32 %j.i to float			%j.f.i = bitcast i32 %j.i to float
	[52 lines hidden]

test/CodeGen/AMDGPU/si-lower-control-flow-unreachable-block.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

	; GCN-LABEL: {{^}}lower_control_flow_unreachable_terminator:			; GCN-LABEL: {{^}}lower_control_flow_unreachable_terminator:
	; GCN: v_cmp_eq_u32			; GCN: v_cmp_eq_u32
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: s_xor_b64			; GCN: s_xor_b64
	; GCN: ; mask branch [[RET:BB[0-9]+_[0-9]+]]			; GCN: ; mask branch [[RET:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: BB{{[0-9]+_[0-9]+}}: ; %unreachable			; GCN-NEXT: BB{{[0-9]+_[0-9]+}}: ; %unreachable
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: ; divergent unreachable			; GCN: ; divergent unreachable
	; GCN: s_waitcnt

	; GCN-NEXT: [[RET]]: ; %UnifiedReturnBlock			; GCN-NEXT: [[RET]]: ; %UnifiedReturnBlock
	; GCN-NEXT: s_or_b64 exec, exec			; GCN-NEXT: s_or_b64 exec, exec
	; GCN: s_endpgm			; GCN: s_endpgm

	define amdgpu_kernel void @lower_control_flow_unreachable_terminator() #0 {			define amdgpu_kernel void @lower_control_flow_unreachable_terminator() #0 {
	bb:			bb:
	%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()			%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()
	[12 lines hidden]
	; GCN: v_cmp_ne_u32			; GCN: v_cmp_ne_u32
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: s_xor_b64			; GCN: s_xor_b64
	; GCN: ; mask branch [[RETURN:BB[0-9]+_[0-9]+]]			; GCN: ; mask branch [[RETURN:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %unreachable			; GCN-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %unreachable
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: ; divergent unreachable			; GCN: ; divergent unreachable
	; GCN: s_waitcnt

	; GCN: [[RETURN]]:			; GCN: [[RETURN]]:
	; GCN-NEXT: s_or_b64 exec, exec			; GCN-NEXT: s_or_b64 exec, exec
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	define amdgpu_kernel void @lower_control_flow_unreachable_terminator_swap_block_order() #0 {			define amdgpu_kernel void @lower_control_flow_unreachable_terminator_swap_block_order() #0 {
	bb:			bb:
	%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()			%tmp15 = tail call i32 @llvm.amdgcn.workitem.id.y()
	%tmp63 = icmp eq i32 %tmp15, 32			%tmp63 = icmp eq i32 %tmp15, 32
	[11 lines hidden]
	; GCN: s_cmp_lg_u32			; GCN: s_cmp_lg_u32
	; GCN: s_cbranch_scc0 [[UNREACHABLE:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_scc0 [[UNREACHABLE:BB[0-9]+_[0-9]+]]

	; GCN-NEXT: BB#{{[0-9]+}}: ; %ret			; GCN-NEXT: BB#{{[0-9]+}}: ; %ret
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm

	; GCN: [[UNREACHABLE]]:			; GCN: [[UNREACHABLE]]:
	; GCN: ds_write_b32			; GCN: ds_write_b32
	; GCN: s_waitcnt
	define amdgpu_kernel void @uniform_lower_control_flow_unreachable_terminator(i32 %arg0) #0 {			define amdgpu_kernel void @uniform_lower_control_flow_unreachable_terminator(i32 %arg0) #0 {
	bb:			bb:
	%tmp63 = icmp eq i32 %arg0, 32			%tmp63 = icmp eq i32 %arg0, 32
	br i1 %tmp63, label %unreachable, label %ret			br i1 %tmp63, label %unreachable, label %ret

	unreachable:			unreachable:
	store volatile i32 0, i32 addrspace(3)* undef, align 4			store volatile i32 0, i32 addrspace(3)* undef, align 4
	unreachable			unreachable
	[10 lines hidden]

test/CodeGen/AMDGPU/smrd-vccz-bug.ll

	; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s			; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
	; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s			; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=NOVCCZ-BUG %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=NOVCCZ-BUG %s

	; GCN-FUNC: {{^}}vccz_workaround:			; GCN-FUNC: {{^}}vccz_workaround:
	; GCN: s_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x0			; GCN: s_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x0
	; GCN: v_cmp_neq_f32_e64 vcc, s{{[0-9]+}}, 0{{$}}			; GCN: v_cmp_neq_f32_e64 vcc, s{{[0-9]+}}, 0{{$}}
	; GCN: s_waitcnt lgkmcnt(0)			; VCCZ-BUG: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; VCCZ-BUG: s_mov_b64 vcc, vcc			; VCCZ-BUG: s_mov_b64 vcc, vcc
	; NOVCCZ-BUG-NOT: s_mov_b64 vcc, vcc			; NOVCCZ-BUG-NOT: s_mov_b64 vcc, vcc
	; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]			; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]
	; GCN: buffer_store_dword			; GCN: buffer_store_dword
	; GCN: [[EXIT]]:			; GCN: [[EXIT]]:
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @vccz_workaround(i32 addrspace(2)* %in, i32 addrspace(1)* %out, float %cond) {			define amdgpu_kernel void @vccz_workaround(i32 addrspace(2)* %in, i32 addrspace(1)* %out, float %cond) {
	entry:			entry:
	[31 lines hidden]

test/CodeGen/AMDGPU/spill-m0.ll

	[12 lines hidden]
	; GCN-DAG: s_cmp_lg_u32			; GCN-DAG: s_cmp_lg_u32

	; TOVGPR-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0			; TOVGPR-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0
	; TOVGPR: v_writelane_b32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]], 0			; TOVGPR: v_writelane_b32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]], 0

	; TOVMEM-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0			; TOVMEM-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0
	; TOVMEM-DAG: v_mov_b32_e32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]]			; TOVMEM-DAG: v_mov_b32_e32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]]
	; TOVMEM: buffer_store_dword [[SPILL_VREG]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4 ; 4-byte Folded Spill			; TOVMEM: buffer_store_dword [[SPILL_VREG]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4 ; 4-byte Folded Spill
	; TOVMEM: s_waitcnt vmcnt(0)

	; TOSMEM-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0			; TOSMEM-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0
	; TOSMEM: s_add_u32 m0, s3, 0x100{{$}}			; TOSMEM: s_add_u32 m0, s3, 0x100{{$}}
	; TOSMEM-NOT: [[M0_COPY]]			; TOSMEM-NOT: [[M0_COPY]]
	; TOSMEM: s_buffer_store_dword [[M0_COPY]], s{{\[}}[[LO]]:[[HI]]], m0 ; 4-byte Folded Spill			; TOSMEM: s_buffer_store_dword [[M0_COPY]], s{{\[}}[[LO]]:[[HI]]], m0 ; 4-byte Folded Spill
	; TOSMEM: s_waitcnt lgkmcnt(0)

	; GCN: s_cbranch_scc1 [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_scc1 [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: [[ENDIF]]:			; GCN: [[ENDIF]]:
	; TOVGPR: v_readlane_b32 [[M0_RESTORE:s[0-9]+]], [[SPILL_VREG]], 0			; TOVGPR: v_readlane_b32 [[M0_RESTORE:s[0-9]+]], [[SPILL_VREG]], 0
	; TOVGPR: s_mov_b32 m0, [[M0_RESTORE]]			; TOVGPR: s_mov_b32 m0, [[M0_RESTORE]]

	; TOVMEM: buffer_load_dword [[RELOAD_VREG:v[0-9]+]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4 ; 4-byte Folded Reload			; TOVMEM: buffer_load_dword [[RELOAD_VREG:v[0-9]+]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4 ; 4-byte Folded Reload
	[178 lines hidden]

test/CodeGen/AMDGPU/valu-i1.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs -enable-misched -asm-verbose < %s | FileCheck -check-prefix=SI %s			; RUN: llc -march=amdgcn -verify-machineinstrs -enable-misched -asm-verbose < %s | FileCheck -check-prefix=SI %s

	declare i32 @llvm.amdgcn.workitem.id.x() nounwind readnone			declare i32 @llvm.amdgcn.workitem.id.x() nounwind readnone

	; SI-LABEL: {{^}}test_if:			; SI-LABEL: {{^}}test_if:
	; Make sure the i1 values created by the cfg structurizer pass are			; Make sure the i1 values created by the cfg structurizer pass are
	; moved using VALU instructions			; moved using VALU instructions


	; waitcnt should be inserted after exec modification			; waitcnt should be inserted after exec modification
	; SI: v_cmp_lt_i32_e32 vcc, 0,			; SI: v_cmp_lt_i32_e32 vcc, 0,
	; SI-NEXT: s_and_saveexec_b64 [[SAVE1:s\[[0-9]+:[0-9]+\]]], vcc			; SI-NEXT: s_and_saveexec_b64 [[SAVE1:s\[[0-9]+:[0-9]+\]]], vcc
	; SI-NEXT: s_xor_b64 [[SAVE2:s\[[0-9]+:[0-9]+\]]], exec, [[SAVE1]]			; SI-NEXT: s_xor_b64 [[SAVE2:s\[[0-9]+:[0-9]+\]]], exec, [[SAVE1]]
	; SI-NEXT: s_waitcnt lgkmcnt(0)
	; SI-NEXT: ; mask branch [[FLOW_BB:BB[0-9]+_[0-9]+]]			; SI-NEXT: ; mask branch [[FLOW_BB:BB[0-9]+_[0-9]+]]
	; SI-NEXT: s_cbranch_execz [[FLOW_BB]]			; SI-NEXT: s_cbranch_execz [[FLOW_BB]]

	; SI-NEXT: BB{{[0-9]+}}_1: ; %LeafBlock3			; SI-NEXT: BB{{[0-9]+}}_1: ; %LeafBlock3
	; SI-NOT: s_mov_b64 s[{{[0-9]:[0-9]}}], -1			; SI-NOT: s_mov_b64 s[{{[0-9]:[0-9]}}], -1
	; SI: v_mov_b32_e32 v{{[0-9]}}, -1			; SI: v_mov_b32_e32 v{{[0-9]}}, -1
	; SI: s_and_saveexec_b64			; SI: s_and_saveexec_b64
	; SI-NEXT: s_xor_b64			; SI-NEXT: s_xor_b64
	[44 lines hidden]
	; SI-LABEL: {{^}}simple_test_v_if:			; SI-LABEL: {{^}}simple_test_v_if:
	; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}			; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}
	; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]			; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]
	; SI: ; mask branch [[EXIT:BB[0-9]+_[0-9]+]]			; SI: ; mask branch [[EXIT:BB[0-9]+_[0-9]+]]

	; SI-NEXT: BB{{[0-9]+_[0-9]+}}:			; SI-NEXT: BB{{[0-9]+_[0-9]+}}:
	; SI: buffer_store_dword			; SI: buffer_store_dword
	; SI-NEXT: s_waitcnt

	; SI-NEXT: {{^}}[[EXIT]]:			; SI-NEXT: {{^}}[[EXIT]]:
	; SI: s_or_b64 exec, exec, [[BR_SREG]]			; SI: s_or_b64 exec, exec, [[BR_SREG]]
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @simple_test_v_if(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {			define amdgpu_kernel void @simple_test_v_if(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone			%tid = call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone
	%is.0 = icmp ne i32 %tid, 0			%is.0 = icmp ne i32 %tid, 0
	br i1 %is.0, label %then, label %exit			br i1 %is.0, label %then, label %exit
	[12 lines hidden]
	; SI-LABEL: {{^}}simple_test_v_if_ret_else_ret:			; SI-LABEL: {{^}}simple_test_v_if_ret_else_ret:
	; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}			; SI: v_cmp_ne_u32_e32 vcc, 0, v{{[0-9]+}}
	; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]			; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]
	; SI: ; mask branch [[EXIT:BB[0-9]+_[0-9]+]]			; SI: ; mask branch [[EXIT:BB[0-9]+_[0-9]+]]

	; SI-NEXT: BB{{[0-9]+_[0-9]+}}:			; SI-NEXT: BB{{[0-9]+_[0-9]+}}:
	; SI: buffer_store_dword			; SI: buffer_store_dword
	; SI-NEXT: s_waitcnt

	; SI-NEXT: {{^}}[[EXIT]]:			; SI-NEXT: {{^}}[[EXIT]]:
	; SI: s_or_b64 exec, exec, [[BR_SREG]]			; SI: s_or_b64 exec, exec, [[BR_SREG]]
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @simple_test_v_if_ret_else_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {			define amdgpu_kernel void @simple_test_v_if_ret_else_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%is.0 = icmp ne i32 %tid, 0			%is.0 = icmp ne i32 %tid, 0
	br i1 %is.0, label %then, label %exit			br i1 %is.0, label %then, label %exit
	[14 lines hidden]
	; SI-LABEL: {{^}}simple_test_v_if_ret_else_code_ret:			; SI-LABEL: {{^}}simple_test_v_if_ret_else_code_ret:
	; SI: v_cmp_eq_u32_e32 vcc, 0, v{{[0-9]+}}			; SI: v_cmp_eq_u32_e32 vcc, 0, v{{[0-9]+}}
	; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc			; SI: s_and_saveexec_b64 [[BR_SREG:s\[[0-9]+:[0-9]+\]]], vcc
	; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]			; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]
	; SI: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]			; SI: ; mask branch [[FLOW:BB[0-9]+_[0-9]+]]

	; SI-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %exit			; SI-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %exit
	; SI: ds_write_b32			; SI: ds_write_b32
	; SI: s_waitcnt

	; SI-NEXT: {{^}}[[FLOW]]:			; SI-NEXT: {{^}}[[FLOW]]:
	; SI-NEXT: s_or_saveexec_b64			; SI-NEXT: s_or_saveexec_b64
	; SI-NEXT: s_xor_b64 exec, exec			; SI-NEXT: s_xor_b64 exec, exec
	; SI-NEXT: ; mask branch [[UNIFIED_RETURN:BB[0-9]+_[0-9]+]]			; SI-NEXT: ; mask branch [[UNIFIED_RETURN:BB[0-9]+_[0-9]+]]

	; SI-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %then			; SI-NEXT: {{^BB[0-9]+_[0-9]+}}: ; %then
	; SI: buffer_store_dword			; SI: s_waitcnt
	; SI-NEXT: s_waitcnt			; SI-NEXT: buffer_store_dword

	; SI-NEXT: {{^}}[[UNIFIED_RETURN]]: ; %UnifiedReturnBlock			; SI-NEXT: {{^}}[[UNIFIED_RETURN]]: ; %UnifiedReturnBlock
	; SI: s_or_b64 exec, exec			; SI: s_or_b64 exec, exec
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @simple_test_v_if_ret_else_code_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {			define amdgpu_kernel void @simple_test_v_if_ret_else_code_ret(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%is.0 = icmp ne i32 %tid, 0			%is.0 = icmp ne i32 %tid, 0
	br i1 %is.0, label %then, label %exit			br i1 %is.0, label %then, label %exit
	[130 lines hidden]

Revision Contents

Diff 94310
