This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Suppress redundant waitcnt instrs
Closed · Public

Authored by msearles on Feb 2 2018, 10:16 AM.

Details

Reviewers

kzhuravl
rampitec

Commits

rG24c92eeb83ef: [AMDGPU] Suppress redundant waitcnt instrs.
rL324440: [AMDGPU] Suppress redundant waitcnt instrs.

Summary

Run the memory legalizer prior to the waitcnt pass; keep the policy that the waitcnt pass does not remove any waitcnts within the incoming IR.
The waitcnt pass doesn't (yet) track waitcnts that exist prior to the waitcnt pass (it just skips over them); because the waitcnt pass is ignorant of them, it may insert a redundant waitcnt. To avoid this, check the previous instruction. If it and the to-be-inserted waitcnt are the same, suppress the insertion. We keep the existing waitcnt under the assumption that whoever inserted it (e.g., the memory legalizer) knew what they were doing.
Follow-on work: teach the waitcnt pass to record the pre-existing waitcnts for better waitcnt production.
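
The check described in the summary can be sketched outside of LLVM as follows. This is only an illustration under stated assumptions: `Instr`, `Block`, and `insertWaitcntIfNotRedundant` are hypothetical stand-ins, not the types or functions used in SIInsertWaitcnts.cpp.

```cpp
#include <cstdint>
#include <iterator>
#include <list>

// Hypothetical stand-in for a machine instruction (not llvm::MachineInstr).
struct Instr {
  bool IsWaitcnt;   // true if this models an s_waitcnt
  std::int64_t Imm; // encoded waitcnt immediate; ignored for other instrs
};

using Block = std::list<Instr>;

// Insert a waitcnt with immediate NewImm before Pos, unless the instruction
// directly before Pos is already a waitcnt with the same immediate.
// Returns true if a new instruction was inserted.
bool insertWaitcntIfNotRedundant(Block &B, Block::iterator Pos,
                                 std::int64_t NewImm) {
  if (Pos != B.begin()) {
    const Instr &Prev = *std::prev(Pos);
    if (Prev.IsWaitcnt && Prev.Imm == NewImm)
      return false; // keep the pre-existing waitcnt, skip the duplicate
  }
  B.insert(Pos, Instr{true, NewImm});
  return true;
}
```

Note that the pre-existing instruction is never modified or removed; only the new insertion is suppressed.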

Diff Detail

Repository: rL LLVM

Event Timeline

msearles created this revision. Feb 2 2018, 10:16 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Feb 2 2018, 10:16 AM

msearles added reviewers: kzhuravl, rampitec. Feb 2 2018, 10:17 AM

msearles added a project: Restricted Project.

msearles added a subscriber: llvm-commits.

A concern is that you do not want to remove an original waitcnt when inserting a new one, as the pass may iterate and subsequently decide not to add a waitcnt there, but will have eliminated a waitcnt needed to implement the memory model. Is that an issue?

rampitec added inline comments. Feb 2 2018, 10:45 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗	(On Diff #132622)	We probably should not check for identity, but instead create a single strongest wait combined from the two. For example, if one waits for vmcnt(2) and the other for vmcnt(3), keep vmcnt(2). If one waits for lgkmcnt and the other does not, wait for lgkmcnt. Generalizing: produce one wait with: s_waitcnt vmcnt(min(vmcnts[])), expcnt(min(expcnts[])), lgkmcnt(min(lgkmcnts[]))
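
A minimal sketch of the "strongest wait" combination suggested above, working on decoded counter values. The `Waitcnt` struct and `combineStrongest` are hypothetical names for illustration and are not part of the patch; a counter that is not waited on is assumed to hold its maximum value, so taking the minimum keeps the stricter wait.

```cpp
#include <algorithm>

// Decoded counters of one s_waitcnt; a smaller value is a stricter wait.
// Hypothetical type for illustration only.
struct Waitcnt {
  unsigned Vmcnt;
  unsigned Expcnt;
  unsigned Lgkmcnt;
};

// Combine two waits into a single wait that satisfies both.
Waitcnt combineStrongest(const Waitcnt &A, const Waitcnt &B) {
  return Waitcnt{std::min(A.Vmcnt, B.Vmcnt),
                 std::min(A.Expcnt, B.Expcnt),
                 std::min(A.Lgkmcnt, B.Lgkmcnt)};
}
```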

In D42854#996486, @t-tye wrote:

A concern is that you do not want to remove an original waitcnt when inserting a new one, as the pass may iterate and subsequently decide not to add a waitcnt there, but will have eliminated a waitcnt needed to implement the memory model. Is that an issue?

OK, since we need to ensure that the existing waitcnts are not touched, I can mod the patch so that it doesn't remove the existing waitcnt and suppresses the to-be-inserted-by-the-waitcnt-pass waitcnt.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗	(On Diff #132622)	Agreed in principle; however, I have a sense that the common case is the redundant waitcnt and, moreover, that the common case is redundant because of interaction with the memory legalizer. Per Tony, the existing waitcnt instrs should be left alone; it might get messy to attempt to create the strongest waitcnt instr from an existing waitcnt and a waitcnt-pass waitcnt.

rampitec added inline comments. Feb 3 2018, 10:44 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗	(On Diff #132622)	We can keep it in mind for the future. I believe a more common case is a wait to 0 inserted by the legalizer and a wait to a higher count inserted here, so they are not necessarily equal.

Can the pass update its internal state while walking the control flow to factor in the consequences of the original waitcnts? That way a decision as to whether a waitcnt is required will take into account these original waitcnts. This means the benefit is obtained regardless of whether the waitcnts are adjacent or separated (even in different BBs).

It seems that a separate pass could be done after the final waitcnts have been decided to collapse adjacent waitcnts into a single one if possible. Or perhaps it would be better to postpone inserting the waitcnts until after the dataflow iteration has found a fixed point, at which time any original/deduced waitcnts can be merged if adjacent.
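
The separate collapsing pass proposed here could look roughly like the sketch below. It is not code from this revision: `SimpleInstr` and `collapseAdjacentWaitcnts` are hypothetical, the instruction stream is modeled as a plain vector, and counters are assumed to be already decoded. Because the merge takes the minimum of each counter, the merged wait is at least as strict as every waitcnt it replaces.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified instruction model for illustration only.
struct SimpleInstr {
  bool IsWaitcnt;
  unsigned Vmcnt, Expcnt, Lgkmcnt; // meaningful only when IsWaitcnt is true
};

// Merge each run of adjacent waitcnts into one waitcnt whose counters are
// the per-counter minimum of the run; other instructions are copied as-is.
std::vector<SimpleInstr>
collapseAdjacentWaitcnts(const std::vector<SimpleInstr> &In) {
  std::vector<SimpleInstr> Out;
  for (const SimpleInstr &I : In) {
    if (I.IsWaitcnt && !Out.empty() && Out.back().IsWaitcnt) {
      SimpleInstr &Prev = Out.back();
      Prev.Vmcnt = std::min(Prev.Vmcnt, I.Vmcnt);
      Prev.Expcnt = std::min(Prev.Expcnt, I.Expcnt);
      Prev.Lgkmcnt = std::min(Prev.Lgkmcnt, I.Lgkmcnt);
      continue; // folded into the previous waitcnt
    }
    Out.push_back(I);
  }
  return Out;
}
```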

In D42854#997187, @t-tye wrote:

Can the pass update its internal state while walking the control flow to factor in the consequences of the original waitcnts? That way a decision as to whether a waitcnt is required will take into account these original waitcnts. This means the benefit is obtained regardless of whether the waitcnts are adjacent or separated (even in different BBs).

It seems that a separate pass could be done after the final waitcnts have been decided to collapse adjacent waitcnts into a single one if possible. Or perhaps it would be better to postpone inserting the waitcnts until after the dataflow iteration has found a fixed point, at which time any original/deduced waitcnts can be merged if adjacent.

Yes, all of that can be done; it's been on the TODO list for several months; see TODO at line 1531; my intent for this patch was to grab the low-hanging fruit re: redundant waitcnt instrs.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1567 ↗	(On Diff #132622)	Agreed.

msearles planned changes to this revision. Feb 4 2018, 12:03 PM

Don't remove existing waitcnt instrs; if a redundant waitcnt is to be inserted, keep the existing waitcnt and don't insert the duplicate.
Fix the MIR test.

rampitec added inline comments. Feb 6 2018, 5:00 PM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1533 ↗	(On Diff #133104)	!TrackedWaitcntSet.count(&Inst) I guess.
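
For readers unfamiliar with the idiom being requested: on a set container, `!Set.count(Key)` and `Set.find(Key) == Set.end()` test the same thing. The sketch below uses `std::unordered_set` as a stand-in for `DenseSet`, which is an assumption made only to keep the example self-contained; the helper names are hypothetical.

```cpp
#include <unordered_set>

// Both helpers answer "is Key absent from Set?".
template <typename T>
bool absentViaFind(const std::unordered_set<T> &Set, const T &Key) {
  return Set.find(Key) == Set.end(); // the spelling used before this patch
}

template <typename T>
bool absentViaCount(const std::unordered_set<T> &Set, const T &Key) {
  return !Set.count(Key); // the shorter spelling suggested in review
}
```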

Adjust per reviewer comment

msearles marked an inline comment as done. Feb 6 2018, 5:56 PM

msearles added inline comments.

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1533 ↗	(On Diff #133104)	Done; made equivalent changes in several other places.

LGTM

This revision is now accepted and ready to land. Feb 6 2018, 5:59 PM

Closed by commit rL324440: [AMDGPU] Suppress redundant waitcnt instrs. (authored by msearles). Feb 6 2018, 6:23 PM

This revision was automatically updated to reflect the committed changes.

msearles marked an inline comment as done.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	2 lines
llvm/trunk/lib/Target/AMDGPU/SIInsertWaitcnts.cpp	56 lines
llvm/trunk/test/CodeGen/AMDGPU/waitcnt-no-redundant.mir	24 lines

Diff 133127

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

(868 lines not shown)	void GCNPassConfig::addPreEmitPass() {
// are multiple scheduling regions in a basic block, the regions are scheduled		// are multiple scheduling regions in a basic block, the regions are scheduled
// bottom up, so when we begin to schedule a region we don't know what		// bottom up, so when we begin to schedule a region we don't know what
// instructions were emitted directly before it.		// instructions were emitted directly before it.
//		//
// Here we add a stand-alone hazard recognizer pass which can handle all		// Here we add a stand-alone hazard recognizer pass which can handle all
// cases.		// cases.
addPass(&PostRAHazardRecognizerID);		addPass(&PostRAHazardRecognizerID);

		addPass(createSIMemoryLegalizerPass());
if (EnableSIInsertWaitcntsPass)		if (EnableSIInsertWaitcntsPass)
addPass(createSIInsertWaitcntsPass());		addPass(createSIInsertWaitcntsPass());
else		else
addPass(createSIInsertWaitsPass());		addPass(createSIInsertWaitsPass());
addPass(createSIShrinkInstructionsPass());		addPass(createSIShrinkInstructionsPass());
addPass(&SIInsertSkipsPassID);		addPass(&SIInsertSkipsPassID);
addPass(createSIMemoryLegalizerPass());
addPass(createSIDebuggerInsertNopsPass());		addPass(createSIDebuggerInsertNopsPass());
addPass(&BranchRelaxationPassID);		addPass(&BranchRelaxationPassID);
}		}

TargetPassConfig *GCNTargetMachine::createPassConfig(PassManagerBase &PM) {		TargetPassConfig *GCNTargetMachine::createPassConfig(PassManagerBase &PM) {
return new GCNPassConfig(*this, PM);		return new GCNPassConfig(*this, PM);
}		}

llvm/trunk/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

(355 lines not shown)	private:
const SIInstrInfo *TII = nullptr;		const SIInstrInfo *TII = nullptr;
const SIRegisterInfo *TRI = nullptr;		const SIRegisterInfo *TRI = nullptr;
const MachineRegisterInfo *MRI = nullptr;		const MachineRegisterInfo *MRI = nullptr;
const MachineLoopInfo *MLI = nullptr;		const MachineLoopInfo *MLI = nullptr;
AMDGPU::IsaInfo::IsaVersion IV;		AMDGPU::IsaInfo::IsaVersion IV;
AMDGPUAS AMDGPUASI;		AMDGPUAS AMDGPUASI;

DenseSet<MachineBasicBlock *> BlockVisitedSet;		DenseSet<MachineBasicBlock *> BlockVisitedSet;
DenseSet<MachineInstr *> CompilerGeneratedWaitcntSet;		DenseSet<MachineInstr *> TrackedWaitcntSet;
DenseSet<MachineInstr *> VCCZBugHandledSet;		DenseSet<MachineInstr *> VCCZBugHandledSet;

DenseMap<MachineBasicBlock *, std::unique_ptr<BlockWaitcntBrackets>>		DenseMap<MachineBasicBlock *, std::unique_ptr<BlockWaitcntBrackets>>
BlockWaitcntBracketsMap;		BlockWaitcntBracketsMap;

DenseSet<MachineBasicBlock *> BlockWaitcntProcessedSet;		DenseSet<MachineBasicBlock *> BlockWaitcntProcessedSet;

DenseMap<MachineLoop *, std::unique_ptr<LoopWaitcntData>> LoopWaitcntDataMap;		DenseMap<MachineLoop *, std::unique_ptr<LoopWaitcntData>> LoopWaitcntDataMap;
(736 lines not shown)	if (EmitSwaitcnt != 0) {
(AMDGPU::decodeLgkmcnt(IV, Imm) !=		(AMDGPU::decodeLgkmcnt(IV, Imm) !=
(CntVal[LGKM_CNT] & AMDGPU::getLgkmcntBitMask(IV)))) {		(CntVal[LGKM_CNT] & AMDGPU::getLgkmcntBitMask(IV)))) {
MachineLoop *ContainingLoop = MLI->getLoopFor(MI.getParent());		MachineLoop *ContainingLoop = MLI->getLoopFor(MI.getParent());
if (ContainingLoop) {		if (ContainingLoop) {
MachineBasicBlock *TBB = ContainingLoop->getHeader();		MachineBasicBlock *TBB = ContainingLoop->getHeader();
BlockWaitcntBrackets *ScoreBracket =		BlockWaitcntBrackets *ScoreBracket =
BlockWaitcntBracketsMap[TBB].get();		BlockWaitcntBracketsMap[TBB].get();
if (!ScoreBracket) {		if (!ScoreBracket) {
assert(BlockVisitedSet.find(TBB) == BlockVisitedSet.end());		assert(!BlockVisitedSet.count(TBB));
BlockWaitcntBracketsMap[TBB] =		BlockWaitcntBracketsMap[TBB] =
llvm::make_unique<BlockWaitcntBrackets>();		llvm::make_unique<BlockWaitcntBrackets>();
ScoreBracket = BlockWaitcntBracketsMap[TBB].get();		ScoreBracket = BlockWaitcntBracketsMap[TBB].get();
}		}
ScoreBracket->setRevisitLoop(true);		ScoreBracket->setRevisitLoop(true);
DEBUG(dbgs() << "set-revisit: block"		DEBUG(dbgs() << "set-revisit: block"
<< ContainingLoop->getHeader()->getNumber() << '\n';);		<< ContainingLoop->getHeader()->getNumber() << '\n';);
}		}
}		}

// Update an existing waitcount, or make a new one.		// Update an existing waitcount, or make a new one.
MachineFunction &MF = *MI.getParent()->getParent();		MachineFunction &MF = *MI.getParent()->getParent();
if (OldWaitcnt && OldWaitcnt->getOpcode() != AMDGPU::S_WAITCNT) {		if (OldWaitcnt && OldWaitcnt->getOpcode() != AMDGPU::S_WAITCNT) {
SWaitInst = OldWaitcnt;		SWaitInst = OldWaitcnt;
} else {		} else {
SWaitInst = MF.CreateMachineInstr(TII->get(AMDGPU::S_WAITCNT),		SWaitInst = MF.CreateMachineInstr(TII->get(AMDGPU::S_WAITCNT),
MI.getDebugLoc());		MI.getDebugLoc());
CompilerGeneratedWaitcntSet.insert(SWaitInst);		TrackedWaitcntSet.insert(SWaitInst);
}		}

const MachineOperand &Op =		const MachineOperand &Op =
MachineOperand::CreateImm(AMDGPU::encodeWaitcnt(		MachineOperand::CreateImm(AMDGPU::encodeWaitcnt(
IV, CntVal[VM_CNT], CntVal[EXP_CNT], CntVal[LGKM_CNT]));		IV, CntVal[VM_CNT], CntVal[EXP_CNT], CntVal[LGKM_CNT]));
SWaitInst->addOperand(MF, Op);		SWaitInst->addOperand(MF, Op);

if (CntVal[EXP_CNT] == 0) {		if (CntVal[EXP_CNT] == 0) {
(118 lines not shown)	void SIInsertWaitcnts::mergeInputScoreBrackets(MachineBasicBlock &Block) {
// need to handle single BBs with backedges to themselves. This means that		// need to handle single BBs with backedges to themselves. This means that
// they will need to retain and not clear their initial state.		// they will need to retain and not clear their initial state.

// See if there are any uninitialized predecessors. If so, emit an		// See if there are any uninitialized predecessors. If so, emit an
// s_waitcnt 0 at the beginning of the block.		// s_waitcnt 0 at the beginning of the block.
for (MachineBasicBlock *pred : Block.predecessors()) {		for (MachineBasicBlock *pred : Block.predecessors()) {
BlockWaitcntBrackets *PredScoreBrackets =		BlockWaitcntBrackets *PredScoreBrackets =
BlockWaitcntBracketsMap[pred].get();		BlockWaitcntBracketsMap[pred].get();
bool Visited = BlockVisitedSet.find(pred) != BlockVisitedSet.end();		bool Visited = BlockVisitedSet.count(pred);
if (!Visited || PredScoreBrackets->getWaitAtBeginning()) {		if (!Visited || PredScoreBrackets->getWaitAtBeginning()) {
continue;		continue;
}		}
for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;		for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
T = (enum InstCounterType)(T + 1)) {		T = (enum InstCounterType)(T + 1)) {
int span =		int span =
PredScoreBrackets->getScoreUB(T) - PredScoreBrackets->getScoreLB(T);		PredScoreBrackets->getScoreUB(T) - PredScoreBrackets->getScoreLB(T);
MaxPending[T] = std::max(MaxPending[T], span);		MaxPending[T] = std::max(MaxPending[T], span);
(22 lines not shown)	for (unsigned int I = 0; I < KillWaitBrackets.size(); I++) {
MixedExpTypes |= KillWaitBrackets[I]->mixedExpTypes();		MixedExpTypes |= KillWaitBrackets[I]->mixedExpTypes();
}		}
}		}

// Special handling for GDS_GPR_LOCK and EXP_GPR_LOCK.		// Special handling for GDS_GPR_LOCK and EXP_GPR_LOCK.
for (MachineBasicBlock *Pred : Block.predecessors()) {		for (MachineBasicBlock *Pred : Block.predecessors()) {
BlockWaitcntBrackets *PredScoreBrackets =		BlockWaitcntBrackets *PredScoreBrackets =
BlockWaitcntBracketsMap[Pred].get();		BlockWaitcntBracketsMap[Pred].get();
bool Visited = BlockVisitedSet.find(Pred) != BlockVisitedSet.end();		bool Visited = BlockVisitedSet.count(Pred);
if (!Visited || PredScoreBrackets->getWaitAtBeginning()) {		if (!Visited || PredScoreBrackets->getWaitAtBeginning()) {
continue;		continue;
}		}

int GDSSpan = PredScoreBrackets->getEventUB(GDS_GPR_LOCK) -		int GDSSpan = PredScoreBrackets->getEventUB(GDS_GPR_LOCK) -
PredScoreBrackets->getScoreLB(EXP_CNT);		PredScoreBrackets->getScoreLB(EXP_CNT);
MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], GDSSpan);		MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], GDSSpan);
int EXPSpan = PredScoreBrackets->getEventUB(EXP_GPR_LOCK) -		int EXPSpan = PredScoreBrackets->getEventUB(EXP_GPR_LOCK) -
(31 lines not shown)	for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
ScoreBrackets->setScoreLB(T, 0);		ScoreBrackets->setScoreLB(T, 0);
ScoreBrackets->setLastFlat(T, MaxFlat[T]);		ScoreBrackets->setLastFlat(T, MaxFlat[T]);
}		}

ScoreBrackets->setMixedExpTypes(MixedExpTypes);		ScoreBrackets->setMixedExpTypes(MixedExpTypes);

// Set the register scoreboard.		// Set the register scoreboard.
for (MachineBasicBlock *Pred : Block.predecessors()) {		for (MachineBasicBlock *Pred : Block.predecessors()) {
if (BlockVisitedSet.find(Pred) == BlockVisitedSet.end()) {		if (!BlockVisitedSet.count(Pred)) {
continue;		continue;
}		}

BlockWaitcntBrackets *PredScoreBrackets =		BlockWaitcntBrackets *PredScoreBrackets =
BlockWaitcntBracketsMap[Pred].get();		BlockWaitcntBracketsMap[Pred].get();

// Now merge the gpr_reg_score information		// Now merge the gpr_reg_score information
for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;		for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
(97 lines not shown)	for (unsigned int I = 0; I < KillWaitBrackets.size(); I++) {
}		}
}		}
}		}

// Special case handling of GDS_GPR_LOCK and EXP_GPR_LOCK. Merge this for the		// Special case handling of GDS_GPR_LOCK and EXP_GPR_LOCK. Merge this for the
// sequencing predecessors, because changes to EXEC require waitcnts due to		// sequencing predecessors, because changes to EXEC require waitcnts due to
// the delayed nature of these operations.		// the delayed nature of these operations.
for (MachineBasicBlock *Pred : Block.predecessors()) {		for (MachineBasicBlock *Pred : Block.predecessors()) {
if (BlockVisitedSet.find(Pred) == BlockVisitedSet.end()) {		if (!BlockVisitedSet.count(Pred)) {
continue;		continue;
}		}

BlockWaitcntBrackets *PredScoreBrackets =		BlockWaitcntBrackets *PredScoreBrackets =
BlockWaitcntBracketsMap[Pred].get();		BlockWaitcntBracketsMap[Pred].get();

int pred_gds_ub = PredScoreBrackets->getEventUB(GDS_GPR_LOCK);		int pred_gds_ub = PredScoreBrackets->getEventUB(GDS_GPR_LOCK);
if (pred_gds_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {		if (pred_gds_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {
(45 lines not shown)	void SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
// Walk over the instructions.		// Walk over the instructions.
for (MachineBasicBlock::iterator Iter = Block.begin(), E = Block.end();		for (MachineBasicBlock::iterator Iter = Block.begin(), E = Block.end();
Iter != E;) {		Iter != E;) {
MachineInstr &Inst = *Iter;		MachineInstr &Inst = *Iter;
// Remove any previously existing waitcnts.		// Remove any previously existing waitcnts.
if (Inst.getOpcode() == AMDGPU::S_WAITCNT) {		if (Inst.getOpcode() == AMDGPU::S_WAITCNT) {
// TODO: Register the old waitcnt and optimize the following waitcnts.		// TODO: Register the old waitcnt and optimize the following waitcnts.
// Leaving the previously existing waitcnts is conservatively correct.		// Leaving the previously existing waitcnts is conservatively correct.
if (CompilerGeneratedWaitcntSet.find(&Inst) ==		if (!TrackedWaitcntSet.count(&Inst))
CompilerGeneratedWaitcntSet.end())
++Iter;		++Iter;
else {		else {
ScoreBrackets->setWaitcnt(&Inst);		ScoreBrackets->setWaitcnt(&Inst);
++Iter;		++Iter;
Inst.removeFromParent();		Inst.removeFromParent();
}		}
continue;		continue;
}		}

// Kill instructions generate a conditional branch to the endmain block.		// Kill instructions generate a conditional branch to the endmain block.
// Merge the current waitcnt state into the endmain block information.		// Merge the current waitcnt state into the endmain block information.
// TODO: Are there other flavors of KILL instruction?		// TODO: Are there other flavors of KILL instruction?
if (Inst.getOpcode() == AMDGPU::KILL) {		if (Inst.getOpcode() == AMDGPU::KILL) {
addKillWaitBracket(ScoreBrackets);		addKillWaitBracket(ScoreBrackets);
}		}

bool VCCZBugWorkAround = false;		bool VCCZBugWorkAround = false;
if (readsVCCZ(Inst) &&		if (readsVCCZ(Inst) &&
(VCCZBugHandledSet.find(&Inst) == VCCZBugHandledSet.end())) {		(!VCCZBugHandledSet.count(&Inst))) {
if (ScoreBrackets->getScoreLB(LGKM_CNT) <		if (ScoreBrackets->getScoreLB(LGKM_CNT) <
ScoreBrackets->getScoreUB(LGKM_CNT) &&		ScoreBrackets->getScoreUB(LGKM_CNT) &&
ScoreBrackets->hasPendingSMEM()) {		ScoreBrackets->hasPendingSMEM()) {
if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS)		if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS)
VCCZBugWorkAround = true;		VCCZBugWorkAround = true;
}		}
}		}

// Generate an s_waitcnt instruction to be placed before		// Generate an s_waitcnt instruction to be placed before
// cur_Inst, if needed.		// cur_Inst, if needed.
MachineInstr *SWaitInst = generateSWaitCntInstBefore(Inst, ScoreBrackets);		MachineInstr *SWaitInst = generateSWaitCntInstBefore(Inst, ScoreBrackets);

if (SWaitInst) {		if (SWaitInst) {
		// We don't (yet) track waitcnts that existed prior to the waitcnt
		// pass (we just skip over them); because the waitcnt pass is ignorant
		// of them, it may insert a redundant waitcnt. To avoid this, check
		// the prev instr. If it and the to-be-inserted waitcnt are the
		// same, keep the prev waitcnt and skip the insertion. We assume that
		// whoever (e.g., for the memory model) inserted the prev waitcnt really
		// wants it there.
		bool insertSWaitInst = true;
		if (Iter != Block.begin()) {
		MachineInstr *MIPrevInst = &*std::prev(Iter);
		if (MIPrevInst &&
		MIPrevInst->getOpcode() == AMDGPU::S_WAITCNT &&
		MIPrevInst->getOperand(0).getImm() == SWaitInst->getOperand(0).getImm()) {
		insertSWaitInst = false;
		}
		}
		if (insertSWaitInst) {
Block.insert(Inst, SWaitInst);		Block.insert(Inst, SWaitInst);
if (ScoreBrackets->getWaitcnt() != SWaitInst) {		if (ScoreBrackets->getWaitcnt() != SWaitInst) {
DEBUG(dbgs() << "insertWaitcntInBlock\n"		DEBUG(dbgs() << "insertWaitcntInBlock\n"
<< "Old Instr: " << Inst << '\n'		<< "Old Instr: " << Inst << '\n'
<< "New Instr: " << *SWaitInst << '\n';);		<< "New Instr: " << *SWaitInst << '\n';);
}		}
}		}
		}

updateEventWaitCntAfter(Inst, ScoreBrackets);		updateEventWaitCntAfter(Inst, ScoreBrackets);

#if 0 // TODO: implement resource type check controlled by options with ub = LB.		#if 0 // TODO: implement resource type check controlled by options with ub = LB.
// If this instruction generates a S_SETVSKIP because it is an		// If this instruction generates a S_SETVSKIP because it is an
// indexed resource, and we are on Tahiti, then it will also force		// indexed resource, and we are on Tahiti, then it will also force
// an S_WAITCNT vmcnt(0)		// an S_WAITCNT vmcnt(0)
if (RequireCheckResourceType(Inst, context)) {		if (RequireCheckResourceType(Inst, context)) {
(69 lines not shown)	if (WaitcntData->getIterCnt() > 2) {
HasPending = true;		HasPending = true;
}		}
}		}

if (HasPending) {		if (HasPending) {
if (!SWaitInst) {		if (!SWaitInst) {
SWaitInst = Block.getParent()->CreateMachineInstr(		SWaitInst = Block.getParent()->CreateMachineInstr(
TII->get(AMDGPU::S_WAITCNT), DebugLoc());		TII->get(AMDGPU::S_WAITCNT), DebugLoc());
CompilerGeneratedWaitcntSet.insert(SWaitInst);		TrackedWaitcntSet.insert(SWaitInst);
const MachineOperand &Op = MachineOperand::CreateImm(0);		const MachineOperand &Op = MachineOperand::CreateImm(0);
SWaitInst->addOperand(MF, Op);		SWaitInst->addOperand(MF, Op);
#if 0 // TODO: Format the debug output		#if 0 // TODO: Format the debug output
OutputTransformBanner("insertWaitcntInBlock",0,"Create:",context);		OutputTransformBanner("insertWaitcntInBlock",0,"Create:",context);
OutputTransformAdd(SWaitInst, context);		OutputTransformAdd(SWaitInst, context);
#endif		#endif
}		}
#if 0 // TODO: ??		#if 0 // TODO: ??
(39 lines not shown)	bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {

RegisterEncoding.VGPR0 = TRI->getEncodingValue(AMDGPU::VGPR0);		RegisterEncoding.VGPR0 = TRI->getEncodingValue(AMDGPU::VGPR0);
RegisterEncoding.VGPRL =		RegisterEncoding.VGPRL =
RegisterEncoding.VGPR0 + HardwareLimits.NumVGPRsMax - 1;		RegisterEncoding.VGPR0 + HardwareLimits.NumVGPRsMax - 1;
RegisterEncoding.SGPR0 = TRI->getEncodingValue(AMDGPU::SGPR0);		RegisterEncoding.SGPR0 = TRI->getEncodingValue(AMDGPU::SGPR0);
RegisterEncoding.SGPRL =		RegisterEncoding.SGPRL =
RegisterEncoding.SGPR0 + HardwareLimits.NumSGPRsMax - 1;		RegisterEncoding.SGPR0 + HardwareLimits.NumSGPRsMax - 1;

		TrackedWaitcntSet.clear();
		BlockVisitedSet.clear();
		VCCZBugHandledSet.clear();

// Walk over the blocks in reverse post-dominator order, inserting		// Walk over the blocks in reverse post-dominator order, inserting
// s_waitcnt where needed.		// s_waitcnt where needed.
ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);		ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
bool Modified = false;		bool Modified = false;
for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator		for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator
I = RPOT.begin(),		I = RPOT.begin(),
E = RPOT.end(), J = RPOT.begin();		E = RPOT.end(), J = RPOT.begin();
I != E;) {		I != E;) {
(10 lines not shown)	for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator
MachineLoop *ContainingLoop = MLI->getLoopFor(&MBB);		MachineLoop *ContainingLoop = MLI->getLoopFor(&MBB);
if (ContainingLoop && LoopWaitcntDataMap[ContainingLoop] == nullptr)		if (ContainingLoop && LoopWaitcntDataMap[ContainingLoop] == nullptr)
LoopWaitcntDataMap[ContainingLoop] = llvm::make_unique<LoopWaitcntData>();		LoopWaitcntDataMap[ContainingLoop] = llvm::make_unique<LoopWaitcntData>();

// If we are walking into the block from before the loop, then guarantee		// If we are walking into the block from before the loop, then guarantee
// at least 1 re-walk over the loop to propagate the information, even if		// at least 1 re-walk over the loop to propagate the information, even if
// no S_WAITCNT instructions were generated.		// no S_WAITCNT instructions were generated.
if (ContainingLoop && ContainingLoop->getHeader() == &MBB && J < I &&		if (ContainingLoop && ContainingLoop->getHeader() == &MBB && J < I &&
(BlockWaitcntProcessedSet.find(&MBB) ==		(!BlockWaitcntProcessedSet.count(&MBB))) {
BlockWaitcntProcessedSet.end())) {
BlockWaitcntBracketsMap[&MBB]->setRevisitLoop(true);		BlockWaitcntBracketsMap[&MBB]->setRevisitLoop(true);
DEBUG(dbgs() << "set-revisit: block"		DEBUG(dbgs() << "set-revisit: block"
<< ContainingLoop->getHeader()->getNumber() << '\n';);		<< ContainingLoop->getHeader()->getNumber() << '\n';);
}		}

// Walk over the instructions.		// Walk over the instructions.
insertWaitcntInBlock(MF, MBB);		insertWaitcntInBlock(MF, MBB);

(103 lines not shown)

llvm/trunk/test/CodeGen/AMDGPU/waitcnt-no-redundant.mir

				# RUN: llc -mtriple=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts -o - %s | FileCheck %s

				# Check that the waitcnt pass does not insert a redundant waitcnt instr.
				# In this testcase, ensure that pass does not insert redundant S_WAITCNT 127
				# or S_WAITCNT 3952

				...
				# CHECK-LABEL: name: waitcnt-no-redundant
				# CHECK: DS_READ_B64
				# CHECK-NEXT: S_WAITCNT 127
				# CHECK-NEXT: FLAT_ATOMIC_CMPSWAP
				# CHECK-NEXT: S_WAITCNT 3952
				# CHECK-NEXT: BUFFER_WBINVL1_VOL

				name: waitcnt-no-redundant
				body: |
				bb.0:
				renamable $vgpr0_vgpr1 = DS_READ_B64 killed renamable $vgpr0, 0, 0, implicit $m0, implicit $exec
				S_WAITCNT 127
				FLAT_ATOMIC_CMPSWAP killed renamable $vgpr0_vgpr1, killed renamable $vgpr3_vgpr4, 0, 0, implicit $exec, implicit $flat_scr
				S_WAITCNT 3952
				BUFFER_WBINVL1_VOL implicit $exec
				S_ENDPGM
				...
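
To read the immediates in the test above, the helper below decodes an S_WAITCNT operand into its three counters. It is a sketch that assumes the field layout used by targets before gfx9 (vmcnt in bits [3:0], expcnt in bits [6:4], lgkmcnt in bits [11:8]); the pass itself decodes through ISA-version-aware helpers such as `AMDGPU::decodeLgkmcnt(IV, Imm)`. `DecodedWaitcnt` and `decodeWaitcnt` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical holder for the decoded counters of one S_WAITCNT immediate.
struct DecodedWaitcnt {
  unsigned Vmcnt;
  unsigned Expcnt;
  unsigned Lgkmcnt;
};

// Assumed pre-gfx9 layout: vmcnt [3:0], expcnt [6:4], lgkmcnt [11:8].
constexpr DecodedWaitcnt decodeWaitcnt(std::uint32_t Imm) {
  return DecodedWaitcnt{Imm & 0xFu, (Imm >> 4) & 0x7u, (Imm >> 8) & 0xFu};
}
```

Under that assumption, 127 (0x07F) leaves vmcnt and expcnt at their maximum and sets lgkmcnt to 0, i.e. it waits only on lgkmcnt(0); 3952 (0xF70) sets vmcnt to 0 and leaves the other two at their maximum, i.e. it waits only on vmcnt(0).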