This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/InsertWaitcnts: Untangle some semi-global state
Closed, Public

Authored by nhaehnle on Nov 7 2018, 2:15 PM.

Details

Reviewers

msearles
rampitec
scott.linder
kanarayan

Commits

rG1a94cbb3f54b: AMDGPU/InsertWaitcnts: Untangle some semi-global state
rL347848: AMDGPU/InsertWaitcnts: Untangle some semi-global state

Summary

Reduce the statefulness of the algorithm in two ways:

1. More clearly split generateWaitcntInstBefore into two phases: the first one which determines the required wait, if any, without changing the ScoreBrackets, and the second one which actually inserts the wait and updates the brackets.

2. Communicate pre-existing s_waitcnt instructions using an argument to generateWaitcntInstBefore instead of through the ScoreBrackets.
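
The two-phase split can be pictured with a toy model of a single counter. This is only a sketch: `ToyBracket` is a hypothetical stand-in for the real bracket class, it ignores out-of-order counters and FLAT operations, and it uses `~0u` to mean "no wait". The method names mirror `determineWait` and `applyWaitcnt` from the diff, but the signatures here are simplified assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Toy model of one counter's score bracket (LB, UB]. Hypothetical type,
// not the BlockWaitcntBrackets class from the patch.
struct ToyBracket {
  int LB = 0; // every score at or below LB is known to have completed
  int UB = 0; // score of the most recently issued operation

  // Phase 1: compute the count required before using a value with the
  // given score. Const, so it cannot modify the bracket.
  // Returns ~0u when no wait is required (assumption of this sketch).
  unsigned determineWait(int Score) const {
    if (Score > LB && Score <= UB)
      return static_cast<unsigned>(UB - Score);
    return ~0u;
  }

  // Phase 2: record that a wait with this count is actually emitted.
  void applyWaitcnt(unsigned Count) {
    if (Count >= static_cast<unsigned>(UB))
      return; // the wait does not retire anything
    LB = std::max(LB, UB - static_cast<int>(Count));
  }
};
```

With two operations outstanding (`UB = 2`), `determineWait(1)` yields a count of 1 and leaves `LB` untouched; only the later `applyWaitcnt(1)` advances `LB` to 1.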

To simplify these changes, a Waitcnt structure is introduced which carries
the counts of an s_waitcnt instruction in decoded form.
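
As a rough illustration of "decoded form", the sketch below keeps one plain count per counter instead of a packed immediate. The field names, `allZero()` and `addWait` follow the diff further down; the `Counter` enum and the explicit constructors are assumptions made so the example stands alone, and `~0u` is taken to mean "no wait on this counter".

```cpp
#include <algorithm>
#include <cassert>

// Stand-in for the pass's InstCounterType (illustrative only).
enum Counter { VM_CNT, EXP_CNT, LGKM_CNT };

// Decoded s_waitcnt: one count per counter.
// Assumption: ~0u means that no wait is required on that counter.
struct Waitcnt {
  unsigned VmCnt = ~0u;
  unsigned ExpCnt = ~0u;
  unsigned LgkmCnt = ~0u;

  Waitcnt() = default;
  Waitcnt(unsigned Vm, unsigned Exp, unsigned Lgkm)
      : VmCnt(Vm), ExpCnt(Exp), LgkmCnt(Lgkm) {}

  // Wait for every counter to drain completely.
  static Waitcnt allZero() { return Waitcnt(0, 0, 0); }
};

// A smaller count is a stronger wait, so merging requirements keeps the minimum.
void addWait(Waitcnt &Wait, Counter T, unsigned Count) {
  switch (T) {
  case VM_CNT:
    Wait.VmCnt = std::min(Wait.VmCnt, Count);
    break;
  case EXP_CNT:
    Wait.ExpCnt = std::min(Wait.ExpCnt, Count);
    break;
  case LGKM_CNT:
    Wait.LgkmCnt = std::min(Wait.LgkmCnt, Count);
    break;
  }
}
```

Because each field is independent, a requirement such as "only LGKM must drain" is expressed by setting `LgkmCnt` to 0 and leaving the other fields at their default.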

There are some functional changes:

The FIXME for the VCCZ bug workaround was implemented: we only wait for SMEM instructions as required instead of waiting on all counters.

There are some cases where we previously merged some waitcnt instructions together non-locally due to the somewhat odd OldWaitcnt tracking, e.g. we would produce code like this:

    ds_read_b32 v0, ...
    ds_read_b32 v1, ...
    s_waitcnt lgkmcnt(0)    <-- this is a merged wait for both uses
    use(v0)
    more code
    use(v1)

In these cases we will now always first emit a wait for lgkmcnt(1), and then later for lgkmcnt(0). This should basically always be a win, although theoretically there could be cases where it's very slightly worse due to the increased code size. The worst code size regressions in my shader-db are:

WORST REGRESSIONS - Code Size
Before  After  Delta  Percentage
  1724   1736     12      0.70 %   shaders/private/f1-2015/1334.shader_test [0]
  2276   2284      8      0.35 %   shaders/private/f1-2015/1306.shader_test [0]
  4632   4640      8      0.17 %   shaders/private/ue4_elemental/62.shader_test [0]
  2376   2384      8      0.34 %   shaders/private/f1-2015/1308.shader_test [0]
  3284   3292      8      0.24 %   shaders/private/talos_principle/1955.shader_test [0]

... so I'm not particularly worried about the rather theoretical downside.
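
To make the lgkmcnt(1)-then-lgkmcnt(0) sequence concrete, here is a small worked sketch. It assumes the loads complete in issue order and that nothing else touches the counter; `waitsForUses` is a hypothetical helper for illustration and does not exist in the pass.

```cpp
#include <cassert>
#include <vector>

// Given NumLoads loads issued back to back (scores 1..NumLoads) and the
// score of the load feeding each use, return the count emitted before
// each use that still needs a wait. Assumes in-order completion.
std::vector<unsigned> waitsForUses(int NumLoads,
                                   const std::vector<int> &UseScores) {
  std::vector<unsigned> Waits;
  int LB = 0; // scores at or below LB are known to have completed
  for (int Score : UseScores) {
    if (Score > LB) {
      // Allow the younger loads to stay outstanding.
      Waits.push_back(static_cast<unsigned>(NumLoads - Score));
      LB = Score;
    }
  }
  return Waits;
}
```

For the example in the summary, `waitsForUses(2, {1, 2})` gives the counts 1 and then 0, so use(v0) no longer has to wait for the second load. If the uses appear in the opposite order, a single wait with count 0 covers both.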

Diff Detail

Repository: rL LLVM

Event Timeline

nhaehnle created this revision. Nov 7 2018, 2:15 PM

Harbormaster completed remote builds in B24692: Diff 173027. Nov 7 2018, 2:15 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Nov 7 2018, 2:15 PM

nhaehnle added a child revision: D54227: AMDGPU/InsertWaitcnts: Use foreach loops for inst and wait event types. Nov 7 2018, 2:18 PM

nhaehnle added a parent revision: D54225: AMDGPU/InsertWaitcnts: Some more const-correctness.

Testcase for the 2nd change?

Turns out I was a bit too quick in my analysis of the second point.
I thought the overly conservative waitcnt was due to the control flow
in the shader I was looking at, but it was actually due to a pre-existing
waitcnt.

Harbormaster completed remote builds in B24779: Diff 173327. Nov 9 2018, 7:14 AM

Note, the new waitcnt-preexisting.mir test shows this change.

ping

Early-out in generateWaitcntInstBefore when no wait is needed. This helps keep the nesting complexity in check for later changes.
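
The shape of that early-out can be sketched for a single counter as follows. The comparison mirrors the per-counter `simplifyWaitcnt` in the diff; the free functions `simplifyCount` and `needsWaitInstr` are hypothetical names used only for this standalone example.

```cpp
#include <cassert>

// Returns true if waiting for Count would still retire something in the
// bracket (LB, UB]. Otherwise the count is redundant: reset it to ~0u
// ("no wait") and return false.
bool simplifyCount(int LB, int UB, unsigned &Count) {
  if (Count < static_cast<unsigned>(UB) && UB - static_cast<int>(Count) > LB)
    return true;
  Count = ~0u;
  return false;
}

// Early-out: decide up front whether any s_waitcnt is needed, so the
// insertion logic that follows does not have to be nested in a condition.
bool needsWaitInstr(int LB, int UB, unsigned &Count, bool ForceEmit) {
  if (!simplifyCount(LB, UB, Count) && !ForceEmit)
    return false; // nothing to emit
  return true;
}
```

With `LB = 1` and `UB = 2`, a requested count of 1 is redundant (the operation with score 1 has already completed) and is dropped, while a count of 0 is kept.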

LGTM

This revision is now accepted and ready to land. Nov 22 2018, 9:38 AM

Closed by commit rL347848: AMDGPU/InsertWaitcnts: Untangle some semi-global state (authored by nha). Nov 29 2018, 3:09 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

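Path                                                             Size
llvm/trunk/lib/Target/AMDGPU/SIInsertWaitcnts.cpp                436 lines
llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h              30 lines
llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp            12 lines
llvm/trunk/test/CodeGen/AMDGPU/smrd-vccz-bug.ll                  2 lines
llvm/trunk/test/CodeGen/AMDGPU/vccz-corrupt-bug-workaround.mir   2 lines
llvm/trunk/test/CodeGen/AMDGPU/waitcnt-preexisting.mir           3 lines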

Diff 175834

llvm/trunk/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

[121 lines hidden]	enum RegisterMapping {
NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // Where SGPR starts.		NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // Where SGPR starts.
};		};

#define ForAllWaitEventType(w) \		#define ForAllWaitEventType(w) \
for (enum WaitEventType w = (enum WaitEventType)0; \		for (enum WaitEventType w = (enum WaitEventType)0; \
(w) < (enum WaitEventType)NUM_WAIT_EVENTS; \		(w) < (enum WaitEventType)NUM_WAIT_EVENTS; \
(w) = (enum WaitEventType)((w) + 1))		(w) = (enum WaitEventType)((w) + 1))

		void addWait(AMDGPU::Waitcnt &Wait, InstCounterType T, unsigned Count) {
		switch (T) {
		case VM_CNT:
		Wait.VmCnt = std::min(Wait.VmCnt, Count);
		break;
		case EXP_CNT:
		Wait.ExpCnt = std::min(Wait.ExpCnt, Count);
		break;
		case LGKM_CNT:
		Wait.LgkmCnt = std::min(Wait.LgkmCnt, Count);
		break;
		default:
		llvm_unreachable("bad InstCounterType");
		}
		}

// This is a per-basic-block object that maintains current score brackets		// This is a per-basic-block object that maintains current score brackets
// of each wait counter, and a per-register scoreboard for each wait counter.		// of each wait counter, and a per-register scoreboard for each wait counter.
// We also maintain the latest score for every event type that can change the		// We also maintain the latest score for every event type that can change the
// waitcnt in order to know if there are multiple types of events within		// waitcnt in order to know if there are multiple types of events within
// the brackets. When multiple types of event happen in the bracket,		// the brackets. When multiple types of event happen in the bracket,
// wait count may get decreased out of order, therefore we need to put in		// wait count may get decreased out of order, therefore we need to put in
// "s_waitcnt 0" before use.		// "s_waitcnt 0" before use.
class BlockWaitcntBrackets {		class BlockWaitcntBrackets {
[90 lines hidden]	if (GprNo < NUM_ALL_VGPRS) {
SgprScores[GprNo - NUM_ALL_VGPRS] = Val;		SgprScores[GprNo - NUM_ALL_VGPRS] = Val;
}		}
}		}

int32_t getRegScore(int GprNo, InstCounterType T) {		int32_t getRegScore(int GprNo, InstCounterType T) {
if (GprNo < NUM_ALL_VGPRS) {		if (GprNo < NUM_ALL_VGPRS) {
return VgprScores[T][GprNo];		return VgprScores[T][GprNo];
}		}
		assert(T == LGKM_CNT);
return SgprScores[GprNo - NUM_ALL_VGPRS];		return SgprScores[GprNo - NUM_ALL_VGPRS];
}		}

void clear() {		void clear() {
memset(ScoreLBs, 0, sizeof(ScoreLBs));		memset(ScoreLBs, 0, sizeof(ScoreLBs));
memset(ScoreUBs, 0, sizeof(ScoreUBs));		memset(ScoreUBs, 0, sizeof(ScoreUBs));
memset(EventUBs, 0, sizeof(EventUBs));		memset(EventUBs, 0, sizeof(EventUBs));
for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;		for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
[20 lines hidden]	public:
int32_t getMaxSGPR() const { return SgprUB; }		int32_t getMaxSGPR() const { return SgprUB; }

int32_t getEventUB(enum WaitEventType W) const {		int32_t getEventUB(enum WaitEventType W) const {
assert(W < NUM_WAIT_EVENTS);		assert(W < NUM_WAIT_EVENTS);
return EventUBs[W];		return EventUBs[W];
}		}

bool counterOutOfOrder(InstCounterType T) const;		bool counterOutOfOrder(InstCounterType T) const;
unsigned int updateByWait(InstCounterType T, int ScoreToWait);		bool simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;
		bool simplifyWaitcnt(InstCounterType T, unsigned &Count) const;
		void determineWait(InstCounterType T, int ScoreToWait,
		AMDGPU::Waitcnt &Wait) const;
		void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
		void applyWaitcnt(InstCounterType T, unsigned Count);
void updateByEvent(const SIInstrInfo *TII, const SIRegisterInfo *TRI,		void updateByEvent(const SIInstrInfo *TII, const SIRegisterInfo *TRI,
const MachineRegisterInfo *MRI, WaitEventType E,		const MachineRegisterInfo *MRI, WaitEventType E,
MachineInstr &MI);		MachineInstr &MI);

bool hasPendingSMEM() const {		bool hasPendingSMEM() const {
return (EventUBs[SMEM_ACCESS] > ScoreLBs[LGKM_CNT] &&		return (EventUBs[SMEM_ACCESS] > ScoreLBs[LGKM_CNT] &&
EventUBs[SMEM_ACCESS] <= ScoreUBs[LGKM_CNT]);		EventUBs[SMEM_ACCESS] <= ScoreUBs[LGKM_CNT]);
}		}
[15 lines hidden]	public:
void setLastFlat(InstCounterType Ct, int Val) { LastFlat[Ct] = Val; }		void setLastFlat(InstCounterType Ct, int Val) { LastFlat[Ct] = Val; }

bool getRevisitLoop() const { return RevisitLoop; }		bool getRevisitLoop() const { return RevisitLoop; }
void setRevisitLoop(bool RevisitLoopIn) { RevisitLoop = RevisitLoopIn; }		void setRevisitLoop(bool RevisitLoopIn) { RevisitLoop = RevisitLoopIn; }

void setPostOrder(int32_t PostOrderIn) { PostOrder = PostOrderIn; }		void setPostOrder(int32_t PostOrderIn) { PostOrder = PostOrderIn; }
int32_t getPostOrder() const { return PostOrder; }		int32_t getPostOrder() const { return PostOrder; }

void setWaitcnt(MachineInstr *WaitcntIn) { Waitcnt = WaitcntIn; }
void clearWaitcnt() { Waitcnt = nullptr; }
MachineInstr *getWaitcnt() const { return Waitcnt; }

bool mixedExpTypes() const { return MixedExpTypes; }		bool mixedExpTypes() const { return MixedExpTypes; }
void setMixedExpTypes(bool MixedExpTypesIn) {		void setMixedExpTypes(bool MixedExpTypesIn) {
MixedExpTypes = MixedExpTypesIn;		MixedExpTypes = MixedExpTypesIn;
}		}

void print(raw_ostream &);		void print(raw_ostream &);
void dump() { print(dbgs()); }		void dump() { print(dbgs()); }

private:		private:
const GCNSubtarget *ST = nullptr;		const GCNSubtarget *ST = nullptr;
bool WaitAtBeginning = false;		bool WaitAtBeginning = false;
bool RevisitLoop = false;		bool RevisitLoop = false;
bool MixedExpTypes = false;		bool MixedExpTypes = false;
int32_t PostOrder = 0;		int32_t PostOrder = 0;
MachineInstr *Waitcnt = nullptr;
int32_t ScoreLBs[NUM_INST_CNTS] = {0};		int32_t ScoreLBs[NUM_INST_CNTS] = {0};
int32_t ScoreUBs[NUM_INST_CNTS] = {0};		int32_t ScoreUBs[NUM_INST_CNTS] = {0};
int32_t EventUBs[NUM_WAIT_EVENTS] = {0};		int32_t EventUBs[NUM_WAIT_EVENTS] = {0};
// Remember the last flat memory operation.		// Remember the last flat memory operation.
int32_t LastFlat[NUM_INST_CNTS] = {0};		int32_t LastFlat[NUM_INST_CNTS] = {0};
// wait_cnt scores for every vgpr.		// wait_cnt scores for every vgpr.
// Keep track of the VgprUB and SgprUB to make merge at join efficient.		// Keep track of the VgprUB and SgprUB to make merge at join efficient.
int32_t VgprUB = 0;		int32_t VgprUB = 0;
[109 lines hidden]	#ifndef NDEBUG
} else {		} else {
ForceEmitWaitcnt[VM_CNT] = false;		ForceEmitWaitcnt[VM_CNT] = false;
}		}
#endif // NDEBUG		#endif // NDEBUG
}		}

bool mayAccessLDSThroughFlat(const MachineInstr &MI) const;		bool mayAccessLDSThroughFlat(const MachineInstr &MI) const;
void generateWaitcntInstBefore(MachineInstr &MI,		void generateWaitcntInstBefore(MachineInstr &MI,
BlockWaitcntBrackets *ScoreBrackets);		BlockWaitcntBrackets *ScoreBrackets,
		MachineInstr *OldWaitcntInstr);
void updateEventWaitcntAfter(MachineInstr &Inst,		void updateEventWaitcntAfter(MachineInstr &Inst,
BlockWaitcntBrackets *ScoreBrackets);		BlockWaitcntBrackets *ScoreBrackets);
void mergeInputScoreBrackets(MachineBasicBlock &Block);		void mergeInputScoreBrackets(MachineBasicBlock &Block);
bool isLoopBottom(const MachineLoop *Loop, const MachineBasicBlock *Block);		bool isLoopBottom(const MachineLoop *Loop, const MachineBasicBlock *Block);
unsigned countNumBottomBlocks(const MachineLoop *Loop);		unsigned countNumBottomBlocks(const MachineLoop *Loop);
void insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block);		void insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block);
void insertWaitcntBeforeCF(MachineBasicBlock &Block, MachineInstr *Inst);		void insertWaitcntBeforeCF(MachineBasicBlock &Block, MachineInstr *Inst);
bool isWaitcntStronger(unsigned LHS, unsigned RHS);
unsigned combineWaitcnt(unsigned LHS, unsigned RHS);
};		};

} // end anonymous namespace		} // end anonymous namespace

RegInterval BlockWaitcntBrackets::getRegInterval(const MachineInstr *MI,		RegInterval BlockWaitcntBrackets::getRegInterval(const MachineInstr *MI,
const SIInstrInfo *TII,		const SIInstrInfo *TII,
const MachineRegisterInfo *MRI,		const MachineRegisterInfo *MRI,
const SIRegisterInfo *TRI,		const SIRegisterInfo *TRI,
[241 lines hidden]	if (LB < UB) {
}		}
}		}
}		}
OS << '\n';		OS << '\n';
}		}
OS << '\n';		OS << '\n';
}		}

unsigned int BlockWaitcntBrackets::updateByWait(InstCounterType T,		/// Simplify the waitcnt, in the sense of removing redundant counts, and return
int ScoreToWait) {		/// whether a waitcnt instruction is needed at all.
unsigned int NeedWait = 0;		bool BlockWaitcntBrackets::simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const {
		return simplifyWaitcnt(VM_CNT, Wait.VmCnt) |
		simplifyWaitcnt(EXP_CNT, Wait.ExpCnt) |
		simplifyWaitcnt(LGKM_CNT, Wait.LgkmCnt);
		}

		bool BlockWaitcntBrackets::simplifyWaitcnt(InstCounterType T,
		unsigned &Count) const {
		const int32_t LB = getScoreLB(T);
		const int32_t UB = getScoreUB(T);
		if (Count < (unsigned)UB && UB - (int32_t)Count > LB)
		return true;

		Count = ~0u;
		return false;
		}

		void BlockWaitcntBrackets::determineWait(InstCounterType T, int ScoreToWait,
		AMDGPU::Waitcnt &Wait) const {
if (ScoreToWait == -1) {		if (ScoreToWait == -1) {
// The score to wait is unknown. This implies that it was not encountered		// The score to wait is unknown. This implies that it was not encountered
// during the path of the CFG walk done during the current traversal but		// during the path of the CFG walk done during the current traversal but
// may be seen on a different path. Emit an s_wait counter with a		// may be seen on a different path. Emit an s_wait counter with a
// conservative value of 0 for the counter.		// conservative value of 0 for the counter.
NeedWait = CNT_MASK(T);		addWait(Wait, T, 0);
setScoreLB(T, getScoreUB(T));		return;
return NeedWait;
}		}

// If the score of src_operand falls within the bracket, we need an		// If the score of src_operand falls within the bracket, we need an
// s_waitcnt instruction.		// s_waitcnt instruction.
const int32_t LB = getScoreLB(T);		const int32_t LB = getScoreLB(T);
const int32_t UB = getScoreUB(T);		const int32_t UB = getScoreUB(T);
if ((UB >= ScoreToWait) && (ScoreToWait > LB)) {		if ((UB >= ScoreToWait) && (ScoreToWait > LB)) {
if ((T == VM_CNT || T == LGKM_CNT) &&		if ((T == VM_CNT || T == LGKM_CNT) &&
hasPendingFlat() &&		hasPendingFlat() &&
!ST->hasFlatLgkmVMemCountInOrder()) {		!ST->hasFlatLgkmVMemCountInOrder()) {
// If there is a pending FLAT operation, and this is a VMem or LGKM		// If there is a pending FLAT operation, and this is a VMem or LGKM
// waitcnt and the target can report early completion, then we need		// waitcnt and the target can report early completion, then we need
// to force a waitcnt 0.		// to force a waitcnt 0.
NeedWait = CNT_MASK(T);		addWait(Wait, T, 0);
setScoreLB(T, getScoreUB(T));
} else if (counterOutOfOrder(T)) {		} else if (counterOutOfOrder(T)) {
// Counter can get decremented out-of-order when there		// Counter can get decremented out-of-order when there
// are multiple types event in the bracket. Also emit an s_wait counter		// are multiple types event in the bracket. Also emit an s_wait counter
// with a conservative value of 0 for the counter.		// with a conservative value of 0 for the counter.
NeedWait = CNT_MASK(T);		addWait(Wait, T, 0);
setScoreLB(T, getScoreUB(T));
} else {		} else {
NeedWait = CNT_MASK(T);		addWait(Wait, T, UB - ScoreToWait);
setScoreLB(T, ScoreToWait);		}
		}
}		}

		void BlockWaitcntBrackets::applyWaitcnt(const AMDGPU::Waitcnt &Wait) {
		applyWaitcnt(VM_CNT, Wait.VmCnt);
		applyWaitcnt(EXP_CNT, Wait.ExpCnt);
		applyWaitcnt(LGKM_CNT, Wait.LgkmCnt);

		if (Wait.ExpCnt == 0)
		setMixedExpTypes(false);
}		}

return NeedWait;		void BlockWaitcntBrackets::applyWaitcnt(InstCounterType T, unsigned Count) {
		const int32_t UB = getScoreUB(T);
		if (Count >= (unsigned)UB)
		return;
		if (Count != 0) {
		if (counterOutOfOrder(T))
		return;
		setScoreLB(T, std::max(getScoreLB(T), UB - (int32_t)Count));
		} else {
		setScoreLB(T, UB);
		}
}		}

// Where there are multiple types of event in the bracket of a counter,		// Where there are multiple types of event in the bracket of a counter,
// the decrement may go out of order.		// the decrement may go out of order.
bool BlockWaitcntBrackets::counterOutOfOrder(InstCounterType T) const {		bool BlockWaitcntBrackets::counterOutOfOrder(InstCounterType T) const {
switch (T) {		switch (T) {
case VM_CNT:		case VM_CNT:
return false;		return false;
[74 lines hidden]
}		}

static bool readsVCCZ(const MachineInstr &MI) {		static bool readsVCCZ(const MachineInstr &MI) {
unsigned Opc = MI.getOpcode();		unsigned Opc = MI.getOpcode();
return (Opc == AMDGPU::S_CBRANCH_VCCNZ || Opc == AMDGPU::S_CBRANCH_VCCZ) &&		return (Opc == AMDGPU::S_CBRANCH_VCCNZ || Opc == AMDGPU::S_CBRANCH_VCCZ) &&
!MI.getOperand(1).isUndef();		!MI.getOperand(1).isUndef();
}		}

/// Given wait count encodings checks if LHS is stronger than RHS.
bool SIInsertWaitcnts::isWaitcntStronger(unsigned LHS, unsigned RHS) {
if (AMDGPU::decodeVmcnt(IV, LHS) > AMDGPU::decodeVmcnt(IV, RHS))
return false;
if (AMDGPU::decodeLgkmcnt(IV, LHS) > AMDGPU::decodeLgkmcnt(IV, RHS))
return false;
if (AMDGPU::decodeExpcnt(IV, LHS) > AMDGPU::decodeExpcnt(IV, RHS))
return false;
return true;
}

/// Given wait count encodings create a new encoding which is stronger
/// or equal to both.
unsigned SIInsertWaitcnts::combineWaitcnt(unsigned LHS, unsigned RHS) {
unsigned VmCnt = std::min(AMDGPU::decodeVmcnt(IV, LHS),
AMDGPU::decodeVmcnt(IV, RHS));
unsigned LgkmCnt = std::min(AMDGPU::decodeLgkmcnt(IV, LHS),
AMDGPU::decodeLgkmcnt(IV, RHS));
unsigned ExpCnt = std::min(AMDGPU::decodeExpcnt(IV, LHS),
AMDGPU::decodeExpcnt(IV, RHS));
return AMDGPU::encodeWaitcnt(IV, VmCnt, ExpCnt, LgkmCnt);
}

/// Generate s_waitcnt instruction to be placed before cur_Inst.		/// Generate s_waitcnt instruction to be placed before cur_Inst.
/// Instructions of a given type are returned in order,		/// Instructions of a given type are returned in order,
/// but instructions of different types can complete out of order.		/// but instructions of different types can complete out of order.
/// We rely on this in-order completion		/// We rely on this in-order completion
/// and simply assign a score to the memory access instructions.		/// and simply assign a score to the memory access instructions.
/// We keep track of the active "score bracket" to determine		/// We keep track of the active "score bracket" to determine
/// if an access of a memory read requires an s_waitcnt		/// if an access of a memory read requires an s_waitcnt
/// and if so what the value of each counter is.		/// and if so what the value of each counter is.
/// The "score bracket" is bound by the lower bound and upper bound		/// The "score bracket" is bound by the lower bound and upper bound
/// scores (*_score_LB and *_score_ub respectively).		/// scores (*_score_LB and *_score_ub respectively).
void SIInsertWaitcnts::generateWaitcntInstBefore(		void SIInsertWaitcnts::generateWaitcntInstBefore(
MachineInstr &MI, BlockWaitcntBrackets *ScoreBrackets) {		MachineInstr &MI, BlockWaitcntBrackets *ScoreBrackets,
// To emit, or not to emit - that's the question!		MachineInstr *OldWaitcntInstr) {
// Start with an assumption that there is no need to emit.
unsigned int EmitWaitcnt = 0;

// ForceEmitZeroWaitcnt: force a single s_waitcnt 0 due to hw bug
bool ForceEmitZeroWaitcnt = false;

setForceEmitWaitcnt();		setForceEmitWaitcnt();
bool IsForceEmitWaitcnt = isForceEmitWaitcnt();		bool IsForceEmitWaitcnt = isForceEmitWaitcnt();

if (MI.isDebugInstr())		if (MI.isDebugInstr())
return;		return;

		AMDGPU::Waitcnt Wait;

// See if an s_waitcnt is forced at block entry, or is needed at		// See if an s_waitcnt is forced at block entry, or is needed at
// program end.		// program end.
if (ScoreBrackets->getWaitAtBeginning()) {		if (ScoreBrackets->getWaitAtBeginning()) {
// Note that we have already cleared the state, so we don't need to update		// Note that we have already cleared the state, so we don't need to update
// it.		// it.
ScoreBrackets->clearWaitAtBeginning();		ScoreBrackets->clearWaitAtBeginning();
for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;		Wait = AMDGPU::Waitcnt::allZero();
T = (enum InstCounterType)(T + 1)) {
EmitWaitcnt |= CNT_MASK(T);
ScoreBrackets->setScoreLB(T, ScoreBrackets->getScoreUB(T));
}
}		}

// See if this instruction has a forced S_WAITCNT VM.		// See if this instruction has a forced S_WAITCNT VM.
// TODO: Handle other cases of NeedsWaitcntVmBefore()		// TODO: Handle other cases of NeedsWaitcntVmBefore()
else if (MI.getOpcode() == AMDGPU::BUFFER_WBINVL1 ||		else if (MI.getOpcode() == AMDGPU::BUFFER_WBINVL1 ||
MI.getOpcode() == AMDGPU::BUFFER_WBINVL1_SC ||		MI.getOpcode() == AMDGPU::BUFFER_WBINVL1_SC ||
MI.getOpcode() == AMDGPU::BUFFER_WBINVL1_VOL) {		MI.getOpcode() == AMDGPU::BUFFER_WBINVL1_VOL) {
EmitWaitcnt |=		Wait.VmCnt = 0;
ScoreBrackets->updateByWait(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));
}		}

// All waits must be resolved at call return.		// All waits must be resolved at call return.
// NOTE: this could be improved with knowledge of all call sites or		// NOTE: this could be improved with knowledge of all call sites or
// with knowledge of the called routines.		// with knowledge of the called routines.
if (MI.getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG ||		if (MI.getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG ||
MI.getOpcode() == AMDGPU::S_SETPC_B64_return) {		MI.getOpcode() == AMDGPU::S_SETPC_B64_return) {
for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;		Wait = AMDGPU::Waitcnt::allZero();
T = (enum InstCounterType)(T + 1)) {
if (ScoreBrackets->getScoreUB(T) > ScoreBrackets->getScoreLB(T)) {
ScoreBrackets->setScoreLB(T, ScoreBrackets->getScoreUB(T));
EmitWaitcnt |= CNT_MASK(T);
}
}
}		}
// Resolve vm waits before gs-done.		// Resolve vm waits before gs-done.
else if ((MI.getOpcode() == AMDGPU::S_SENDMSG ||		else if ((MI.getOpcode() == AMDGPU::S_SENDMSG ||
MI.getOpcode() == AMDGPU::S_SENDMSGHALT) &&		MI.getOpcode() == AMDGPU::S_SENDMSGHALT) &&
((MI.getOperand(0).getImm() & AMDGPU::SendMsg::ID_MASK_) ==		((MI.getOperand(0).getImm() & AMDGPU::SendMsg::ID_MASK_) ==
AMDGPU::SendMsg::ID_GS_DONE)) {		AMDGPU::SendMsg::ID_GS_DONE)) {
if (ScoreBrackets->getScoreUB(VM_CNT) > ScoreBrackets->getScoreLB(VM_CNT)) {		Wait.VmCnt = 0;
ScoreBrackets->setScoreLB(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));
EmitWaitcnt |= CNT_MASK(VM_CNT);
}
}		}
#if 0 // TODO: the following blocks of logic when we have fence.		#if 0 // TODO: the following blocks of logic when we have fence.
else if (MI.getOpcode() == SC_FENCE) {		else if (MI.getOpcode() == SC_FENCE) {
const unsigned int group_size =		const unsigned int group_size =
context->shader_info->GetMaxThreadGroupSize();		context->shader_info->GetMaxThreadGroupSize();
// group_size == 0 means thread group size is unknown at compile time		// group_size == 0 means thread group size is unknown at compile time
const bool group_is_multi_wave =		const bool group_is_multi_wave =
(group_size == 0 || group_size > target_info->GetWaveFrontSize());		(group_size == 0 || group_size > target_info->GetWaveFrontSize());
[47 lines hidden]	#endif
// Export & GDS instructions do not read the EXEC mask until after the export		// Export & GDS instructions do not read the EXEC mask until after the export
// is granted (which can occur well after the instruction is issued).		// is granted (which can occur well after the instruction is issued).
// The shader program must flush all EXP operations on the export-count		// The shader program must flush all EXP operations on the export-count
// before overwriting the EXEC mask.		// before overwriting the EXEC mask.
else {		else {
if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {		if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
// Export and GDS are tracked individually, either may trigger a waitcnt		// Export and GDS are tracked individually, either may trigger a waitcnt
// for EXEC.		// for EXEC.
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
EXP_CNT, ScoreBrackets->getEventUB(EXP_GPR_LOCK));		EXP_CNT, ScoreBrackets->getEventUB(EXP_GPR_LOCK), Wait);
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
EXP_CNT, ScoreBrackets->getEventUB(EXP_PARAM_ACCESS));		EXP_CNT, ScoreBrackets->getEventUB(EXP_PARAM_ACCESS), Wait);
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
EXP_CNT, ScoreBrackets->getEventUB(EXP_POS_ACCESS));		EXP_CNT, ScoreBrackets->getEventUB(EXP_POS_ACCESS), Wait);
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
EXP_CNT, ScoreBrackets->getEventUB(GDS_GPR_LOCK));		EXP_CNT, ScoreBrackets->getEventUB(GDS_GPR_LOCK), Wait);
}		}

#if 0 // TODO: the following code to handle CALL.		#if 0 // TODO: the following code to handle CALL.
// The argument passing for CALLs should suffice for VM_CNT and LGKM_CNT.		// The argument passing for CALLs should suffice for VM_CNT and LGKM_CNT.
// However, there is a problem with EXP_CNT, because the call cannot		// However, there is a problem with EXP_CNT, because the call cannot
// easily tell if a register is used in the function, and if it did, then		// easily tell if a register is used in the function, and if it did, then
// the referring instruction would have to have an S_WAITCNT, which is		// the referring instruction would have to have an S_WAITCNT, which is
// dependent on all call sites. So Instead, force S_WAITCNT for EXP_CNTs		// dependent on all call sites. So Instead, force S_WAITCNT for EXP_CNTs
[15 lines hidden]	#endif
// If the source operand was defined by a load, add the s_waitcnt		// If the source operand was defined by a load, add the s_waitcnt
// instruction.		// instruction.
for (const MachineMemOperand *Memop : MI.memoperands()) {		for (const MachineMemOperand *Memop : MI.memoperands()) {
unsigned AS = Memop->getAddrSpace();		unsigned AS = Memop->getAddrSpace();
if (AS != AMDGPUAS::LOCAL_ADDRESS)		if (AS != AMDGPUAS::LOCAL_ADDRESS)
continue;		continue;
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;		unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
// VM_CNT is only relevant to vgpr or LDS.		// VM_CNT is only relevant to vgpr or LDS.
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));		VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT), Wait);
}		}

for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {		for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {
const MachineOperand &Op = MI.getOperand(I);		const MachineOperand &Op = MI.getOperand(I);
const MachineRegisterInfo &MRIA = *MRI;		const MachineRegisterInfo &MRIA = *MRI;
RegInterval Interval =		RegInterval Interval =
ScoreBrackets->getRegInterval(&MI, TII, MRI, TRI, I, false);		ScoreBrackets->getRegInterval(&MI, TII, MRI, TRI, I, false);
for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {		for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
if (TRI->isVGPR(MRIA, Op.getReg())) {		if (TRI->isVGPR(MRIA, Op.getReg())) {
// VM_CNT is only relevant to vgpr or LDS.		// VM_CNT is only relevant to vgpr or LDS.
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));		VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT), Wait);
}		}
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
LGKM_CNT, ScoreBrackets->getRegScore(RegNo, LGKM_CNT));		LGKM_CNT, ScoreBrackets->getRegScore(RegNo, LGKM_CNT), Wait);
}		}
}		}
// End of for loop that looks at all source operands to decide vm_wait_cnt		// End of for loop that looks at all source operands to decide vm_wait_cnt
// and lgk_wait_cnt.		// and lgk_wait_cnt.

// Two cases are handled for destination operands:		// Two cases are handled for destination operands:
// 1) If the destination operand was defined by a load, add the s_waitcnt		// 1) If the destination operand was defined by a load, add the s_waitcnt
// instruction to guarantee the right WAW order.		// instruction to guarantee the right WAW order.
// 2) If a destination operand that was used by a recent export/store ins,		// 2) If a destination operand that was used by a recent export/store ins,
// add s_waitcnt on exp_cnt to guarantee the WAR order.		// add s_waitcnt on exp_cnt to guarantee the WAR order.
if (MI.mayStore()) {		if (MI.mayStore()) {
// FIXME: Should not be relying on memoperands.		// FIXME: Should not be relying on memoperands.
for (const MachineMemOperand *Memop : MI.memoperands()) {		for (const MachineMemOperand *Memop : MI.memoperands()) {
unsigned AS = Memop->getAddrSpace();		unsigned AS = Memop->getAddrSpace();
if (AS != AMDGPUAS::LOCAL_ADDRESS)		if (AS != AMDGPUAS::LOCAL_ADDRESS)
continue;		continue;
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;		unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));		VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT), Wait);
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
EXP_CNT, ScoreBrackets->getRegScore(RegNo, EXP_CNT));		EXP_CNT, ScoreBrackets->getRegScore(RegNo, EXP_CNT), Wait);
}		}
}		}
for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {		for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {
MachineOperand &Def = MI.getOperand(I);		MachineOperand &Def = MI.getOperand(I);
const MachineRegisterInfo &MRIA = *MRI;		const MachineRegisterInfo &MRIA = *MRI;
RegInterval Interval =		RegInterval Interval =
ScoreBrackets->getRegInterval(&MI, TII, MRI, TRI, I, true);		ScoreBrackets->getRegInterval(&MI, TII, MRI, TRI, I, true);
for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {		for (signed RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
if (TRI->isVGPR(MRIA, Def.getReg())) {		if (TRI->isVGPR(MRIA, Def.getReg())) {
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT));		VM_CNT, ScoreBrackets->getRegScore(RegNo, VM_CNT), Wait);
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
EXP_CNT, ScoreBrackets->getRegScore(RegNo, EXP_CNT));		EXP_CNT, ScoreBrackets->getRegScore(RegNo, EXP_CNT), Wait);
}		}
EmitWaitcnt |= ScoreBrackets->updateByWait(		ScoreBrackets->determineWait(
LGKM_CNT, ScoreBrackets->getRegScore(RegNo, LGKM_CNT));		LGKM_CNT, ScoreBrackets->getRegScore(RegNo, LGKM_CNT), Wait);
}		}
} // End of for loop that looks at all dest operands.		} // End of for loop that looks at all dest operands.
}		}

// Check to see if this is an S_BARRIER, and if an implicit S_WAITCNT 0		// Check to see if this is an S_BARRIER, and if an implicit S_WAITCNT 0
// occurs before the instruction. Doing it here prevents any additional		// occurs before the instruction. Doing it here prevents any additional
// S_WAITCNTs from being emitted if the instruction was marked as		// S_WAITCNTs from being emitted if the instruction was marked as
// requiring a WAITCNT beforehand.		// requiring a WAITCNT beforehand.
if (MI.getOpcode() == AMDGPU::S_BARRIER &&		if (MI.getOpcode() == AMDGPU::S_BARRIER &&
!ST->hasAutoWaitcntBeforeBarrier()) {		!ST->hasAutoWaitcntBeforeBarrier()) {
EmitWaitcnt |=		Wait = AMDGPU::Waitcnt::allZero();
ScoreBrackets->updateByWait(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));
EmitWaitcnt |= ScoreBrackets->updateByWait(
EXP_CNT, ScoreBrackets->getScoreUB(EXP_CNT));
EmitWaitcnt |= ScoreBrackets->updateByWait(
LGKM_CNT, ScoreBrackets->getScoreUB(LGKM_CNT));
}		}

// TODO: Remove this work-around, enable the assert for Bug 457939		// TODO: Remove this work-around, enable the assert for Bug 457939
// after fixing the scheduler. Also, the Shader Compiler code is		// after fixing the scheduler. Also, the Shader Compiler code is
// independent of target.		// independent of target.
if (readsVCCZ(MI) && ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS) {		if (readsVCCZ(MI) && ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS) {
if (ScoreBrackets->getScoreLB(LGKM_CNT) <		if (ScoreBrackets->getScoreLB(LGKM_CNT) <
ScoreBrackets->getScoreUB(LGKM_CNT) &&		ScoreBrackets->getScoreUB(LGKM_CNT) &&
ScoreBrackets->hasPendingSMEM()) {		ScoreBrackets->hasPendingSMEM()) {
// Wait on everything, not just LGKM. vccz reads usually come from		Wait.LgkmCnt = 0;
// terminators, and we always wait on everything at the end of the
// block, so if we only wait on LGKM here, we might end up with
// another s_waitcnt inserted right after this if there are non-LGKM
// instructions still outstanding.
// FIXME: this is too conservative / the comment is wrong.
// We don't wait on everything at the end of the block and we combine
// waitcnts so we should never have back-to-back waitcnts.
ForceEmitZeroWaitcnt = true;
EmitWaitcnt = true;
}		}
}		}

// Does this operand processing indicate s_wait counter update?		// Early-out if no wait is indicated.
if (EmitWaitcnt || IsForceEmitWaitcnt) {		if (!ScoreBrackets->simplifyWaitcnt(Wait) && !IsForceEmitWaitcnt) {
int CntVal[NUM_INST_CNTS];		if (OldWaitcntInstr) {
		if (TrackedWaitcntSet.count(OldWaitcntInstr)) {
if (ForceEmitZeroWaitcnt || ForceEmitZeroWaitcnts) {		TrackedWaitcntSet.erase(OldWaitcntInstr);
// Force all waitcnts to 0.		OldWaitcntInstr->eraseFromParent();
for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;
T = (enum InstCounterType)(T + 1)) {
ScoreBrackets->setScoreLB(T, ScoreBrackets->getScoreUB(T));
}
CntVal[VM_CNT] = 0;
CntVal[EXP_CNT] = 0;
CntVal[LGKM_CNT] = 0;
} else {		} else {
for (enum InstCounterType T = VM_CNT; T < NUM_INST_CNTS;		int64_t Imm = OldWaitcntInstr->getOperand(0).getImm();
T = (enum InstCounterType)(T + 1)) {		ScoreBrackets->applyWaitcnt(AMDGPU::decodeWaitcnt(IV, Imm));
if (EmitWaitcnt & CNT_MASK(T)) {
int Delta =
ScoreBrackets->getScoreUB(T) - ScoreBrackets->getScoreLB(T);
int MaxDelta = ScoreBrackets->getWaitCountMax(T);
if (Delta >= MaxDelta) {
Delta = -1;
if (T != EXP_CNT) {
ScoreBrackets->setScoreLB(
T, ScoreBrackets->getScoreUB(T) - MaxDelta);
}
EmitWaitcnt &= ~CNT_MASK(T);
}		}
CntVal[T] = Delta;
} else {
// If we are not waiting for a particular counter then encode
// it as -1 which means "don't care."
CntVal[T] = -1;
}		}
		return;
}		}

		if (ForceEmitZeroWaitcnts)
		Wait = AMDGPU::Waitcnt::allZero();

		if (ForceEmitWaitcnt[VM_CNT])
		Wait.VmCnt = 0;
		if (ForceEmitWaitcnt[EXP_CNT])
		Wait.ExpCnt = 0;
		if (ForceEmitWaitcnt[LGKM_CNT])
		Wait.LgkmCnt = 0;

		ScoreBrackets->applyWaitcnt(Wait);

		AMDGPU::Waitcnt OldWait;
		if (OldWaitcntInstr) {
		OldWait =
		AMDGPU::decodeWaitcnt(IV, OldWaitcntInstr->getOperand(0).getImm());
}		}
		if (OldWait.dominates(Wait))
		return;

MachineInstr *OldWaitcnt = ScoreBrackets->getWaitcnt();
int Imm = (!OldWaitcnt) ? 0 : OldWaitcnt->getOperand(0).getImm();
if (!OldWaitcnt ||
(AMDGPU::decodeVmcnt(IV, Imm) !=
(CntVal[VM_CNT] & AMDGPU::getVmcntBitMask(IV))) ||
(AMDGPU::decodeExpcnt(IV, Imm) !=
(CntVal[EXP_CNT] & AMDGPU::getExpcntBitMask(IV))) ||
(AMDGPU::decodeLgkmcnt(IV, Imm) !=
(CntVal[LGKM_CNT] & AMDGPU::getLgkmcntBitMask(IV)))) {
MachineLoop *ContainingLoop = MLI->getLoopFor(MI.getParent());		MachineLoop *ContainingLoop = MLI->getLoopFor(MI.getParent());
if (ContainingLoop) {		if (ContainingLoop) {
MachineBasicBlock *TBB = ContainingLoop->getHeader();		MachineBasicBlock *TBB = ContainingLoop->getHeader();
BlockWaitcntBrackets *ScoreBracket = BlockWaitcntBracketsMap[TBB].get();		BlockWaitcntBrackets *ScoreBracket = BlockWaitcntBracketsMap[TBB].get();
if (!ScoreBracket) {		if (!ScoreBracket) {
assert(!BlockVisitedSet.count(TBB));		assert(!BlockVisitedSet.count(TBB));
BlockWaitcntBracketsMap[TBB] =		BlockWaitcntBracketsMap[TBB] =
llvm::make_unique<BlockWaitcntBrackets>(ST);		llvm::make_unique<BlockWaitcntBrackets>(ST);
ScoreBracket = BlockWaitcntBracketsMap[TBB].get();		ScoreBracket = BlockWaitcntBracketsMap[TBB].get();
}		}
ScoreBracket->setRevisitLoop(true);		ScoreBracket->setRevisitLoop(true);
LLVM_DEBUG(dbgs() << "set-revisit2: Block"		LLVM_DEBUG(dbgs() << "set-revisit2: Block"
<< ContainingLoop->getHeader()->getNumber() << '\n';);		<< ContainingLoop->getHeader()->getNumber() << '\n';);
}		}
}

// Update an existing waitcount, or make a new one.		if (OldWaitcntInstr && !TrackedWaitcntSet.count(OldWaitcntInstr))
unsigned Enc = AMDGPU::encodeWaitcnt(IV,		Wait = Wait.combined(OldWait);
ForceEmitWaitcnt[VM_CNT] ? 0 : CntVal[VM_CNT],
ForceEmitWaitcnt[EXP_CNT] ? 0 : CntVal[EXP_CNT],
ForceEmitWaitcnt[LGKM_CNT] ? 0 : CntVal[LGKM_CNT]);
// We don't remove waitcnts that existed prior to the waitcnt
// pass. Check if the waitcnt to-be-inserted can be avoided
// or if the prev waitcnt can be updated.
bool insertSWaitInst = true;
for (MachineBasicBlock::iterator I = MI.getIterator(),
B = MI.getParent()->begin();
insertSWaitInst && I != B; --I) {
if (I == MI.getIterator())
continue;

switch (I->getOpcode()) {		unsigned Enc = AMDGPU::encodeWaitcnt(IV, Wait);
case AMDGPU::S_WAITCNT:		if (OldWaitcntInstr) {
if (isWaitcntStronger(I->getOperand(0).getImm(), Enc))		OldWaitcntInstr->getOperand(0).setImm(Enc);
insertSWaitInst = false;
else if (!OldWaitcnt) {
OldWaitcnt = &*I;
Enc = combineWaitcnt(I->getOperand(0).getImm(), Enc);
}
break;
// TODO: skip over instructions which never require wait.
}
break;
}
if (insertSWaitInst) {
if (OldWaitcnt) {
assert(OldWaitcnt->getOpcode() == AMDGPU::S_WAITCNT);
if (ForceEmitZeroWaitcnts)
LLVM_DEBUG(dbgs()
<< "Force emit s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)\n");
if (IsForceEmitWaitcnt)
LLVM_DEBUG(dbgs() << "Force emit a s_waitcnt due to debug counter\n");

OldWaitcnt->getOperand(0).setImm(Enc);
if (!OldWaitcnt->getParent())
MI.getParent()->insert(MI, OldWaitcnt);

LLVM_DEBUG(dbgs() << "updateWaitcntInBlock\n"		LLVM_DEBUG(dbgs() << "updateWaitcntInBlock\n"
<< "Old Instr: " << MI << '\n'		<< "Old Instr: " << MI << '\n'
<< "New Instr: " << *OldWaitcnt << '\n');		<< "New Instr: " << *OldWaitcntInstr << '\n');
} else {		} else {
auto SWaitInst = BuildMI(*MI.getParent(), MI.getIterator(),		auto SWaitInst = BuildMI(*MI.getParent(), MI.getIterator(),
MI.getDebugLoc(), TII->get(AMDGPU::S_WAITCNT))		MI.getDebugLoc(), TII->get(AMDGPU::S_WAITCNT))
.addImm(Enc);		.addImm(Enc);
TrackedWaitcntSet.insert(SWaitInst);		TrackedWaitcntSet.insert(SWaitInst);

LLVM_DEBUG(dbgs() << "insertWaitcntInBlock\n"		LLVM_DEBUG(dbgs() << "insertWaitcntInBlock\n"
<< "Old Instr: " << MI << '\n'		<< "Old Instr: " << MI << '\n'
<< "New Instr: " << *SWaitInst << '\n');		<< "New Instr: " << *SWaitInst << '\n');
}		}
}		}

if (CntVal[EXP_CNT] == 0) {
ScoreBrackets->setMixedExpTypes(false);
}
}
}
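
The order of checks in the rewritten generateWaitcntInstBefore (right-hand column above) can be read as a small decision function. The following is only a sketch of that reading, not code from the patch: `Action`, `decide`, `WaitNeeded` and `OldIsTracked` are hypothetical names, `WaitNeeded` stands in for the `simplifyWaitcnt(Wait) || IsForceEmitWaitcnt` condition, and the score-bracket updates and loop-revisit handling are left out.

```cpp
#include <algorithm>
#include <cassert>

// Trimmed standalone copy of the decoded-wait struct introduced by this diff.
struct Waitcnt {
  unsigned VmCnt = ~0u, ExpCnt = ~0u, LgkmCnt = ~0u; // ~0u = "don't care"

  bool dominates(const Waitcnt &O) const {
    return VmCnt <= O.VmCnt && ExpCnt <= O.ExpCnt && LgkmCnt <= O.LgkmCnt;
  }
  Waitcnt combined(const Waitcnt &O) const {
    Waitcnt R;
    R.VmCnt = std::min(VmCnt, O.VmCnt);
    R.ExpCnt = std::min(ExpCnt, O.ExpCnt);
    R.LgkmCnt = std::min(LgkmCnt, O.LgkmCnt);
    return R;
  }
};

// Hypothetical result type: what happens to the instruction stream.
enum class Action { Nothing, EraseOld, KeepOld, UpdateOld, InsertNew };

// Old is null when no s_waitcnt directly precedes the instruction.
// OldIsTracked says whether the pass itself created Old on an earlier visit.
// Wait is updated in place when it has to absorb a pre-existing wait.
inline Action decide(bool WaitNeeded, Waitcnt &Wait, const Waitcnt *Old,
                     bool OldIsTracked) {
  if (!WaitNeeded) {
    if (!Old)
      return Action::Nothing;
    // Pass-generated waits are dropped; pre-existing ones are preserved.
    return OldIsTracked ? Action::EraseOld : Action::KeepOld;
  }

  Waitcnt OldWait = Old ? *Old : Waitcnt();
  if (OldWait.dominates(Wait))
    return Old ? Action::KeepOld : Action::Nothing;

  // A pre-existing wait is never weakened, only strengthened.
  if (Old && !OldIsTracked)
    Wait = Wait.combined(OldWait);

  return Old ? Action::UpdateOld : Action::InsertNew;
}
```

Under this reading, a pre-existing `vmcnt(0)` followed by a required `lgkmcnt(0)` ends up as one updated instruction waiting on both counters, whereas a pass-generated `vmcnt(0)` would simply be overwritten with the new requirement.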

void SIInsertWaitcnts::insertWaitcntBeforeCF(MachineBasicBlock &MBB,		void SIInsertWaitcnts::insertWaitcntBeforeCF(MachineBasicBlock &MBB,
MachineInstr *Waitcnt) {		MachineInstr *Waitcnt) {
if (MBB.empty()) {		if (MBB.empty()) {
MBB.push_back(Waitcnt);		MBB.push_back(Waitcnt);
return;		return;
}		}

MachineBasicBlock::iterator It = MBB.end();		MachineBasicBlock::iterator It = MBB.end();
[... 303 lines not shown ...]	void SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
BlockWaitcntBrackets *ScoreBrackets = BlockWaitcntBracketsMap[&Block].get();		BlockWaitcntBrackets *ScoreBrackets = BlockWaitcntBracketsMap[&Block].get();

LLVM_DEBUG({		LLVM_DEBUG({
dbgs() << "* Block" << Block.getNumber() << " *";		dbgs() << "* Block" << Block.getNumber() << " *";
ScoreBrackets->dump();		ScoreBrackets->dump();
});		});

// Walk over the instructions.		// Walk over the instructions.
		MachineInstr *OldWaitcntInstr = nullptr;

for (MachineBasicBlock::iterator Iter = Block.begin(), E = Block.end();		for (MachineBasicBlock::iterator Iter = Block.begin(), E = Block.end();
Iter != E;) {		Iter != E;) {
MachineInstr &Inst = *Iter;		MachineInstr &Inst = *Iter;

// Remove any previously existing waitcnts.		// Remove any previously existing waitcnts.
if (Inst.getOpcode() == AMDGPU::S_WAITCNT) {		if (Inst.getOpcode() == AMDGPU::S_WAITCNT) {
// Leave pre-existing waitcnts, but note their existence via setWaitcnt.		if (OldWaitcntInstr) {
// Remove the waitcnt-pass-generated waitcnts; the pass will add them back		if (TrackedWaitcntSet.count(OldWaitcntInstr)) {
// as needed.		TrackedWaitcntSet.erase(OldWaitcntInstr);
if (!TrackedWaitcntSet.count(&Inst))		OldWaitcntInstr->eraseFromParent();
++Iter;		OldWaitcntInstr = nullptr;
else {		} else if (!TrackedWaitcntSet.count(&Inst)) {
		// Two successive s_waitcnt's, both of which are pre-existing and
		// are therefore preserved.
		int64_t Imm = OldWaitcntInstr->getOperand(0).getImm();
		ScoreBrackets->applyWaitcnt(AMDGPU::decodeWaitcnt(IV, Imm));
		} else {
++Iter;		++Iter;
Inst.removeFromParent();		Inst.eraseFromParent();
		continue;
}		}
ScoreBrackets->setWaitcnt(&Inst);		}

		OldWaitcntInstr = &Inst;
		++Iter;
continue;		continue;
}		}

bool VCCZBugWorkAround = false;		bool VCCZBugWorkAround = false;
if (readsVCCZ(Inst) &&		if (readsVCCZ(Inst) &&
(!VCCZBugHandledSet.count(&Inst))) {		(!VCCZBugHandledSet.count(&Inst))) {
if (ScoreBrackets->getScoreLB(LGKM_CNT) <		if (ScoreBrackets->getScoreLB(LGKM_CNT) <
ScoreBrackets->getScoreUB(LGKM_CNT) &&		ScoreBrackets->getScoreUB(LGKM_CNT) &&
ScoreBrackets->hasPendingSMEM()) {		ScoreBrackets->hasPendingSMEM()) {
if (ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS)		if (ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS)
VCCZBugWorkAround = true;		VCCZBugWorkAround = true;
}		}
}		}

// Generate an s_waitcnt instruction to be placed before		// Generate an s_waitcnt instruction to be placed before
// cur_Inst, if needed.		// cur_Inst, if needed.
generateWaitcntInstBefore(Inst, ScoreBrackets);		generateWaitcntInstBefore(Inst, ScoreBrackets, OldWaitcntInstr);
		OldWaitcntInstr = nullptr;

updateEventWaitcntAfter(Inst, ScoreBrackets);		updateEventWaitcntAfter(Inst, ScoreBrackets);

#if 0 // TODO: implement resource type check controlled by options with ub = LB.		#if 0 // TODO: implement resource type check controlled by options with ub = LB.
// If this instruction generates a S_SETVSKIP because it is an		// If this instruction generates a S_SETVSKIP because it is an
// indexed resource, and we are on Tahiti, then it will also force		// indexed resource, and we are on Tahiti, then it will also force
// an S_WAITCNT vmcnt(0)		// an S_WAITCNT vmcnt(0)
if (RequireCheckResourceType(Inst, context)) {		if (RequireCheckResourceType(Inst, context)) {
// Force the score to as if an S_WAITCNT vmcnt(0) is emitted.		// Force the score to as if an S_WAITCNT vmcnt(0) is emitted.
ScoreBrackets->setScoreLB(VM_CNT,		ScoreBrackets->setScoreLB(VM_CNT,
ScoreBrackets->getScoreUB(VM_CNT));		ScoreBrackets->getScoreUB(VM_CNT));
}		}
#endif		#endif

ScoreBrackets->clearWaitcnt();

LLVM_DEBUG({		LLVM_DEBUG({
Inst.print(dbgs());		Inst.print(dbgs());
ScoreBrackets->dump();		ScoreBrackets->dump();
});		});

// Check to see if this is a GWS instruction. If so, and if this is CI or		// Check to see if this is a GWS instruction. If so, and if this is CI or
// VI, then the generated code sequence will include an S_WAITCNT 0.		// VI, then the generated code sequence will include an S_WAITCNT 0.
// TODO: Are these the only GWS instructions?		// TODO: Are these the only GWS instructions?
if (Inst.getOpcode() == AMDGPU::DS_GWS_INIT ||		if (Inst.getOpcode() == AMDGPU::DS_GWS_INIT ||
Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_V ||		Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_V ||
Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_BR ||		Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_BR ||
Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_P ||		Inst.getOpcode() == AMDGPU::DS_GWS_SEMA_P ||
Inst.getOpcode() == AMDGPU::DS_GWS_BARRIER) {		Inst.getOpcode() == AMDGPU::DS_GWS_BARRIER) {
// TODO: && context->target_info->GwsRequiresMemViolTest() ) {		// TODO: && context->target_info->GwsRequiresMemViolTest() ) {
ScoreBrackets->updateByWait(VM_CNT, ScoreBrackets->getScoreUB(VM_CNT));		ScoreBrackets->applyWaitcnt(AMDGPU::Waitcnt::allZero());
ScoreBrackets->updateByWait(EXP_CNT, ScoreBrackets->getScoreUB(EXP_CNT));
ScoreBrackets->updateByWait(LGKM_CNT,
ScoreBrackets->getScoreUB(LGKM_CNT));
}		}

// TODO: Remove this work-around after fixing the scheduler and enable the		// TODO: Remove this work-around after fixing the scheduler and enable the
// assert above.		// assert above.
if (VCCZBugWorkAround) {		if (VCCZBugWorkAround) {
// Restore the vccz bit. Any time a value is written to vcc, the vcc		// Restore the vccz bit. Any time a value is written to vcc, the vcc
// bit is updated, so we can restore the bit by reading the value of		// bit is updated, so we can restore the bit by reading the value of
// vcc and then writing it back to the register.		// vcc and then writing it back to the register.
[... 254 lines not shown ...]

llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h

	[... 252 lines not shown ...]
	/// \returns \p Default and emits error if one of the requested values cannot be			/// \returns \p Default and emits error if one of the requested values cannot be
	/// converted to integer, or \p OnlyFirstRequired is false and "second" value is			/// converted to integer, or \p OnlyFirstRequired is false and "second" value is
	/// not present.			/// not present.
	std::pair<int, int> getIntegerPairAttribute(const Function &F,			std::pair<int, int> getIntegerPairAttribute(const Function &F,
	StringRef Name,			StringRef Name,
	std::pair<int, int> Default,			std::pair<int, int> Default,
	bool OnlyFirstRequired = false);			bool OnlyFirstRequired = false);

				/// Represents the counter values to wait for in an s_waitcnt instruction.
				///
				/// Large values (including the maximum possible integer) can be used to
				/// represent "don't care" waits.
				struct Waitcnt {
				unsigned VmCnt = ~0u;
				unsigned ExpCnt = ~0u;
				unsigned LgkmCnt = ~0u;

				Waitcnt() {}
				Waitcnt(unsigned VmCnt, unsigned ExpCnt, unsigned LgkmCnt)
				: VmCnt(VmCnt), ExpCnt(ExpCnt), LgkmCnt(LgkmCnt) {}

				static Waitcnt allZero() { return Waitcnt(0, 0, 0); }

				bool dominates(const Waitcnt &Other) const {
				return VmCnt <= Other.VmCnt && ExpCnt <= Other.ExpCnt &&
				LgkmCnt <= Other.LgkmCnt;
				}

				Waitcnt combined(const Waitcnt &Other) const {
				return Waitcnt(std::min(VmCnt, Other.VmCnt), std::min(ExpCnt, Other.ExpCnt),
				std::min(LgkmCnt, Other.LgkmCnt));
				}
				};

	/// \returns Vmcnt bit mask for given isa \p Version.			/// \returns Vmcnt bit mask for given isa \p Version.
	unsigned getVmcntBitMask(const IsaVersion &Version);			unsigned getVmcntBitMask(const IsaVersion &Version);

	/// \returns Expcnt bit mask for given isa \p Version.			/// \returns Expcnt bit mask for given isa \p Version.
	unsigned getExpcntBitMask(const IsaVersion &Version);			unsigned getExpcntBitMask(const IsaVersion &Version);

	/// \returns Lgkmcnt bit mask for given isa \p Version.			/// \returns Lgkmcnt bit mask for given isa \p Version.
	unsigned getLgkmcntBitMask(const IsaVersion &Version);			unsigned getLgkmcntBitMask(const IsaVersion &Version);
	[... 17 lines not shown ...]
	/// \details \p Vmcnt, \p Expcnt and \p Lgkmcnt are decoded as follows:			/// \details \p Vmcnt, \p Expcnt and \p Lgkmcnt are decoded as follows:
	/// \p Vmcnt = \p Waitcnt[3:0] (pre-gfx9 only)			/// \p Vmcnt = \p Waitcnt[3:0] (pre-gfx9 only)
	/// \p Vmcnt = \p Waitcnt[3:0] | \p Waitcnt[15:14] (gfx9+ only)			/// \p Vmcnt = \p Waitcnt[3:0] | \p Waitcnt[15:14] (gfx9+ only)
	/// \p Expcnt = \p Waitcnt[6:4]			/// \p Expcnt = \p Waitcnt[6:4]
	/// \p Lgkmcnt = \p Waitcnt[11:8]			/// \p Lgkmcnt = \p Waitcnt[11:8]
	void decodeWaitcnt(const IsaVersion &Version, unsigned Waitcnt,			void decodeWaitcnt(const IsaVersion &Version, unsigned Waitcnt,
	unsigned &Vmcnt, unsigned &Expcnt, unsigned &Lgkmcnt);			unsigned &Vmcnt, unsigned &Expcnt, unsigned &Lgkmcnt);

				Waitcnt decodeWaitcnt(const IsaVersion &Version, unsigned Encoded);

	/// \returns \p Waitcnt with encoded \p Vmcnt for given isa \p Version.			/// \returns \p Waitcnt with encoded \p Vmcnt for given isa \p Version.
	unsigned encodeVmcnt(const IsaVersion &Version, unsigned Waitcnt,			unsigned encodeVmcnt(const IsaVersion &Version, unsigned Waitcnt,
	unsigned Vmcnt);			unsigned Vmcnt);

	/// \returns \p Waitcnt with encoded \p Expcnt for given isa \p Version.			/// \returns \p Waitcnt with encoded \p Expcnt for given isa \p Version.
	unsigned encodeExpcnt(const IsaVersion &Version, unsigned Waitcnt,			unsigned encodeExpcnt(const IsaVersion &Version, unsigned Waitcnt,
	unsigned Expcnt);			unsigned Expcnt);

	[... 11 lines not shown ...]
	/// Waitcnt[11:8] = \p Lgkmcnt			/// Waitcnt[11:8] = \p Lgkmcnt
	/// Waitcnt[15:14] = \p Vmcnt[5:4] (gfx9+ only)			/// Waitcnt[15:14] = \p Vmcnt[5:4] (gfx9+ only)
	///			///
	/// \returns Waitcnt with encoded \p Vmcnt, \p Expcnt and \p Lgkmcnt for given			/// \returns Waitcnt with encoded \p Vmcnt, \p Expcnt and \p Lgkmcnt for given
	/// isa \p Version.			/// isa \p Version.
	unsigned encodeWaitcnt(const IsaVersion &Version,			unsigned encodeWaitcnt(const IsaVersion &Version,
	unsigned Vmcnt, unsigned Expcnt, unsigned Lgkmcnt);			unsigned Vmcnt, unsigned Expcnt, unsigned Lgkmcnt);

				unsigned encodeWaitcnt(const IsaVersion &Version, const Waitcnt &Decoded);

	unsigned getInitialPSInputAddr(const Function &F);			unsigned getInitialPSInputAddr(const Function &F);

	LLVM_READNONE			LLVM_READNONE
	bool isShader(CallingConv::ID CC);			bool isShader(CallingConv::ID CC);

	LLVM_READNONE			LLVM_READNONE
	bool isCompute(CallingConv::ID CC);			bool isCompute(CallingConv::ID CC);

	[... 128 lines not shown ...]
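
In the `Waitcnt` struct added above, a smaller counter value is a stronger wait and `~0u` means "don't care", so `dominates` is a component-wise `<=` and `combined` is a component-wise minimum. A minimal standalone sketch of those semantics follows; the struct is copied out of its LLVM namespace for illustration, and `demoWaitcnt` with its sample values is a hypothetical addition, not part of the patch.

```cpp
#include <algorithm>
#include <cassert>

// Standalone copy of the struct from the diff, for illustration only.
struct Waitcnt {
  unsigned VmCnt = ~0u; // ~0u means "don't care" for each counter
  unsigned ExpCnt = ~0u;
  unsigned LgkmCnt = ~0u;

  Waitcnt() {}
  Waitcnt(unsigned VmCnt, unsigned ExpCnt, unsigned LgkmCnt)
      : VmCnt(VmCnt), ExpCnt(ExpCnt), LgkmCnt(LgkmCnt) {}

  static Waitcnt allZero() { return Waitcnt(0, 0, 0); }

  // True if this wait is at least as strong as Other on every counter.
  bool dominates(const Waitcnt &Other) const {
    return VmCnt <= Other.VmCnt && ExpCnt <= Other.ExpCnt &&
           LgkmCnt <= Other.LgkmCnt;
  }

  // Weakest single wait that is at least as strong as both inputs.
  Waitcnt combined(const Waitcnt &Other) const {
    return Waitcnt(std::min(VmCnt, Other.VmCnt), std::min(ExpCnt, Other.ExpCnt),
                   std::min(LgkmCnt, Other.LgkmCnt));
  }
};

// Hypothetical demo with made-up sample values.
inline void demoWaitcnt() {
  Waitcnt Lgkm0(~0u, ~0u, 0); // s_waitcnt lgkmcnt(0)
  Waitcnt Lgkm1(~0u, ~0u, 1); // s_waitcnt lgkmcnt(1)
  assert(Lgkm0.dominates(Lgkm1));
  assert(!Lgkm1.dominates(Lgkm0));

  Waitcnt Vm3(3, ~0u, ~0u); // s_waitcnt vmcnt(3)
  Waitcnt Both = Vm3.combined(Lgkm1);
  assert(Both.VmCnt == 3 && Both.ExpCnt == ~0u && Both.LgkmCnt == 1);
  assert(Both.dominates(Vm3) && Both.dominates(Lgkm1));
}
```

Because the default-constructed value is all "don't care", it dominates nothing but itself, which is why a missing old instruction can be represented by a default `Waitcnt` in the pass.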

llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp

[... 516 lines not shown ...]

void decodeWaitcnt(const IsaVersion &Version, unsigned Waitcnt,		void decodeWaitcnt(const IsaVersion &Version, unsigned Waitcnt,
unsigned &Vmcnt, unsigned &Expcnt, unsigned &Lgkmcnt) {		unsigned &Vmcnt, unsigned &Expcnt, unsigned &Lgkmcnt) {
Vmcnt = decodeVmcnt(Version, Waitcnt);		Vmcnt = decodeVmcnt(Version, Waitcnt);
Expcnt = decodeExpcnt(Version, Waitcnt);		Expcnt = decodeExpcnt(Version, Waitcnt);
Lgkmcnt = decodeLgkmcnt(Version, Waitcnt);		Lgkmcnt = decodeLgkmcnt(Version, Waitcnt);
}		}

		Waitcnt decodeWaitcnt(const IsaVersion &Version, unsigned Encoded) {
		Waitcnt Decoded;
		Decoded.VmCnt = decodeVmcnt(Version, Encoded);
		Decoded.ExpCnt = decodeExpcnt(Version, Encoded);
		Decoded.LgkmCnt = decodeLgkmcnt(Version, Encoded);
		return Decoded;
		}

unsigned encodeVmcnt(const IsaVersion &Version, unsigned Waitcnt,		unsigned encodeVmcnt(const IsaVersion &Version, unsigned Waitcnt,
unsigned Vmcnt) {		unsigned Vmcnt) {
Waitcnt =		Waitcnt =
packBits(Vmcnt, Waitcnt, getVmcntBitShiftLo(), getVmcntBitWidthLo());		packBits(Vmcnt, Waitcnt, getVmcntBitShiftLo(), getVmcntBitWidthLo());
if (Version.Major < 9)		if (Version.Major < 9)
return Waitcnt;		return Waitcnt;

Vmcnt >>= getVmcntBitWidthLo();		Vmcnt >>= getVmcntBitWidthLo();
[... 14 lines not shown ...]	unsigned encodeWaitcnt(const IsaVersion &Version,
unsigned Vmcnt, unsigned Expcnt, unsigned Lgkmcnt) {		unsigned Vmcnt, unsigned Expcnt, unsigned Lgkmcnt) {
unsigned Waitcnt = getWaitcntBitMask(Version);		unsigned Waitcnt = getWaitcntBitMask(Version);
Waitcnt = encodeVmcnt(Version, Waitcnt, Vmcnt);		Waitcnt = encodeVmcnt(Version, Waitcnt, Vmcnt);
Waitcnt = encodeExpcnt(Version, Waitcnt, Expcnt);		Waitcnt = encodeExpcnt(Version, Waitcnt, Expcnt);
Waitcnt = encodeLgkmcnt(Version, Waitcnt, Lgkmcnt);		Waitcnt = encodeLgkmcnt(Version, Waitcnt, Lgkmcnt);
return Waitcnt;		return Waitcnt;
}		}

		unsigned encodeWaitcnt(const IsaVersion &Version, const Waitcnt &Decoded) {
		return encodeWaitcnt(Version, Decoded.VmCnt, Decoded.ExpCnt, Decoded.LgkmCnt);
		}

unsigned getInitialPSInputAddr(const Function &F) {		unsigned getInitialPSInputAddr(const Function &F) {
return getIntegerAttribute(F, "InitialPSInputAddr", 0);		return getIntegerAttribute(F, "InitialPSInputAddr", 0);
}		}

bool isShader(CallingConv::ID cc) {		bool isShader(CallingConv::ID cc) {
switch(cc) {		switch(cc) {
case CallingConv::AMDGPU_VS:		case CallingConv::AMDGPU_VS:
case CallingConv::AMDGPU_LS:		case CallingConv::AMDGPU_LS:
[... 393 lines not shown ...]
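
To relate the raw immediates in the tests of this diff to counter values, here is a small sketch of the pre-gfx9 field layout documented in AMDGPUBaseInfo.h (Vmcnt in bits 3:0, Expcnt in bits 6:4, Lgkmcnt in bits 11:8). `DecodedWait`, `field`, `decodePreGfx9` and `encodePreGfx9` are hypothetical names, the gfx9+ high Vmcnt bits are ignored, and truncating out-of-range values to the field width is an assumption about how "don't care" counters end up as all-ones fields.

```cpp
#include <cassert>

// Decoded counter values for a pre-gfx9 s_waitcnt immediate.
struct DecodedWait {
  unsigned VmCnt, ExpCnt, LgkmCnt;
};

// Extract Width bits starting at bit Shift.
constexpr unsigned field(unsigned Enc, unsigned Shift, unsigned Width) {
  return (Enc >> Shift) & ((1u << Width) - 1);
}

// Layout from the header comments: Vmcnt = [3:0], Expcnt = [6:4],
// Lgkmcnt = [11:8]. The gfx9+ bits [15:14] are not handled here.
inline DecodedWait decodePreGfx9(unsigned Enc) {
  return {field(Enc, 0, 4), field(Enc, 4, 3), field(Enc, 8, 4)};
}

// Assumption: values wider than a field are truncated, so ~0u ("don't
// care") becomes the all-ones maximum for that field.
inline unsigned encodePreGfx9(unsigned VmCnt, unsigned ExpCnt,
                              unsigned LgkmCnt) {
  return (VmCnt & 0xF) | ((ExpCnt & 0x7) << 4) | ((LgkmCnt & 0xF) << 8);
}
```

With this layout, `encodePreGfx9(~0u, ~0u, 0)` is 127, that is `lgkmcnt(0)` with the other two counters left at their maximum; `encodePreGfx9(~0u, ~0u, 1)` is 383; and an immediate of 0 waits for all three counters to reach zero.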

llvm/trunk/test/CodeGen/AMDGPU/smrd-vccz-bug.ll

	; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s			; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
	; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s			; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=NOVCCZ-BUG %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=NOVCCZ-BUG %s

	; GCN-FUNC: {{^}}vccz_workaround:			; GCN-FUNC: {{^}}vccz_workaround:
	; GCN: s_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x0			; GCN: s_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x0
	; GCN: v_cmp_neq_f32_e64 {{[^,]*}}, s{{[0-9]+}}, 0{{$}}			; GCN: v_cmp_neq_f32_e64 {{[^,]*}}, s{{[0-9]+}}, 0{{$}}
	; VCCZ-BUG: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; VCCZ-BUG: s_waitcnt lgkmcnt(0)
	; VCCZ-BUG: s_mov_b64 vcc, vcc			; VCCZ-BUG: s_mov_b64 vcc, vcc
	; NOVCCZ-BUG-NOT: s_mov_b64 vcc, vcc			; NOVCCZ-BUG-NOT: s_mov_b64 vcc, vcc
	; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]			; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]
	; GCN: buffer_store_dword			; GCN: buffer_store_dword
	; GCN: [[EXIT]]:			; GCN: [[EXIT]]:
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @vccz_workaround(i32 addrspace(4)* %in, i32 addrspace(1)* %out, float %cond) {			define amdgpu_kernel void @vccz_workaround(i32 addrspace(4)* %in, i32 addrspace(1)* %out, float %cond) {
	entry:			entry:
	[... 31 lines not shown ...]

llvm/trunk/test/CodeGen/AMDGPU/vccz-corrupt-bug-workaround.mir

[... 41 lines not shown ...]	--- |
attributes #1 = { readnone }		attributes #1 = { readnone }

!0 = !{}		!0 = !{}

...		...
---		---
# CHECK-LABEL: name: vccz_corrupt_workaround		# CHECK-LABEL: name: vccz_corrupt_workaround
# CHECK: $vcc = V_CMP_EQ_F32		# CHECK: $vcc = V_CMP_EQ_F32
# CHECK-NEXT: S_WAITCNT 0		# CHECK-NEXT: S_WAITCNT 127
# CHECK-NEXT: $vcc = S_MOV_B64 $vcc		# CHECK-NEXT: $vcc = S_MOV_B64 $vcc
# CHECK-NEXT: S_CBRANCH_VCCZ %bb.2, implicit killed $vcc		# CHECK-NEXT: S_CBRANCH_VCCZ %bb.2, implicit killed $vcc

name: vccz_corrupt_workaround		name: vccz_corrupt_workaround
alignment: 0		alignment: 0
exposesReturnsTwice: false		exposesReturnsTwice: false
legalized: false		legalized: false
regBankSelected: false		regBankSelected: false
[... 113 lines not shown ...]

llvm/trunk/test/CodeGen/AMDGPU/waitcnt-preexisting.mir

	# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts -o - %s | FileCheck -check-prefixes=GCN %s			# RUN: llc -march=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts -o - %s | FileCheck -check-prefixes=GCN %s

	# GCN-LABEL: name: test{{$}}			# GCN-LABEL: name: test{{$}}
	# GCN: S_WAITCNT -16257			# GCN: S_WAITCNT -16257
	# GGN: DS_READ2_B32			# GGN: DS_READ2_B32
	# GGN: DS_READ2_B32			# GGN: DS_READ2_B32
	# GCN: S_WAITCNT 127{{$}}			# GCN: S_WAITCNT 383{{$}}
	# GCN-NEXT: $vgpr1 = V_ADD_U32_e32 1, killed $vgpr1, implicit $exec			# GCN-NEXT: $vgpr1 = V_ADD_U32_e32 1, killed $vgpr1, implicit $exec
	# GCN-NEXT: $vgpr1 = V_MAX_U32_e32 killed $vgpr0, killed $vgpr1, implicit $exec			# GCN-NEXT: $vgpr1 = V_MAX_U32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
				# GCN-NEXT: S_WAITCNT 127{{$}}
	# GCN-NEXT: $vgpr1 = V_MAX_U32_e32 killed $vgpr2, killed $vgpr1, implicit $exec			# GCN-NEXT: $vgpr1 = V_MAX_U32_e32 killed $vgpr2, killed $vgpr1, implicit $exec
	--- |			--- |
	define amdgpu_cs void @test() {			define amdgpu_cs void @test() {
	ret void			ret void
	}			}
	...			...
	---			---
	name: test			name: test
	[... 19 lines not shown ...]
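
The updated CHECK lines in waitcnt-preexisting.mir (a wait of 383 before the first use, then 127 before the later one) match the summary's point that the pass now emits `lgkmcnt(1)` followed by `lgkmcnt(0)` instead of one merged wait. The following is a simplified model of a single in-order counter that reproduces those two values; `CounterModel` and its members are hypothetical names, and the real bookkeeping in the pass's score brackets is more involved than this.

```cpp
#include <cassert>

// Simplified model of one in-order hardware counter (hypothetical names).
struct CounterModel {
  unsigned LB = 0; // score of the newest operation known to be complete
  unsigned UB = 0; // score of the newest operation issued

  // Issue one operation and return its score.
  unsigned issue() { return ++UB; }

  // Largest counter value that still guarantees the operation with the
  // given score has completed, or ~0u if no wait is needed for it.
  unsigned requiredWait(unsigned Score) const {
    if (Score <= LB || Score > UB)
      return ~0u;
    return UB - Score;
  }

  // Record that an s_waitcnt with this counter value was executed: at most
  // Count operations can still be outstanding afterwards.
  void applyWait(unsigned Count) {
    if (Count < UB - LB)
      LB = UB - Count;
  }
};
```

With two loads issued back to back, the first result needs a wait value of 1 and, once that wait is applied, the second result needs a wait value of 0; after both waits no further wait is required for either.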