This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/InsertWaitcnts: Simplify pending events tracking
Closed, Public

Authored by nhaehnle on Nov 7 2018, 2:16 PM.


Details

Reviewers

msearles
rampitec
scott.linder
kanarayan

Commits

rGd1f45dad84f8: AMDGPU/InsertWaitcnts: Simplify pending events tracking
rL347850: AMDGPU/InsertWaitcnts: Simplify pending events tracking

Summary

Instead of storing the "score" (last time point) of the various relevant
events, only store whether an event is pending or not.

This is sufficient, because whenever only one event of a count type is
pending, its last time point is naturally the upper bound of all time
points of this count type, and when multiple event types are pending,
the count type has gone out of order and an s_waitcnt to 0 is required
to clear any pending event type (and will then clear all pending event
types for that count type).
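
To make the argument above concrete, here is a minimal, self-contained sketch of the pending-events idea. It is a simplified model and not the code in the patch: the enums are reduced stand-ins, and `PendingTracker`, `EventMask`, `recordEvent` and `waitForZero` are hypothetical names chosen for illustration. Only `PendingEvents`, `MixedPendingEvents`, `hasPendingEvent` and `counterOutOfOrder` mirror names that appear in the diff.

```cpp
#include <cstdint>

// Simplified stand-ins for the enums in SIInsertWaitcnts.cpp; the real
// pass has more event types and more surrounding state.
enum InstCounterType { VM_CNT, LGKM_CNT, EXP_CNT, NUM_INST_CNTS };
enum WaitEventType {
  VMEM_ACCESS,
  LDS_ACCESS,
  GDS_ACCESS,
  SMEM_ACCESS,
  EXP_POS_ACCESS,
  EXP_PARAM_ACCESS,
  NUM_WAIT_EVENTS
};

// Which event types are counted by which hardware counter.
constexpr uint32_t EventMask[NUM_INST_CNTS] = {
    (1u << VMEM_ACCESS),
    (1u << LDS_ACCESS) | (1u << GDS_ACCESS) | (1u << SMEM_ACCESS),
    (1u << EXP_POS_ACCESS) | (1u << EXP_PARAM_ACCESS),
};

// Hypothetical helper type, not part of the pass.
struct PendingTracker {
  uint32_t PendingEvents = 0;
  bool MixedPendingEvents[NUM_INST_CNTS] = {false, false, false};

  bool hasPendingEvent(WaitEventType E) const {
    return (PendingEvents & (1u << E)) != 0;
  }

  // Record that an instruction producing event E on counter T was issued.
  void recordEvent(InstCounterType T, WaitEventType E) {
    if (hasPendingEvent(E))
      return;
    // A different event type is already pending on this counter, so the
    // counter may now decrement out of order.
    if (PendingEvents & EventMask[T])
      MixedPendingEvents[T] = true;
    PendingEvents |= 1u << E;
  }

  // Model the effect of an s_waitcnt to 0 on counter T: every event type
  // of that counter is cleared at once.
  void waitForZero(InstCounterType T) {
    MixedPendingEvents[T] = false;
    PendingEvents &= ~EventMask[T];
  }

  bool counterOutOfOrder(InstCounterType T) const {
    // Scalar memory reads can always return out of order.
    if (T == LGKM_CNT && hasPendingEvent(SMEM_ACCESS))
      return true;
    return MixedPendingEvents[T];
  }
};
```

In this model, recording two LDS events leaves `counterOutOfOrder(LGKM_CNT)` false, adding a GDS event makes it true, and `waitForZero(LGKM_CNT)` resets both the mixed flag and the pending bits for that counter.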

This also removes the special handling of GDS_GPR_LOCK and EXP_GPR_LOCK.
I do not understand what this special handling ever attempted to achieve.
It has existed ever since the original port from an internal code base,
so my best guess is that it solved a problem related to EXEC handling in
that internal code base.

Diff Detail

Repository: rL LLVM

Event Timeline

nhaehnle created this revision. Nov 7 2018, 2:16 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Nov 7 2018, 2:16 PM

Harbormaster completed remote builds in B24694: Diff 173029. Nov 7 2018, 2:16 PM

nhaehnle added a child revision: D54229: AMDGPU/InsertWaitcnt: Remove unused WaitAtBeginning. Nov 7 2018, 2:17 PM

nhaehnle added a parent revision: D54227: AMDGPU/InsertWaitcnts: Use foreach loops for inst and wait event types.

This is sufficient, because whenever only one event of a count type is
pending, its last time point is naturally the upper bound of all time
points of this count type, and when multiple event types are pending,
the count type has gone out of order and an s_waitcnt to 0 is required
to clear any pending event type (and will then clear all pending event
types for that count type).

Just wondered if we can do better than using 0. Instead, can the lowest count be used, as this should be sufficient to ensure all out-of-order events in this have happened? I had discussed this with Bob at one time.

In D54228#1290997, @t-tye wrote:

Just wondered if we can do better than using 0. Instead, can the lowest count be used, as this should be sufficient to ensure all out-of-order events in this have happened? I had discussed this with Bob at one time.

Hmm, how would that work? What lowest count are you referring to? For example, if lgkm has both in-flight SMEM read, and in-flight LDS, we could either have all SMEM read finish first or all LDS finish first.

Something that we could do is a more finely-grained tracking of in-order events. For example, if we have both in-flight SMEM and in-flight LDS, and we need to wait for the second-to-last LDS, then in fact we could do an lgkmcnt(1) wait -- because if the counter reaches 1 or less, the second-to-last LDS must have returned. After the lgkmcnt(1), we still need to conservatively assume that any event type that was previously in-flight may still be in-flight, so this patch here is compatible with such a more finely-grained tracking.

I think the finer-grained tracking could be achieved by introducing separate timelines for each event type: currently we only have timelines by counter. Anyway, it'd be a separate change, mainly for the benefit of mixing LDS and SMEM I think.
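
The lgkmcnt(1) claim above can be checked by brute force on a toy model. The sketch below is only an illustration under stated assumptions: LDS operations complete in issue order, SMEM operations may complete in any order, and the counter equals the number of outstanding operations. The function `waitSuffices` and its parameters are hypothetical and do not exist in the pass.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model: ops 0..NumLDS-1 are LDS (complete in issue order), ops
// NumLDS..NumLDS+NumSMEM-1 are SMEM (may complete in any order).
// Returns true if, in every legal completion order, the counter having
// dropped to <= WaitCount implies that the LDS op that is KthLast from
// the end (0 = last LDS, 1 = second-to-last, ...) has completed.
bool waitSuffices(int NumLDS, int NumSMEM, int KthLast, int WaitCount) {
  assert(KthLast >= 0 && KthLast < NumLDS && WaitCount >= 0);
  const int Total = NumLDS + NumSMEM;
  const int Target = NumLDS - 1 - KthLast;

  std::vector<int> Order(Total);
  for (int I = 0; I < Total; ++I)
    Order[I] = I;

  do {
    // Skip completion orders in which LDS ops finish out of issue order.
    int LastLDS = -1;
    bool Legal = true;
    for (int Op : Order) {
      if (Op < NumLDS) {
        if (Op < LastLDS) {
          Legal = false;
          break;
        }
        LastLDS = Op;
      }
    }
    if (!Legal)
      continue;

    // Walk the completion order and check the implication at every step.
    bool TargetDone = false;
    for (int Done = 0; Done <= Total; ++Done) {
      const int Outstanding = Total - Done;
      if (Outstanding <= WaitCount && !TargetDone)
        return false;
      if (Done < Total && Order[Done] == Target)
        TargetDone = true;
    }
  } while (std::next_permutation(Order.begin(), Order.end()));
  return true;
}
```

With two LDS and two SMEM operations in flight, `waitSuffices(2, 2, 1, 1)` holds, matching the lgkmcnt(1) argument for the second-to-last LDS, while `waitSuffices(2, 2, 1, 2)` does not, because both SMEM operations may return first and leave both LDS operations outstanding.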

In D54228#1292786, @nhaehnle wrote:

Hmm, how would that work? What lowest count are you referring to? For example, if lgkm has both in-flight SMEM read, and in-flight LDS, we could either have all SMEM read finish first or all LDS finish first.

Something that we could do is a more finely-grained tracking of in-order events. For example, if we have both in-flight SMEM and in-flight LDS, and we need to wait for the second-to-last LDS, then in fact we could do an lgkmcnt(1) wait -- because if the counter reaches 1 or less, the second-to-last LDS must have returned. After the lgkmcnt(1), we still need to conservatively assume that any event type that was previously in-flight may still be in-flight, so this patch here is compatible with such a more finely-grained tracking.

I think the finer-grained tracking could be achieved by introducing separate timelines for each event type: currently we only have timelines by counter. Anyway, it'd be a separate change, mainly for the benefit of mixing LDS and SMEM I think.

Right, that was what I meant. Even though some waitcnt counters (such as lgkm) are an amalgam of several other internal counters, some of those internal counters are still in order. So tracking the internal counters avoids the overly conservative use of 0.

It can also mean that some waitcnts may be eliminated since it may be that it is known there are no outstanding operations associated with an internal counter due to a previous waitcnt plus the fact that no operations for that internal counter have occurred subsequently.

Ping?

I think the remarks by @t-tye point to a potentially useful optimization, but that should not be part of this patch.

In D54228#1302715, @nhaehnle wrote:

Ping?

I think the remarks by @t-tye point to a potentially useful optimization, but that should not be part of this patch.

Agree that optimization could be done as a separate patch. Would a Fixme comment be a good idea to record the idea in the appropriate place?

add TODO comment about the opportunity of keeping per-event timelines

Harbormaster completed remote builds in B25262: Diff 175041. Nov 22 2018, 5:16 AM

LGTM

This revision is now accepted and ready to land. Nov 22 2018, 9:37 AM

Closed by commit rL347850: AMDGPU/InsertWaitcnts: Simplify pending events tracking (authored by nha). Nov 29 2018, 3:09 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: llvm/trunk/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
Size: 250 lines

Diff 175836

llvm/trunk/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

//===- SIInsertWaitcnts.cpp - Insert Wait Instructions --------------------===//		//===- SIInsertWaitcnts.cpp - Insert Wait Instructions --------------------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
/// \file		/// \file
/// Insert wait instructions for memory reads and writes.		/// Insert wait instructions for memory reads and writes.
///		///
/// Memory reads and writes are issued asynchronously, so we need to insert		/// Memory reads and writes are issued asynchronously, so we need to insert
/// S_WAITCNT instructions when we want to access any of their results or		/// S_WAITCNT instructions when we want to access any of their results or
/// overwrite any register that's used asynchronously.		/// overwrite any register that's used asynchronously.
		///
		/// TODO: This pass currently keeps one timeline per hardware counter. A more
		/// finely-grained approach that keeps one timeline per event type could
		/// sometimes get away with generating weaker s_waitcnt instructions. For
		/// example, when both SMEM and LDS are in flight and we need to wait for
		/// the i-th-last LDS instruction, then an lgkmcnt(i) is actually sufficient,
		/// but the pass will currently generate a conservative lgkmcnt(0) because
		/// multiple event types are in flight.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "AMDGPU.h"		#include "AMDGPU.h"
#include "AMDGPUSubtarget.h"		#include "AMDGPUSubtarget.h"
#include "SIDefines.h"		#include "SIDefines.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
#include "SIMachineFunctionInfo.h"		#include "SIMachineFunctionInfo.h"
[103 unchanged lines not shown]	enum WaitEventType {
EXP_GPR_LOCK, // export holding on its data src		EXP_GPR_LOCK, // export holding on its data src
GDS_GPR_LOCK, // GDS holding on its data and addr src		GDS_GPR_LOCK, // GDS holding on its data and addr src
EXP_POS_ACCESS, // write to export position		EXP_POS_ACCESS, // write to export position
EXP_PARAM_ACCESS, // write to export parameter		EXP_PARAM_ACCESS, // write to export parameter
VMW_GPR_LOCK, // vector-memory write holding on its data src		VMW_GPR_LOCK, // vector-memory write holding on its data src
NUM_WAIT_EVENTS,		NUM_WAIT_EVENTS,
};		};

iterator_range<enum_iterator<WaitEventType>> wait_event_types() {		static const uint32_t WaitEventMaskForInst[NUM_INST_CNTS] = {
return make_range(enum_iterator<WaitEventType>(VMEM_ACCESS),		(1 << VMEM_ACCESS),
enum_iterator<WaitEventType>(NUM_WAIT_EVENTS));		(1 << SMEM_ACCESS) | (1 << LDS_ACCESS) | (1 << GDS_ACCESS) |
}		(1 << SQ_MESSAGE),
		(1 << EXP_GPR_LOCK) | (1 << GDS_GPR_LOCK) | (1 << VMW_GPR_LOCK) |
		(1 << EXP_PARAM_ACCESS) | (1 << EXP_POS_ACCESS),
		};

// The mapping is:		// The mapping is:
// 0 .. SQ_MAX_PGM_VGPRS-1 real VGPRs		// 0 .. SQ_MAX_PGM_VGPRS-1 real VGPRs
// SQ_MAX_PGM_VGPRS .. NUM_ALL_VGPRS-1 extra VGPR-like slots		// SQ_MAX_PGM_VGPRS .. NUM_ALL_VGPRS-1 extra VGPR-like slots
// NUM_ALL_VGPRS .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs		// NUM_ALL_VGPRS .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs
// We reserve a fixed number of VGPR slots in the scoring tables for		// We reserve a fixed number of VGPR slots in the scoring tables for
// special tokens like SCMEM_LDS (needed for buffer load to LDS).		// special tokens like SCMEM_LDS (needed for buffer load to LDS).
enum RegisterMapping {		enum RegisterMapping {
[80 unchanged lines not shown]	int32_t getScoreUB(InstCounterType T) const {
assert(T < NUM_INST_CNTS);		assert(T < NUM_INST_CNTS);
if (T >= NUM_INST_CNTS)		if (T >= NUM_INST_CNTS)
return 0;		return 0;
return ScoreUBs[T];		return ScoreUBs[T];
}		}

// Mapping from event to counter.		// Mapping from event to counter.
InstCounterType eventCounter(WaitEventType E) {		InstCounterType eventCounter(WaitEventType E) {
switch (E) {		if (E == VMEM_ACCESS)
case VMEM_ACCESS:
return VM_CNT;		return VM_CNT;
case LDS_ACCESS:		if (WaitEventMaskForInst[LGKM_CNT] & (1 << E))
case GDS_ACCESS:
case SQ_MESSAGE:
case SMEM_ACCESS:
return LGKM_CNT;		return LGKM_CNT;
case EXP_GPR_LOCK:		assert(WaitEventMaskForInst[EXP_CNT] & (1 << E));
case GDS_GPR_LOCK:
case VMW_GPR_LOCK:
case EXP_POS_ACCESS:
case EXP_PARAM_ACCESS:
return EXP_CNT;		return EXP_CNT;
default:
llvm_unreachable("unhandled event type");
}
return NUM_INST_CNTS;
}		}

void setRegScore(int GprNo, InstCounterType T, int32_t Val) {		void setRegScore(int GprNo, InstCounterType T, int32_t Val) {
if (GprNo < NUM_ALL_VGPRS) {		if (GprNo < NUM_ALL_VGPRS) {
if (GprNo > VgprUB) {		if (GprNo > VgprUB) {
VgprUB = GprNo;		VgprUB = GprNo;
}		}
VgprScores[T][GprNo] = Val;		VgprScores[T][GprNo] = Val;
[12 unchanged lines not shown]	int32_t getRegScore(int GprNo, InstCounterType T) {
}		}
assert(T == LGKM_CNT);		assert(T == LGKM_CNT);
return SgprScores[GprNo - NUM_ALL_VGPRS];		return SgprScores[GprNo - NUM_ALL_VGPRS];
}		}

void clear() {		void clear() {
memset(ScoreLBs, 0, sizeof(ScoreLBs));		memset(ScoreLBs, 0, sizeof(ScoreLBs));
memset(ScoreUBs, 0, sizeof(ScoreUBs));		memset(ScoreUBs, 0, sizeof(ScoreUBs));
memset(EventUBs, 0, sizeof(EventUBs));		PendingEvents = 0;
		memset(MixedPendingEvents, 0, sizeof(MixedPendingEvents));
for (auto T : inst_counter_types())		for (auto T : inst_counter_types())
memset(VgprScores[T], 0, sizeof(VgprScores[T]));		memset(VgprScores[T], 0, sizeof(VgprScores[T]));
memset(SgprScores, 0, sizeof(SgprScores));		memset(SgprScores, 0, sizeof(SgprScores));
}		}

RegInterval getRegInterval(const MachineInstr *MI, const SIInstrInfo *TII,		RegInterval getRegInterval(const MachineInstr *MI, const SIInstrInfo *TII,
const MachineRegisterInfo *MRI,		const MachineRegisterInfo *MRI,
const SIRegisterInfo *TRI, unsigned OpNo,		const SIRegisterInfo *TRI, unsigned OpNo,
bool Def) const;		bool Def) const;

void setExpScore(const MachineInstr *MI, const SIInstrInfo *TII,		void setExpScore(const MachineInstr *MI, const SIInstrInfo *TII,
const SIRegisterInfo *TRI, const MachineRegisterInfo *MRI,		const SIRegisterInfo *TRI, const MachineRegisterInfo *MRI,
unsigned OpNo, int32_t Val);		unsigned OpNo, int32_t Val);

void setWaitAtBeginning() { WaitAtBeginning = true; }		void setWaitAtBeginning() { WaitAtBeginning = true; }
void clearWaitAtBeginning() { WaitAtBeginning = false; }		void clearWaitAtBeginning() { WaitAtBeginning = false; }
bool getWaitAtBeginning() const { return WaitAtBeginning; }		bool getWaitAtBeginning() const { return WaitAtBeginning; }
void setEventUB(enum WaitEventType W, int32_t Val) { EventUBs[W] = Val; }
int32_t getMaxVGPR() const { return VgprUB; }		int32_t getMaxVGPR() const { return VgprUB; }
int32_t getMaxSGPR() const { return SgprUB; }		int32_t getMaxSGPR() const { return SgprUB; }

int32_t getEventUB(enum WaitEventType W) const {
assert(W < NUM_WAIT_EVENTS);
return EventUBs[W];
}

bool counterOutOfOrder(InstCounterType T) const;		bool counterOutOfOrder(InstCounterType T) const;
bool simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;		bool simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;
bool simplifyWaitcnt(InstCounterType T, unsigned &Count) const;		bool simplifyWaitcnt(InstCounterType T, unsigned &Count) const;
void determineWait(InstCounterType T, int ScoreToWait,		void determineWait(InstCounterType T, int ScoreToWait,
AMDGPU::Waitcnt &Wait) const;		AMDGPU::Waitcnt &Wait) const;
void applyWaitcnt(const AMDGPU::Waitcnt &Wait);		void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
void applyWaitcnt(InstCounterType T, unsigned Count);		void applyWaitcnt(InstCounterType T, unsigned Count);
void updateByEvent(const SIInstrInfo *TII, const SIRegisterInfo *TRI,		void updateByEvent(const SIInstrInfo *TII, const SIRegisterInfo *TRI,
const MachineRegisterInfo *MRI, WaitEventType E,		const MachineRegisterInfo *MRI, WaitEventType E,
MachineInstr &MI);		MachineInstr &MI);

bool hasPendingSMEM() const {		bool hasPendingEvent(WaitEventType E) const {
return (EventUBs[SMEM_ACCESS] > ScoreLBs[LGKM_CNT] &&		return PendingEvents & (1 << E);
EventUBs[SMEM_ACCESS] <= ScoreUBs[LGKM_CNT]);
}		}

		void mergePendingEvents(const BlockWaitcntBrackets &Other);

bool hasPendingFlat() const {		bool hasPendingFlat() const {
return ((LastFlat[LGKM_CNT] > ScoreLBs[LGKM_CNT] &&		return ((LastFlat[LGKM_CNT] > ScoreLBs[LGKM_CNT] &&
LastFlat[LGKM_CNT] <= ScoreUBs[LGKM_CNT]) ||		LastFlat[LGKM_CNT] <= ScoreUBs[LGKM_CNT]) ||
(LastFlat[VM_CNT] > ScoreLBs[VM_CNT] &&		(LastFlat[VM_CNT] > ScoreLBs[VM_CNT] &&
LastFlat[VM_CNT] <= ScoreUBs[VM_CNT]));		LastFlat[VM_CNT] <= ScoreUBs[VM_CNT]));
}		}

void setPendingFlat() {		void setPendingFlat() {
LastFlat[VM_CNT] = ScoreUBs[VM_CNT];		LastFlat[VM_CNT] = ScoreUBs[VM_CNT];
LastFlat[LGKM_CNT] = ScoreUBs[LGKM_CNT];		LastFlat[LGKM_CNT] = ScoreUBs[LGKM_CNT];
}		}

int pendingFlat(InstCounterType Ct) const { return LastFlat[Ct]; }		int pendingFlat(InstCounterType Ct) const { return LastFlat[Ct]; }

void setLastFlat(InstCounterType Ct, int Val) { LastFlat[Ct] = Val; }		void setLastFlat(InstCounterType Ct, int Val) { LastFlat[Ct] = Val; }

bool getRevisitLoop() const { return RevisitLoop; }		bool getRevisitLoop() const { return RevisitLoop; }
void setRevisitLoop(bool RevisitLoopIn) { RevisitLoop = RevisitLoopIn; }		void setRevisitLoop(bool RevisitLoopIn) { RevisitLoop = RevisitLoopIn; }

void setPostOrder(int32_t PostOrderIn) { PostOrder = PostOrderIn; }		void setPostOrder(int32_t PostOrderIn) { PostOrder = PostOrderIn; }
int32_t getPostOrder() const { return PostOrder; }		int32_t getPostOrder() const { return PostOrder; }

bool mixedExpTypes() const { return MixedExpTypes; }
void setMixedExpTypes(bool MixedExpTypesIn) {
MixedExpTypes = MixedExpTypesIn;
}

void print(raw_ostream &);		void print(raw_ostream &);
void dump() { print(dbgs()); }		void dump() { print(dbgs()); }

private:		private:
const GCNSubtarget *ST = nullptr;		const GCNSubtarget *ST = nullptr;
bool WaitAtBeginning = false;		bool WaitAtBeginning = false;
bool RevisitLoop = false;		bool RevisitLoop = false;
bool MixedExpTypes = false;
int32_t PostOrder = 0;		int32_t PostOrder = 0;
int32_t ScoreLBs[NUM_INST_CNTS] = {0};		int32_t ScoreLBs[NUM_INST_CNTS] = {0};
int32_t ScoreUBs[NUM_INST_CNTS] = {0};		int32_t ScoreUBs[NUM_INST_CNTS] = {0};
int32_t EventUBs[NUM_WAIT_EVENTS] = {0};		uint32_t PendingEvents = 0;
		bool MixedPendingEvents[NUM_INST_CNTS] = {false};
// Remember the last flat memory operation.		// Remember the last flat memory operation.
int32_t LastFlat[NUM_INST_CNTS] = {0};		int32_t LastFlat[NUM_INST_CNTS] = {0};
// wait_cnt scores for every vgpr.		// wait_cnt scores for every vgpr.
// Keep track of the VgprUB and SgprUB to make merge at join efficient.		// Keep track of the VgprUB and SgprUB to make merge at join efficient.
int32_t VgprUB = 0;		int32_t VgprUB = 0;
int32_t SgprUB = 0;		int32_t SgprUB = 0;
int32_t VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS];		int32_t VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS];
// Wait cnt scores for every sgpr, only lgkmcnt is relevant.		// Wait cnt scores for every sgpr, only lgkmcnt is relevant.
[183 unchanged lines not shown]

void BlockWaitcntBrackets::updateByEvent(const SIInstrInfo *TII,		void BlockWaitcntBrackets::updateByEvent(const SIInstrInfo *TII,
const SIRegisterInfo *TRI,		const SIRegisterInfo *TRI,
const MachineRegisterInfo *MRI,		const MachineRegisterInfo *MRI,
WaitEventType E, MachineInstr &Inst) {		WaitEventType E, MachineInstr &Inst) {
const MachineRegisterInfo &MRIA = *MRI;		const MachineRegisterInfo &MRIA = *MRI;
InstCounterType T = eventCounter(E);		InstCounterType T = eventCounter(E);
int32_t CurrScore = getScoreUB(T) + 1;		int32_t CurrScore = getScoreUB(T) + 1;
// EventUB and ScoreUB need to be update regardless if this event changes		// PendingEvents and ScoreUB need to be update regardless if this event
// the score of a register or not.		// changes the score of a register or not.
// Examples including vm_cnt when buffer-store or lgkm_cnt when send-message.		// Examples including vm_cnt when buffer-store or lgkm_cnt when send-message.
EventUBs[E] = CurrScore;		if (!hasPendingEvent(E)) {
		if (PendingEvents & WaitEventMaskForInst[T])
		MixedPendingEvents[T] = true;
		PendingEvents |= 1 << E;
		}
setScoreUB(T, CurrScore);		setScoreUB(T, CurrScore);

if (T == EXP_CNT) {		if (T == EXP_CNT) {
// Check for mixed export types. If they are mixed, then a waitcnt exp(0)
// is required.
if (!MixedExpTypes) {
MixedExpTypes = counterOutOfOrder(EXP_CNT);
}

// Put score on the source vgprs. If this is a store, just use those		// Put score on the source vgprs. If this is a store, just use those
// specific register(s).		// specific register(s).
if (TII->isDS(Inst) && (Inst.mayStore() || Inst.mayLoad())) {		if (TII->isDS(Inst) && (Inst.mayStore() || Inst.mayLoad())) {
// All GDS operations must protect their address register (same as		// All GDS operations must protect their address register (same as
// export.)		// export.)
if (Inst.getOpcode() != AMDGPU::DS_APPEND &&		if (Inst.getOpcode() != AMDGPU::DS_APPEND &&
Inst.getOpcode() != AMDGPU::DS_CONSUME) {		Inst.getOpcode() != AMDGPU::DS_CONSUME) {
setExpScore(		setExpScore(
[215 unchanged lines not shown]	if ((UB >= ScoreToWait) && (ScoreToWait > LB)) {
}		}
}		}
}		}

void BlockWaitcntBrackets::applyWaitcnt(const AMDGPU::Waitcnt &Wait) {		void BlockWaitcntBrackets::applyWaitcnt(const AMDGPU::Waitcnt &Wait) {
applyWaitcnt(VM_CNT, Wait.VmCnt);		applyWaitcnt(VM_CNT, Wait.VmCnt);
applyWaitcnt(EXP_CNT, Wait.ExpCnt);		applyWaitcnt(EXP_CNT, Wait.ExpCnt);
applyWaitcnt(LGKM_CNT, Wait.LgkmCnt);		applyWaitcnt(LGKM_CNT, Wait.LgkmCnt);

if (Wait.ExpCnt == 0)
setMixedExpTypes(false);
}		}

void BlockWaitcntBrackets::applyWaitcnt(InstCounterType T, unsigned Count) {		void BlockWaitcntBrackets::applyWaitcnt(InstCounterType T, unsigned Count) {
const int32_t UB = getScoreUB(T);		const int32_t UB = getScoreUB(T);
if (Count >= (unsigned)UB)		if (Count >= (unsigned)UB)
return;		return;
if (Count != 0) {		if (Count != 0) {
if (counterOutOfOrder(T))		if (counterOutOfOrder(T))
return;		return;
setScoreLB(T, std::max(getScoreLB(T), UB - (int32_t)Count));		setScoreLB(T, std::max(getScoreLB(T), UB - (int32_t)Count));
} else {		} else {
setScoreLB(T, UB);		setScoreLB(T, UB);
		MixedPendingEvents[T] = false;
		PendingEvents &= ~WaitEventMaskForInst[T];
		}
		}

		void BlockWaitcntBrackets::mergePendingEvents(const BlockWaitcntBrackets &Other) {
		for (auto T : inst_counter_types()) {
		uint32_t Old = PendingEvents & WaitEventMaskForInst[T];
		uint32_t New = Other.PendingEvents & WaitEventMaskForInst[T];
		if (Other.MixedPendingEvents[T] || (Old && New && Old != New))
		MixedPendingEvents[T] = true;
		PendingEvents |= New;
}		}
}		}

// Where there are multiple types of event in the bracket of a counter,		// Where there are multiple types of event in the bracket of a counter,
// the decrement may go out of order.		// the decrement may go out of order.
bool BlockWaitcntBrackets::counterOutOfOrder(InstCounterType T) const {		bool BlockWaitcntBrackets::counterOutOfOrder(InstCounterType T) const {
switch (T) {
case VM_CNT:
return false;
case LGKM_CNT: {
if (EventUBs[SMEM_ACCESS] > ScoreLBs[LGKM_CNT] &&
EventUBs[SMEM_ACCESS] <= ScoreUBs[LGKM_CNT]) {
// Scalar memory read always can go out of order.		// Scalar memory read always can go out of order.
		if (T == LGKM_CNT && hasPendingEvent(SMEM_ACCESS))
return true;		return true;
}		return MixedPendingEvents[T];
int NumEventTypes = 0;
if (EventUBs[LDS_ACCESS] > ScoreLBs[LGKM_CNT] &&
EventUBs[LDS_ACCESS] <= ScoreUBs[LGKM_CNT]) {
NumEventTypes++;
}
if (EventUBs[GDS_ACCESS] > ScoreLBs[LGKM_CNT] &&
EventUBs[GDS_ACCESS] <= ScoreUBs[LGKM_CNT]) {
NumEventTypes++;
}
if (EventUBs[SQ_MESSAGE] > ScoreLBs[LGKM_CNT] &&
EventUBs[SQ_MESSAGE] <= ScoreUBs[LGKM_CNT]) {
NumEventTypes++;
}
if (NumEventTypes <= 1) {
return false;
}
break;
}
case EXP_CNT: {
// If there has been a mixture of export types, then a waitcnt exp(0) is
// required.
if (MixedExpTypes)
return true;
int NumEventTypes = 0;
if (EventUBs[EXP_GPR_LOCK] > ScoreLBs[EXP_CNT] &&
EventUBs[EXP_GPR_LOCK] <= ScoreUBs[EXP_CNT]) {
NumEventTypes++;
}
if (EventUBs[GDS_GPR_LOCK] > ScoreLBs[EXP_CNT] &&
EventUBs[GDS_GPR_LOCK] <= ScoreUBs[EXP_CNT]) {
NumEventTypes++;
}
if (EventUBs[VMW_GPR_LOCK] > ScoreLBs[EXP_CNT] &&
EventUBs[VMW_GPR_LOCK] <= ScoreUBs[EXP_CNT]) {
NumEventTypes++;
}
if (EventUBs[EXP_PARAM_ACCESS] > ScoreLBs[EXP_CNT] &&
EventUBs[EXP_PARAM_ACCESS] <= ScoreUBs[EXP_CNT]) {
NumEventTypes++;
}

if (EventUBs[EXP_POS_ACCESS] > ScoreLBs[EXP_CNT] &&
EventUBs[EXP_POS_ACCESS] <= ScoreUBs[EXP_CNT]) {
NumEventTypes++;
}

if (NumEventTypes <= 1) {
return false;
}
break;
}
default:
break;
}
return true;
}		}

INITIALIZE_PASS_BEGIN(SIInsertWaitcnts, DEBUG_TYPE, "SI Insert Waitcnts", false,		INITIALIZE_PASS_BEGIN(SIInsertWaitcnts, DEBUG_TYPE, "SI Insert Waitcnts", false,
false)		false)
INITIALIZE_PASS_END(SIInsertWaitcnts, DEBUG_TYPE, "SI Insert Waitcnts", false,		INITIALIZE_PASS_END(SIInsertWaitcnts, DEBUG_TYPE, "SI Insert Waitcnts", false,
false)		false)

char SIInsertWaitcnts::ID = 0;		char SIInsertWaitcnts::ID = 0;
[119 unchanged lines not shown]	#endif
// Export & GDS instructions do not read the EXEC mask until after the export		// Export & GDS instructions do not read the EXEC mask until after the export
// is granted (which can occur well after the instruction is issued).		// is granted (which can occur well after the instruction is issued).
// The shader program must flush all EXP operations on the export-count		// The shader program must flush all EXP operations on the export-count
// before overwriting the EXEC mask.		// before overwriting the EXEC mask.
else {		else {
if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {		if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
// Export and GDS are tracked individually, either may trigger a waitcnt		// Export and GDS are tracked individually, either may trigger a waitcnt
// for EXEC.		// for EXEC.
ScoreBrackets->determineWait(		if (ScoreBrackets->hasPendingEvent(EXP_GPR_LOCK) ||
EXP_CNT, ScoreBrackets->getEventUB(EXP_GPR_LOCK), Wait);		ScoreBrackets->hasPendingEvent(EXP_PARAM_ACCESS) ||
ScoreBrackets->determineWait(		ScoreBrackets->hasPendingEvent(EXP_POS_ACCESS) ||
EXP_CNT, ScoreBrackets->getEventUB(EXP_PARAM_ACCESS), Wait);		ScoreBrackets->hasPendingEvent(GDS_GPR_LOCK)) {
ScoreBrackets->determineWait(		Wait.ExpCnt = 0;
EXP_CNT, ScoreBrackets->getEventUB(EXP_POS_ACCESS), Wait);		}
ScoreBrackets->determineWait(
EXP_CNT, ScoreBrackets->getEventUB(GDS_GPR_LOCK), Wait);
}		}

#if 0 // TODO: the following code to handle CALL.		#if 0 // TODO: the following code to handle CALL.
// The argument passing for CALLs should suffice for VM_CNT and LGKM_CNT.		// The argument passing for CALLs should suffice for VM_CNT and LGKM_CNT.
// However, there is a problem with EXP_CNT, because the call cannot		// However, there is a problem with EXP_CNT, because the call cannot
// easily tell if a register is used in the function, and if it did, then		// easily tell if a register is used in the function, and if it did, then
// the referring instruction would have to have an S_WAITCNT, which is		// the referring instruction would have to have an S_WAITCNT, which is
// dependent on all call sites. So Instead, force S_WAITCNT for EXP_CNTs		// dependent on all call sites. So Instead, force S_WAITCNT for EXP_CNTs
[88 unchanged lines not shown]	#endif
}		}

// TODO: Remove this work-around, enable the assert for Bug 457939		// TODO: Remove this work-around, enable the assert for Bug 457939
// after fixing the scheduler. Also, the Shader Compiler code is		// after fixing the scheduler. Also, the Shader Compiler code is
// independent of target.		// independent of target.
if (readsVCCZ(MI) && ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS) {		if (readsVCCZ(MI) && ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS) {
if (ScoreBrackets->getScoreLB(LGKM_CNT) <		if (ScoreBrackets->getScoreLB(LGKM_CNT) <
ScoreBrackets->getScoreUB(LGKM_CNT) &&		ScoreBrackets->getScoreUB(LGKM_CNT) &&
ScoreBrackets->hasPendingSMEM()) {		ScoreBrackets->hasPendingEvent(SMEM_ACCESS)) {
Wait.LgkmCnt = 0;		Wait.LgkmCnt = 0;
}		}
}		}

// Early-out if no wait is indicated.		// Early-out if no wait is indicated.
if (!ScoreBrackets->simplifyWaitcnt(Wait) && !IsForceEmitWaitcnt) {		if (!ScoreBrackets->simplifyWaitcnt(Wait) && !IsForceEmitWaitcnt) {
if (OldWaitcntInstr) {		if (OldWaitcntInstr) {
if (TrackedWaitcntSet.count(OldWaitcntInstr)) {		if (TrackedWaitcntSet.count(OldWaitcntInstr)) {
[163 unchanged lines not shown]
}		}

// Merge the score brackets of the Block's predecessors;		// Merge the score brackets of the Block's predecessors;
// this merged score bracket is used when adding waitcnts to the Block		// this merged score bracket is used when adding waitcnts to the Block
void SIInsertWaitcnts::mergeInputScoreBrackets(MachineBasicBlock &Block) {		void SIInsertWaitcnts::mergeInputScoreBrackets(MachineBasicBlock &Block) {
BlockWaitcntBrackets *ScoreBrackets = BlockWaitcntBracketsMap[&Block].get();		BlockWaitcntBrackets *ScoreBrackets = BlockWaitcntBracketsMap[&Block].get();
int32_t MaxPending[NUM_INST_CNTS] = {0};		int32_t MaxPending[NUM_INST_CNTS] = {0};
int32_t MaxFlat[NUM_INST_CNTS] = {0};		int32_t MaxFlat[NUM_INST_CNTS] = {0};
bool MixedExpTypes = false;

// For single basic block loops, we need to retain the Block's		// For single basic block loops, we need to retain the Block's
// score bracket to have accurate Pred info. So, make a copy of Block's		// score bracket to have accurate Pred info. So, make a copy of Block's
// score bracket, clear() it (which retains several important bits of info),		// score bracket, clear() it (which retains several important bits of info),
// populate, and then replace en masse. For non-single basic block loops,		// populate, and then replace en masse. For non-single basic block loops,
// just clear Block's current score bracket and repopulate in-place.		// just clear Block's current score bracket and repopulate in-place.
bool IsSelfPred;		bool IsSelfPred;
std::unique_ptr<BlockWaitcntBrackets> S;		std::unique_ptr<BlockWaitcntBrackets> S;
[19 unchanged lines not shown]	for (MachineBasicBlock *Pred : Block.predecessors()) {
for (auto T : inst_counter_types()) {		for (auto T : inst_counter_types()) {
int span =		int span =
PredScoreBrackets->getScoreUB(T) - PredScoreBrackets->getScoreLB(T);		PredScoreBrackets->getScoreUB(T) - PredScoreBrackets->getScoreLB(T);
MaxPending[T] = std::max(MaxPending[T], span);		MaxPending[T] = std::max(MaxPending[T], span);
span =		span =
PredScoreBrackets->pendingFlat(T) - PredScoreBrackets->getScoreLB(T);		PredScoreBrackets->pendingFlat(T) - PredScoreBrackets->getScoreLB(T);
MaxFlat[T] = std::max(MaxFlat[T], span);		MaxFlat[T] = std::max(MaxFlat[T], span);
}		}

MixedExpTypes |= PredScoreBrackets->mixedExpTypes();
}

// Special handling for GDS_GPR_LOCK and EXP_GPR_LOCK.
for (MachineBasicBlock *Pred : Block.predecessors()) {
BlockWaitcntBrackets *PredScoreBrackets =
BlockWaitcntBracketsMap[Pred].get();
bool Visited = BlockVisitedSet.count(Pred);
if (!Visited || PredScoreBrackets->getWaitAtBeginning()) {
continue;
}

int GDSSpan = PredScoreBrackets->getEventUB(GDS_GPR_LOCK) -
PredScoreBrackets->getScoreLB(EXP_CNT);
MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], GDSSpan);
int EXPSpan = PredScoreBrackets->getEventUB(EXP_GPR_LOCK) -
PredScoreBrackets->getScoreLB(EXP_CNT);
MaxPending[EXP_CNT] = std::max(MaxPending[EXP_CNT], EXPSpan);
}		}

#if 0		#if 0
// LC does not (unlike) add a waitcnt at beginning. Leaving it as marker.		// LC does not (unlike) add a waitcnt at beginning. Leaving it as marker.
// TODO: how does LC distinguish between function entry and main entry?		// TODO: how does LC distinguish between function entry and main entry?
// If this is the entry to a function, force a wait.		// If this is the entry to a function, force a wait.
MachineBasicBlock &Entry = Block.getParent()->front();		MachineBasicBlock &Entry = Block.getParent()->front();
if (Entry.getNumber() == Block.getNumber()) {		if (Entry.getNumber() == Block.getNumber()) {
ScoreBrackets->setWaitAtBeginning();		ScoreBrackets->setWaitAtBeginning();
return;		return;
}		}
#endif		#endif

// Now set the current Block's brackets to the largest ending bracket.		// Now set the current Block's brackets to the largest ending bracket.
for (auto T : inst_counter_types()) {		for (auto T : inst_counter_types()) {
ScoreBrackets->setScoreUB(T, MaxPending[T]);		ScoreBrackets->setScoreUB(T, MaxPending[T]);
ScoreBrackets->setScoreLB(T, 0);		ScoreBrackets->setScoreLB(T, 0);
ScoreBrackets->setLastFlat(T, MaxFlat[T]);		ScoreBrackets->setLastFlat(T, MaxFlat[T]);
}		}

ScoreBrackets->setMixedExpTypes(MixedExpTypes);

// Set the register scoreboard.		// Set the register scoreboard.
for (MachineBasicBlock *Pred : Block.predecessors()) {		for (MachineBasicBlock *Pred : Block.predecessors()) {
if (!BlockVisitedSet.count(Pred)) {		if (!BlockVisitedSet.count(Pred)) {
continue;		continue;
}		}

BlockWaitcntBrackets *PredScoreBrackets =		BlockWaitcntBrackets *PredScoreBrackets =
BlockWaitcntBracketsMap[Pred].get();		BlockWaitcntBracketsMap[Pred].get();
[... 26 unchanged lines omitted ...]	for (auto T : inst_counter_types()) {
std::max(		std::max(
ScoreBrackets->getRegScore(J + NUM_ALL_VGPRS, LGKM_CNT),		ScoreBrackets->getRegScore(J + NUM_ALL_VGPRS, LGKM_CNT),
NewRegScore));		NewRegScore));
}		}
}		}
}		}
}		}

// Also merge the WaitEvent information.		ScoreBrackets->mergePendingEvents(*PredScoreBrackets);
for (auto W : wait_event_types()) {
enum InstCounterType T = PredScoreBrackets->eventCounter(W);
int PredEventUB = PredScoreBrackets->getEventUB(W);
if (PredEventUB > PredScoreBrackets->getScoreLB(T)) {
int NewEventUB =
MaxPending[T] + PredEventUB - PredScoreBrackets->getScoreUB(T);
if (NewEventUB > 0) {
ScoreBrackets->setEventUB(
W, std::max(ScoreBrackets->getEventUB(W), NewEventUB));
}
}
}
}

// Special case handling of GDS_GPR_LOCK and EXP_GPR_LOCK. Merge this for the
// sequencing predecessors, because changes to EXEC require waitcnts due to
// the delayed nature of these operations.
for (MachineBasicBlock *Pred : Block.predecessors()) {
if (!BlockVisitedSet.count(Pred)) {
continue;
}

BlockWaitcntBrackets *PredScoreBrackets =
BlockWaitcntBracketsMap[Pred].get();

int pred_gds_ub = PredScoreBrackets->getEventUB(GDS_GPR_LOCK);
if (pred_gds_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {
int new_gds_ub = MaxPending[EXP_CNT] + pred_gds_ub -
PredScoreBrackets->getScoreUB(EXP_CNT);
if (new_gds_ub > 0) {
ScoreBrackets->setEventUB(
GDS_GPR_LOCK,
std::max(ScoreBrackets->getEventUB(GDS_GPR_LOCK), new_gds_ub));
}
}
int pred_exp_ub = PredScoreBrackets->getEventUB(EXP_GPR_LOCK);
if (pred_exp_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {
int new_exp_ub = MaxPending[EXP_CNT] + pred_exp_ub -
PredScoreBrackets->getScoreUB(EXP_CNT);
if (new_exp_ub > 0) {
ScoreBrackets->setEventUB(
EXP_GPR_LOCK,
std::max(ScoreBrackets->getEventUB(EXP_GPR_LOCK), new_exp_ub));
}
}
}		}

// if a single block loop, update the score brackets. Not needed for other		// if a single block loop, update the score brackets. Not needed for other
// blocks, as we did this in-place		// blocks, as we did this in-place
if (IsSelfPred) {		if (IsSelfPred) {
BlockWaitcntBracketsMap[&Block] = llvm::make_unique<BlockWaitcntBrackets>(*ScoreBrackets);		BlockWaitcntBracketsMap[&Block] = llvm::make_unique<BlockWaitcntBrackets>(*ScoreBrackets);
}		}
}		}
[... 66 unchanged lines omitted ...]	if (Inst.getOpcode() == AMDGPU::S_WAITCNT) {
continue;		continue;
}		}

bool VCCZBugWorkAround = false;		bool VCCZBugWorkAround = false;
if (readsVCCZ(Inst) &&		if (readsVCCZ(Inst) &&
(!VCCZBugHandledSet.count(&Inst))) {		(!VCCZBugHandledSet.count(&Inst))) {
if (ScoreBrackets->getScoreLB(LGKM_CNT) <		if (ScoreBrackets->getScoreLB(LGKM_CNT) <
ScoreBrackets->getScoreUB(LGKM_CNT) &&		ScoreBrackets->getScoreUB(LGKM_CNT) &&
ScoreBrackets->hasPendingSMEM()) {		ScoreBrackets->hasPendingEvent(SMEM_ACCESS)) {
if (ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS)		if (ST->getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS)
VCCZBugWorkAround = true;		VCCZBugWorkAround = true;
}		}
}		}

// Generate an s_waitcnt instruction to be placed before		// Generate an s_waitcnt instruction to be placed before
// cur_Inst, if needed.		// cur_Inst, if needed.
generateWaitcntInstBefore(Inst, ScoreBrackets, OldWaitcntInstr);		generateWaitcntInstBefore(Inst, ScoreBrackets, OldWaitcntInstr);