This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not only rely on BB number when finding bottom loop
ClosedPublic

Authored by msearles on Feb 27 2018, 12:20 PM.

Details

Reviewers

arsenm
nhaehnle
mareko
hakzsam
kzhuravl

Commits

rG1bc6e71f32eb: [AMDGPU] Do not only rely on BB number when finding bottom loop
rL330337: [AMDGPU] Do not only rely on BB number when finding bottom loop

Summary

We should also check that the "bottom" basic block of a loop
is a successor of the "header" basic block, otherwise we don't
propagate the information correctly when the CFG is complex.

This fixes an important rendering problem with Wolfenstein 2,
which was caused by a missing vector-memory wait.

Please review,
Thanks!

Diff Detail

Event Timeline

hakzsam created this revision. Feb 27 2018, 12:20 PM

Herald added subscribers: t-tye, tpr, dstuttard and 3 others. Feb 27 2018, 12:20 PM

is it possible to include the test as well?

This new revision attaches an LLVM IR testcase that reproduces the issue. I'm not sure whether the test can be reduced further compared to the original monster shader.

I have noticed one weird thing after running shader-db, it looks like there is a side effect that produces some unnecessary expcnt(0) instructions in some compute shaders. Maybe this patch triggers a new issue? Otherwise, I don't see any regressions with RADV.

I will investigate the extra expcnt(0).

Please review, thanks!

Here's a test case where an expcnt(0) is generated: https://hastebin.com/uyuhabeqov.pl

If you look at the CFG, there are two loops (with headers BB1 and BB5). The previous algorithm returns BB12 for both loop headers, which looks quite wrong. The real bottom basic blocks are BB2 for BB1 and BB6 for BB5 (which is what this patch returns).
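
To illustrate the difference between the two approaches, here is a minimal standalone sketch. It does not use the LLVM API: ToyCFG, bottomByNumber, isBottom and makeExampleCFG are hypothetical names, and the CFG is made up for illustration (it is not the CFG from the hastebin link). It models a discontiguous loop in which the highest-numbered block is not the one that branches back to the header.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy CFG: block number -> set of successor block numbers.
using ToyCFG = std::map<int, std::set<int>>;

// Old heuristic: pick the loop block with the highest number.
int bottomByNumber(int Header, const std::vector<int> &LoopBlocks) {
  int Bottom = Header;
  for (int BB : LoopBlocks)
    if (BB > Bottom)
      Bottom = BB;
  return Bottom;
}

// Patched idea: Block is a "bottom" only if it belongs to the loop and has
// the loop header among its successors.
bool isBottom(const ToyCFG &CFG, int Header,
              const std::vector<int> &LoopBlocks, int Block) {
  for (int BB : LoopBlocks) {
    if (BB != Block)
      continue;
    auto It = CFG.find(BB);
    return It != CFG.end() && It->second.count(Header) > 0;
  }
  return false;
}

// Made-up discontiguous loop: 1 -> 12 -> 2 -> 1, with the exit edge 2 -> 3.
ToyCFG makeExampleCFG() {
  ToyCFG CFG;
  CFG[0] = {1};
  CFG[1] = {12};
  CFG[12] = {2};
  CFG[2] = {1, 3};
  CFG[3] = {};
  return CFG;
}
```

With loop blocks {1, 12, 2} and header 1, the number-based heuristic picks block 12, while the successor check accepts block 2 and rejects block 12.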

I think something isn't correctly re-initialized. Anyway, generating expcnt(0) instructions in compute shaders is just useless.

Oh, I thought expcnt() was for parameters export only, but it can be used for GDS instructions as well.

So, the expcnt(0) is generated because there is a buffer_atomic_add in that test case, which actually makes sense to me. Looks like this patch also fixes that case.

Could you please reduce the testcase?

Okay, I will try to.

v3: reduced testcase

For the testcase, would you also run -instnamer to rename instances of %<number>? You may want to combine it with other clean-up options, something like this: opt -S -deadarghaX0r -strip -strip-debug -strip-dead-prototypes -instnamer

Ah yeah, thanks for the hints. I will try to clean up the testcase a bit more.

v4: - run 'opt -S -deadarghaX0r -strip -strip-debug -strip-dead-prototypes -instnamer'

remove RFC on the subject

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

test/CodeGen/AMDGPU/waitcnt-looptest.ll
145–146 ↗	(On Diff #137043)	These attributes clash with the attributes defined later on in the added part of the tests. I'm surprised there is no warning about that. Please put all attributes at the end of the file and make their numbers unique.

nhaehnle added inline comments. Mar 5 2018, 11:42 PM

test/CodeGen/AMDGPU/waitcnt-looptest.ll
170 ↗	(On Diff #137043)	This and possibly others are redundant. Have you tried running -dce?

In D43831#1028129, @nhaehnle wrote:

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

I'm not sure either if the current code correctly supports multiple back-edges. I think if there is more than one, loopBottom() will select the basic block with the highest number if it's a successor of the header loop.

The other way around: it will select the basic block with the highest number which has the loop header as its successor.

The question is whether that ends up doing the right thing in all cases.
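
As a rough illustration of the multiple back-edge case, the sketch below collects every loop block that has the header as a successor. It is standalone and does not use the LLVM API: ToyCFG, bottomBlocks and makeTestCFG are hypothetical names. The successor lists are taken from the waitcnt-back-edge-loop.mir test attached to this revision, where bb.1 is the loop header.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy CFG: block number -> set of successor block numbers.
using ToyCFG = std::map<int, std::set<int>>;

// Collect every loop block that branches back to the header, in the order
// the loop blocks are given.
std::vector<int> bottomBlocks(const ToyCFG &CFG, int Header,
                              const std::vector<int> &LoopBlocks) {
  std::vector<int> Bottoms;
  for (int BB : LoopBlocks) {
    auto It = CFG.find(BB);
    if (It != CFG.end() && It->second.count(Header) > 0)
      Bottoms.push_back(BB);
  }
  return Bottoms;
}

// Successor lists as written in waitcnt-back-edge-loop.mir.
ToyCFG makeTestCFG() {
  ToyCFG CFG;
  CFG[0] = {1};
  CFG[1] = {5, 2};
  CFG[2] = {4, 3};
  CFG[3] = {1};
  CFG[4] = {3, 1};
  CFG[5] = {};
  return CFG;
}
```

For header 1 and loop blocks {1, 2, 3, 4}, both block 3 and block 4 are back-edge sources, so a rule that selects a single highest-numbered block would only ever consider block 4.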

Exactly, I can try to write a test with multi back-edges and we will see what happens in this situation.

test/CodeGen/AMDGPU/waitcnt-looptest.ll
170 ↗	(On Diff #137043)	No, I didn't, I will try and update the patch again if needed, Thanks!

v5: run -dce to remove redundant phis

In D43831#1028165, @nhaehnle wrote:

The other way around: it will select the basic block with the highest number which has the loop header as its successor.

The question is whether that ends up doing the right thing in all cases.

https://hastebin.com/joqihayohe One case that doesn't work (ie. a vmcnt(0) is missing).

I should have a fix for the multiple back-edges case, I'm going to run CTS to be sure it doesn't break anything before updating.

v6: fix the multiple back-edges case and update the testcase

I think this really calls for a MIR testcase

v7:

replace the LLVM IR testcase with a MIR one
remove one unused variable in isLoopBottom()
do not rely on block number in isLoopBottom()

arsenm added inline comments. Mar 7 2018, 7:21 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1556	lowercase true
1562	Braces
1563	Swap order of checks
1572	Braces

v8:

add a sanity check when a block is looping over itself
cosmetic changes

msearles added inline comments. Mar 8 2018, 6:40 PM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1564	Can you explain why you avoid re-visiting a loop when a block loops over itself?

I suppose this makes sense considering the current waitcnt pass sources, i.e., the pass admits that it does not handle single block loops correctly. See the "IMPORTANT NOTE" to this effect in SIInsertWaitcnts::mergeInputScoreBrackets() at SIInsertWaitcnts.cpp:1315-1317. However, in order to handle single block loops correctly, the pass will need to revisit them, and then the sanity check won't make sense.

I have a patch that fixes SIInsertWaitcnts::mergeInputScoreBrackets() in the case of a single block loop, but it's only useful if the block is revisited. See the reduced testcase below. Note:

the first GLOBAL_STORE_DWORD has its write-data in $vgpr11
the last GLOBAL_LOAD_DWORD loads into $vgpr11
so a waitcnt is needed before the store to wait for the load
if the pass does not revisit bb.0, it has no way of knowing this

$ llc -mcpu=gfx900 -mtriple=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts testcase.mir
$ cat testcase.mir
--- |
  define amdgpu_kernel void @benchmark_func() { ret void }
...
---
name: benchmark_func
body: |
  bb.0:
    GLOBAL_STORE_DWORD $vgpr7_vgpr8, $vgpr11, 0, 0, 0, implicit $exec
    $vgpr21 = GLOBAL_LOAD_DWORD $vgpr4_vgpr5, 0, 0, 0, implicit $exec
    $vgpr10 = GLOBAL_LOAD_DWORD $vgpr10_vgpr11, 0, 0, 0, implicit $exec
    GLOBAL_STORE_DWORD $vgpr14_vgpr15, $vgpr21, 0, 0, 0, implicit $exec
    $vgpr11 = GLOBAL_LOAD_DWORD $vgpr11_vgpr12, 0, 0, 0, implicit $exec
    S_CBRANCH_SCC1 %bb.0, implicit $scc
  bb.1:
    S_ENDPGM
...
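
The need for a second walk over a single block loop can be shown with a deliberately simplified model. This is only a sketch and not how SIInsertWaitcnts tracks scores: ToyOp, walkBlock and makeSelfLoopBlock are hypothetical names, address operands are ignored, and a wait is assumed to drain every outstanding load. The operation sequence mirrors bb.0 of the reduced testcase (store from $vgpr11, loads into $vgpr21 and $vgpr10, store from $vgpr21, load into $vgpr11).

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical, simplified memory operation.
struct ToyOp {
  bool IsLoad; // true: load that writes Reg; false: store that reads Reg
  int Reg;
};

// Walk the block once. Pending holds registers with an outstanding load on
// entry and is updated to the state on exit. Returns the indices of the
// operations that need a wait inserted before them.
std::vector<int> walkBlock(const std::vector<ToyOp> &Block,
                           std::set<int> &Pending) {
  std::vector<int> Waits;
  for (int I = 0, E = static_cast<int>(Block.size()); I != E; ++I) {
    const ToyOp &Op = Block[I];
    if (Op.IsLoad) {
      Pending.insert(Op.Reg);
    } else if (Pending.count(Op.Reg) > 0) {
      Waits.push_back(I);
      Pending.clear(); // assumption: the wait drains all outstanding loads
    }
  }
  return Waits;
}

// Operation order of bb.0 in the reduced testcase (data registers only).
std::vector<ToyOp> makeSelfLoopBlock() {
  return {{false, 11}, {true, 21}, {true, 10}, {false, 21}, {true, 11}};
}
```

Starting from an empty pending set, the first walk only finds the wait before the second store. Walking the block again with the exit state fed back through the back-edge also finds the wait before the first store, which is the case the comment above describes.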

msearles added inline comments. Mar 14 2018, 9:35 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1564	Hi @hakzsam, I'd like to see this patch move forward, so any comment on my comment? Thanks.

hakzsam added inline comments. Mar 15 2018, 1:17 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1564	Hi, sorry, I was working on something more important. Yes, I would like this patch to move forward too; I have to handle your testcase. I will update in the next few days. Thanks for your feedback!

msearles added inline comments. Mar 15 2018, 8:50 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1564	Sounds good; note that the testcase is llvm/trunk/test/CodeGen/AMDGPU/waitcnt-loop-single-basic-block.mir , which was pushed as part of https://reviews.llvm.org/rL327583 . I think if you remove that sanity check, then the testcase will continue to pass; however, it will fail with the sanity check, as the single bb is not revisited. Mark

msearles added inline comments. Apr 7 2018, 7:11 AM

lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1564	Hi @hakzsam - ping; let me know if you don't have time to work on this; I can pick up where you left off. Mark

Hi Mark,

Yes, feel free to update this patch if you have time because I'm not sure when I will be able to.

Thanks.

Picking up where @hakzsam left off

Change isLoopBottom() to return true in the case of a single basic block loop
Clear LoopWaitcntDataMap in between functions

Thanks for taking care of this!

The patch seems good to me, but as my LLVM skills are not that great, I'm not able to say "LGTM". :)

I will let AMDGPU LLVM people reply.

Ping; the patch looks good to the original author and to me, the follow-on author. However, it would be nice if someone other than one of us accepts the revision.

LGTM.

This revision is now accepted and ready to land. Apr 17 2018, 12:51 PM

Closed by commit rL330337: [AMDGPU] Do not only rely on BB number when finding bottom loop (authored by msearles). Apr 19 2018, 8:45 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Apr 19 2018, 8:45 AM

Revision Contents

Path	Size
lib/Target/AMDGPU/SIInsertWaitcnts.cpp	65 lines
test/CodeGen/AMDGPU/waitcnt-back-edge-loop.mir	59 lines

Diff 141867

lib/Target/AMDGPU/SIInsertWaitcnts.cpp

[361 unchanged lines not shown] private:

   DenseSet<MachineBasicBlock *> BlockVisitedSet;
   DenseSet<MachineInstr *> TrackedWaitcntSet;
   DenseSet<MachineInstr *> VCCZBugHandledSet;

   DenseMap<MachineBasicBlock *, std::unique_ptr<BlockWaitcntBrackets>>
       BlockWaitcntBracketsMap;

-  DenseSet<MachineBasicBlock *> BlockWaitcntProcessedSet;
+  std::vector<MachineBasicBlock *> BlockWaitcntProcessedSet;

   DenseMap<MachineLoop *, std::unique_ptr<LoopWaitcntData>> LoopWaitcntDataMap;

   std::vector<std::unique_ptr<BlockWaitcntBrackets>> KillWaitBrackets;

 public:
   static char ID;

[19 unchanged lines not shown] public:
   }

   bool mayAccessLDSThroughFlat(const MachineInstr &MI) const;
   void generateSWaitCntInstBefore(MachineInstr &MI,
                                   BlockWaitcntBrackets *ScoreBrackets);
   void updateEventWaitCntAfter(MachineInstr &Inst,
                                BlockWaitcntBrackets *ScoreBrackets);
   void mergeInputScoreBrackets(MachineBasicBlock &Block);
-  MachineBasicBlock *loopBottom(const MachineLoop *Loop);
+  bool isLoopBottom(const MachineLoop *Loop, const MachineBasicBlock *Block);
+  unsigned countNumBottomBlocks(const MachineLoop *Loop);
   void insertWaitcntInBlock(MachineFunction &MF, MachineBasicBlock &Block);
   void insertWaitcntBeforeCF(MachineBasicBlock &Block, MachineInstr *Inst);
   bool isWaitcntStronger(unsigned LHS, unsigned RHS);
   unsigned combineWaitcnt(unsigned LHS, unsigned RHS);
 };

 } // end anonymous namespace

[1,132 unchanged lines not shown] if (pred_gds_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {
         ScoreBrackets->setEventUB(
             GDS_GPR_LOCK,
             std::max(ScoreBrackets->getEventUB(GDS_GPR_LOCK), new_gds_ub));
       }
     }
     int pred_exp_ub = PredScoreBrackets->getEventUB(EXP_GPR_LOCK);
     if (pred_exp_ub > PredScoreBrackets->getScoreLB(EXP_CNT)) {
       int new_exp_ub = MaxPending[EXP_CNT] + pred_exp_ub -
                        PredScoreBrackets->getScoreUB(EXP_CNT);
       if (new_exp_ub > 0) {
         ScoreBrackets->setEventUB(
             EXP_GPR_LOCK,
             std::max(ScoreBrackets->getEventUB(EXP_GPR_LOCK), new_exp_ub));
       }
     }
   }

   // if a single block loop, update the score brackets. Not needed for other
   // blocks, as we did this in-place
   if (IsSelfPred) {
     BlockWaitcntBracketsMap[&Block] = llvm::make_unique<BlockWaitcntBrackets>(*ScoreBrackets);
   }
 }

-/// Return the "bottom" block of a loop. This differs from
-/// MachineLoop::getBottomBlock in that it works even if the loop is
-/// discontiguous.
-MachineBasicBlock *SIInsertWaitcnts::loopBottom(const MachineLoop *Loop) {
-  MachineBasicBlock *Bottom = Loop->getHeader();
-  for (MachineBasicBlock *MBB : Loop->blocks())
-    if (MBB->getNumber() > Bottom->getNumber())
-      Bottom = MBB;
-  return Bottom;
+/// Return true if the given basic block is a "bottom" block of a loop. This
+/// differs from MachineLoop::getBottomBlock in that it works even if the loop
+/// is discontiguous. This also handles multiple back-edges for the same
+/// "header" block of a loop.
+bool SIInsertWaitcnts::isLoopBottom(const MachineLoop *Loop,
+                                    const MachineBasicBlock *Block) {
+  for (MachineBasicBlock *MBB : Loop->blocks()) {
+    if (MBB == Block && MBB->isSuccessor(Loop->getHeader())) {
+      return true;
+    }
+  }
+  return false;
+}
+
+/// Count the number of "bottom" basic blocks of a loop.
+unsigned SIInsertWaitcnts::countNumBottomBlocks(const MachineLoop *Loop) {
+  unsigned Count = 0;
+  for (MachineBasicBlock *MBB : Loop->blocks()) {
+    if (MBB->isSuccessor(Loop->getHeader())) {
+      Count++;
+    }
+  }
+  return Count;
 }

 // Generate s_waitcnt instructions where needed.
 void SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
                                             MachineBasicBlock &Block) {
   // Initialize the state information.
   mergeInputScoreBrackets(Block);

[92 unchanged lines not shown] if (VCCZBugWorkAround) {
       VCCZBugHandledSet.insert(&Inst);
     }

     ++Iter;
   }

   // Check if we need to force convergence at loop footer.
   MachineLoop *ContainingLoop = MLI->getLoopFor(&Block);
-  if (ContainingLoop && loopBottom(ContainingLoop) == &Block) {
+  if (ContainingLoop && isLoopBottom(ContainingLoop, &Block)) {
     LoopWaitcntData *WaitcntData = LoopWaitcntDataMap[ContainingLoop].get();
     WaitcntData->print();
     DEBUG(dbgs() << '\n';);

     // The iterative waitcnt insertion algorithm aims for optimal waitcnt
     // placement and doesn't always guarantee convergence for a loop. Each
     // loop should take at most 2 iterations for it to converge naturally.
     // When this max is reached and result doesn't converge, we force
[71 unchanged lines not shown] RegisterEncoding.VGPRL =
       RegisterEncoding.VGPR0 + HardwareLimits.NumVGPRsMax - 1;
   RegisterEncoding.SGPR0 = TRI->getEncodingValue(AMDGPU::SGPR0);
   RegisterEncoding.SGPRL =
       RegisterEncoding.SGPR0 + HardwareLimits.NumSGPRsMax - 1;

   TrackedWaitcntSet.clear();
   BlockVisitedSet.clear();
   VCCZBugHandledSet.clear();
+  LoopWaitcntDataMap.clear();

   // Walk over the blocks in reverse post-dominator order, inserting
   // s_waitcnt where needed.
   ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
   bool Modified = false;
   for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator
            I = RPOT.begin(),
            E = RPOT.end(), J = RPOT.begin();
[10 unchanged lines not shown] for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator
     ScoreBrackets->setPostOrder(MBB.getNumber());
     MachineLoop *ContainingLoop = MLI->getLoopFor(&MBB);
     if (ContainingLoop && LoopWaitcntDataMap[ContainingLoop] == nullptr)
       LoopWaitcntDataMap[ContainingLoop] = llvm::make_unique<LoopWaitcntData>();

     // If we are walking into the block from before the loop, then guarantee
     // at least 1 re-walk over the loop to propagate the information, even if
     // no S_WAITCNT instructions were generated.
-    if (ContainingLoop && ContainingLoop->getHeader() == &MBB && J < I &&
-        (!BlockWaitcntProcessedSet.count(&MBB))) {
-      BlockWaitcntBracketsMap[&MBB]->setRevisitLoop(true);
-      DEBUG(dbgs() << "set-revisit: Block"
-                   << ContainingLoop->getHeader()->getNumber() << '\n';);
+    if (ContainingLoop && ContainingLoop->getHeader() == &MBB) {
+      unsigned Count = countNumBottomBlocks(ContainingLoop);
+
+      // If the loop has multiple back-edges, and so more than one "bottom"
+      // basic block, we have to guarantee a re-walk over every blocks.
+      if ((std::count(BlockWaitcntProcessedSet.begin(),
+                      BlockWaitcntProcessedSet.end(), &MBB) < Count)) {
+        BlockWaitcntBracketsMap[&MBB]->setRevisitLoop(true);
+        DEBUG(dbgs() << "set-revisit: Block"
+                     << ContainingLoop->getHeader()->getNumber() << '\n';);
+      }
     }

     // Walk over the instructions.
     insertWaitcntInBlock(MF, MBB);

     // Flag that waitcnts have been processed at least once.
-    BlockWaitcntProcessedSet.insert(&MBB);
+    BlockWaitcntProcessedSet.push_back(&MBB);

-    // See if we want to revisit the loop.
-    if (ContainingLoop && loopBottom(ContainingLoop) == &MBB) {
+    // See if we want to revisit the loop. If a loop has multiple back-edges,
+    // we shouldn't revisit the same "bottom" basic block.
+    if (ContainingLoop && isLoopBottom(ContainingLoop, &MBB) &&
+        std::count(BlockWaitcntProcessedSet.begin(),
+                   BlockWaitcntProcessedSet.end(), &MBB) == 1) {
       MachineBasicBlock *EntryBB = ContainingLoop->getHeader();
       BlockWaitcntBrackets *EntrySB = BlockWaitcntBracketsMap[EntryBB].get();
       if (EntrySB && EntrySB->getRevisitLoop()) {
         EntrySB->setRevisitLoop(false);
         J = I;
         int32_t PostOrder = EntrySB->getPostOrder();
         // TODO: Avoid this loop. Find another way to set I.
         for (ReversePostOrderTraversal<MachineFunction *>::rpo_iterator
[90 unchanged lines not shown]

test/CodeGen/AMDGPU/waitcnt-back-edge-loop.mir

This file was added.

# RUN: llc -o - %s -march=amdgcn -mcpu=fiji -run-pass=si-insert-waitcnts -verify-machineinstrs | FileCheck -check-prefix=GCN %s

# GCN-LABEL: waitcnt-back-edge-loop
# GCN: bb.2
# GCN: S_WAITCNT 112
# GCN: $vgpr5 = V_CVT_I32_F32_e32 killed $vgpr5, implicit $exec

---
name: waitcnt-back-edge-loop
body: |
  bb.0:
    successors: %bb.1

    $vgpr1 = V_MOV_B32_e32 0, implicit $exec, implicit-def $vgpr1_vgpr2
    $vgpr2 = V_MOV_B32_e32 0, implicit $exec, implicit-def $vgpr1_vgpr2
    $vgpr4 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, implicit $exec, implicit $flat_scr :: (volatile load 4 from `float addrspace(1)* null`, addrspace 1)
    $vgpr0 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, implicit $exec, implicit $flat_scr :: (volatile load 4 from `float addrspace(1)* null`, addrspace 1)
    $sgpr0_sgpr1 = V_CMP_EQ_U32_e64 3, killed $sgpr4, implicit $exec
    $vgpr3 = V_CNDMASK_B32_e64 -1082130432, 1065353216, killed $sgpr0_sgpr1, implicit $exec
    $vgpr5 = V_MOV_B32_e32 $vgpr0, implicit $exec, implicit $exec
    S_BRANCH %bb.1

  bb.3:
    successors: %bb.1

    $vgpr5 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, implicit $exec, implicit $flat_scr :: (volatile load 4 from `float addrspace(1)* null`, addrspace 1)

  bb.1:
    successors: %bb.5, %bb.2

    $vgpr5 = V_CVT_I32_F32_e32 killed $vgpr5, implicit $exec
    V_CMP_NE_U32_e32 0, $vgpr5, implicit-def $vcc, implicit $exec
    $vcc = S_AND_B64 $exec, killed $vcc, implicit-def dead $scc
    S_CBRANCH_VCCZ %bb.5, implicit killed $vcc

  bb.2:
    successors: %bb.4, %bb.3

    V_CMP_EQ_U32_e32 9, killed $vgpr5, implicit-def $vcc, implicit $exec
    $vcc = S_AND_B64 $exec, killed $vcc, implicit-def dead $scc
    S_CBRANCH_VCCZ %bb.3, implicit killed $vcc

  bb.4:
    successors: %bb.3, %bb.1

    $vgpr5 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, implicit $exec, implicit $flat_scr :: (volatile load 4 from `float addrspace(1)* null`, addrspace 1)
    $vgpr4 = V_CVT_I32_F32_e32 $vgpr5, implicit $exec
    V_CMP_EQ_U32_e32 2, killed $vgpr4, implicit-def $vcc, implicit $exec
    $vcc = S_AND_B64 $exec, killed $vcc, implicit-def dead $scc
    $vgpr4 = V_MOV_B32_e32 $vgpr5, implicit $exec, implicit $exec
    S_CBRANCH_VCCZ %bb.1, implicit killed $vcc
    S_BRANCH %bb.3

  bb.5:

    $vgpr4 = V_MAC_F32_e32 killed $vgpr0, killed $vgpr3, killed $vgpr4, implicit $exec
    EXP_DONE 12, killed $vgpr4, undef $vgpr0, undef $vgpr0, undef $vgpr0, 0, 0, 15, implicit $exec
    S_ENDPGM
...