This is an archive of the discontinued LLVM Phabricator instance.

Differential D117544

[AMDGPU] Fix missing waitcnt issue
ClosedPublic

Authored by piotr on Jan 18 2022, 12:19 AM.

Details

Reviewers

foad
arsenm
rampitec
kerbowa
bsaleil

Commits

rG8dfb417e67e3: [AMDGPU] Fix missing waitcnt issue

Summary

Ignore out of order counters when merging brackets. The fact that
there was a pending event in the old state does not guarantee that
the waitcnt was generated, so we still need to conservatively re-process
the block.

The patch fixes a correctness issue where the block was not re-processed
and, as a consequence, the required waitcnt was not inserted.
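
To make the role of re-processing concrete, here is a minimal sketch of a fixed-point sweep over basic blocks, where a merge that reports new information marks the successor for another visit. It is a toy model, not the pass's actual driver loop: `ToyBlock`, `mergeIncoming`, and `runToFixedPoint` are hypothetical names, and the per-block state is reduced to a bitmask of pending events.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a basic block; the state is just a bitmask.
struct ToyBlock {
  unsigned Gen = 0;               // events issued inside the block
  std::vector<std::size_t> Succs; // indices of successor blocks
  unsigned Incoming = 0;          // merged state on entry to the block
  bool Dirty = true;              // block still needs (re-)processing
  unsigned TimesProcessed = 0;
};

// Merge Other into State. Returns true if Other contributed something that
// State did not already contain, i.e. the successor must be re-processed.
bool mergeIncoming(unsigned &State, unsigned Other) {
  const bool StrictDom = (Other & ~State) != 0;
  State |= Other;
  return StrictDom;
}

// Sweep the blocks in order until no merge reports new information.
void runToFixedPoint(std::vector<ToyBlock> &Blocks) {
  bool Repeat;
  do {
    Repeat = false;
    for (std::size_t I = 0; I < Blocks.size(); ++I) {
      ToyBlock &B = Blocks[I];
      if (!B.Dirty)
        continue;
      B.Dirty = false;
      ++B.TimesProcessed;
      const unsigned Out = B.Incoming | B.Gen;
      for (std::size_t S : B.Succs) {
        if (mergeIncoming(Blocks[S].Incoming, Out)) {
          Blocks[S].Dirty = true;
          if (S <= I)
            Repeat = true; // back edge: the earlier block needs another sweep
        }
      }
    }
  } while (Repeat);
}
```

With a loop such as block 1 → block 2 → block 1, the back edge delivers new state to block 1 after its first visit, so block 1 is visited a second time. If the merge wrongly reported "nothing new", that second visit would be skipped, which is the kind of missed re-processing the summary describes.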

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

piotr created this revision. Jan 18 2022, 12:19 AM

Herald added subscribers: foad, kerbowa, hiraditya and 8 others. Jan 18 2022, 12:19 AM

piotr requested review of this revision. Jan 18 2022, 12:19 AM

Herald added a project: Restricted Project. Jan 18 2022, 12:19 AM

Herald added subscribers: llvm-commits, wdng.

piotr added reviewers: foad, arsenm, rampitec, kerbowa, bsaleil. Jan 18 2022, 12:23 AM

foad added inline comments. Jan 18 2022, 12:27 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428	As an experiment, would changing the condition to `RegStrictDom && !OldOutOfOrder && MyPending != 0` also fix the bug?

foad added inline comments. Jan 18 2022, 12:29 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428	> As an experiment, would changing the condition to `RegStrictDom && !OldOutOfOrder && MyPending != 0` also fix the bug?

Sorry I meant `RegStrictDom && !(OldOutOfOrder && MyPending != 0)`.

Harbormaster completed remote builds in B143952: Diff 400747. Jan 18 2022, 12:50 AM

piotr added inline comments. Jan 18 2022, 1:10 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428	> > As an experiment, would changing the condition to `RegStrictDom && !OldOutOfOrder && MyPending != 0` also fix the bug?
> Sorry I meant `RegStrictDom && !(OldOutOfOrder && MyPending != 0)`.

Do you mean this: `if (RegStrictDom && !(OldOutOfOrder && MyPending == 0))`? I think that would also fix the bug in the test, but I wasn't convinced that was a complete fix, so I went for a safer and simpler approach.
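
The candidate conditions in this thread can be compared as plain boolean functions. The sketch below does only that. The function names are hypothetical labels for the expressions quoted above, `MyPending` is restricted to 0 and 1 as stand-ins for "nothing pending" and "something pending", and no claim is made about which input combinations can actually occur in the pass.

```cpp
#include <cassert>

// Each function answers: "should the block be re-processed after the merge?"
using Condition = bool (*)(bool RegStrictDom, bool OldOutOfOrder,
                           unsigned MyPending);

// Condition before the patch.
constexpr bool beforePatch(bool RegStrictDom, bool OldOutOfOrder, unsigned) {
  return RegStrictDom && !OldOutOfOrder;
}

// First suggestion in the thread.
constexpr bool firstSuggestion(bool RegStrictDom, bool OldOutOfOrder,
                               unsigned MyPending) {
  return RegStrictDom && !OldOutOfOrder && MyPending != 0;
}

// Suggestion after the "Sorry I meant" correction.
constexpr bool correctedSuggestion(bool RegStrictDom, bool OldOutOfOrder,
                                   unsigned MyPending) {
  return RegStrictDom && !(OldOutOfOrder && MyPending != 0);
}

// The author's reading of the suggestion ("Do you mean this").
constexpr bool authorReading(bool RegStrictDom, bool OldOutOfOrder,
                             unsigned MyPending) {
  return RegStrictDom && !(OldOutOfOrder && MyPending == 0);
}

// Condition in the patch under review.
constexpr bool landed(bool RegStrictDom, bool, unsigned) {
  return RegStrictDom;
}

// Number of the 8 input combinations for which Cond requests re-processing.
inline int countReprocess(Condition Cond) {
  int Count = 0;
  for (int R = 0; R < 2; ++R)
    for (int O = 0; O < 2; ++O)
      for (unsigned P = 0; P < 2; ++P)
        Count += Cond(R != 0, O != 0, P) ? 1 : 0;
  return Count;
}

// The combination the patch changes: a raised register score while the old
// state was out of order.
static_assert(!beforePatch(true, true, 1) && landed(true, true, 1),
              "patch re-processes where the old condition did not");
```

Over the eight combinations, the patched condition requests re-processing in four, the pre-patch condition in two, the first suggestion in one, and the corrected suggestion and the author's reading in three each. Because the first suggestion is the pre-patch condition with an extra conjunct, it can never request re-processing where the pre-patch condition did not.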

As a bit of background, the OldOutOfOrder test was introduced by @nhaehnle in a large refactoring in D54231. I think it's just a performance optimisation: if the old state had events that completed out of order, then any use of the corresponding registers would have to be preceded by a "waitcnt 0", so there's no need to reprocess that block because the waitcnts are already as strict as they can be. But this goes wrong in the case where the merge introduces a wait on a particular register that had no wait in the old state (so no waitcnt would have been generated for it the last time the block was processed).
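
The failure mode described above can be reproduced in a small model. This is a sketch under stated assumptions: `ToyBracket`, `MergeResult`, and `toyMerge` are hypothetical names, a single counter is modelled, and the score comparison is a simplified guess at how per-register scores are merged, not a copy of `mergeScore`. Both brackets are assumed to track the same number of registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for one counter's bracket state; not the real WaitcntBrackets.
struct ToyBracket {
  unsigned LB = 0;                 // scores <= LB have already completed
  unsigned UB = 0;                 // score of the most recent pending event
  bool OutOfOrder = false;         // pending events may complete out of order
  std::vector<unsigned> RegScores; // per-register score; <= LB means "no wait"
};

struct MergeResult {
  bool OldCondition; // RegStrictDom && !OldOutOfOrder (before the patch)
  bool NewCondition; // RegStrictDom (after the patch)
};

// Merge Other into Mine and report what each condition would decide.
MergeResult toyMerge(ToyBracket &Mine, const ToyBracket &Other) {
  const bool OldOutOfOrder = Mine.OutOfOrder;
  const unsigned MyPending = Mine.UB - Mine.LB;
  const unsigned OtherPending = Other.UB - Other.LB;
  const unsigned NewUB = Mine.LB + std::max(MyPending, OtherPending);
  const unsigned MyShift = NewUB - Mine.UB;
  const unsigned OtherShift = NewUB - Other.UB;

  bool RegStrictDom = false;
  for (std::size_t I = 0; I < Mine.RegScores.size(); ++I) {
    // A score at or below the lower bound means nothing is pending.
    const unsigned MyShifted =
        Mine.RegScores[I] <= Mine.LB ? 0 : Mine.RegScores[I] + MyShift;
    const unsigned OtherShifted =
        Other.RegScores[I] <= Other.LB ? 0 : Other.RegScores[I] + OtherShift;
    Mine.RegScores[I] = std::max(MyShifted, OtherShifted);
    RegStrictDom |= OtherShifted > MyShifted;
  }
  Mine.UB = NewUB;
  Mine.OutOfOrder = OldOutOfOrder || Other.OutOfOrder;
  return {RegStrictDom && !OldOutOfOrder, RegStrictDom};
}
```

If the old state is out of order with a pending score only on register 0, and the incoming state has a pending score only on register 1, the merge gives register 1 a wait it did not have before. In this model the pre-patch condition still reports "no re-processing needed", while the patched condition requests it.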

Did you notice any compile-time degradation from your fix?

In our usual compile-time tests it shows 0.056% degradation on average, worst case 0.9%.

In D117544#3250636, @piotr wrote:

In our usual compile-time tests it shows 0.056% degradation on average, worst case 0.9%.

In that case I agree we should go with your fix as it is the simplest.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1428	> Do you mean this:

I'm not sure what I meant; sorry for the noise. Now I think it might be possible to salvage the original optimisation by remembering whether any individual register has a wait in the merged state, when it had no wait in the original state. But I don't think it's worth the complexity.
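
For illustration only, one possible reading of the "remember whether a register gained a wait" idea is sketched below. Nothing like this was implemented or validated in this revision. `ScoreMergeFlags`, `mergeScoreTracked`, and `needsReprocessSketch` are hypothetical names, and the shifting of scores is a simplified assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result of merging one register's score.
struct ScoreMergeFlags {
  bool Raised = false;  // merged score is higher than the old (shifted) score
  bool NewWait = false; // register had no wait before the merge and has one now
};

// Merge OtherScore into Score. Scores at or below their lower bound (OldLB,
// OtherLB) are treated as "no wait".
ScoreMergeFlags mergeScoreTracked(unsigned &Score, unsigned OtherScore,
                                  unsigned OldLB, unsigned OtherLB,
                                  unsigned MyShift, unsigned OtherShift) {
  const unsigned MyShifted = Score <= OldLB ? 0 : Score + MyShift;
  const unsigned OtherShifted =
      OtherScore <= OtherLB ? 0 : OtherScore + OtherShift;
  Score = std::max(MyShifted, OtherShifted);

  ScoreMergeFlags Flags;
  Flags.Raised = OtherShifted > MyShifted;
  Flags.NewWait = MyShifted == 0 && OtherShifted != 0;
  return Flags;
}

// Unvalidated sketch: keep the out-of-order shortcut only when no register
// gained a wait that it lacked in the old state.
bool needsReprocessSketch(bool AnyRaised, bool AnyNewWait, bool OldOutOfOrder) {
  return AnyNewWait || (AnyRaised && !OldOutOfOrder);
}
```

The extra flag has to be accumulated across every register loop in the merge, which is the added complexity the comment above judges not worthwhile compared with dropping the `OldOutOfOrder` test.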

This revision is now accepted and ready to land. Jan 18 2022, 1:44 AM

Remove references to "waw" as the issue can also trigger in other scenarios, as Jay pointed out to me (thanks).

foad accepted this revision. Jan 18 2022, 5:23 AM

Harbormaster completed remote builds in B143996: Diff 400815. Jan 18 2022, 5:24 AM

This revision was landed with ongoing or failed builds. Jan 19 2022, 1:55 AM

Closed by commit rG8dfb417e67e3: [AMDGPU] Fix missing waitcnt issue (authored by piotr).

This revision was automatically updated to reflect the committed changes.

piotr added a commit: rG8dfb417e67e3: [AMDGPU] Fix missing waitcnt issue.

foad mentioned this in rG6ed6e8e3b864: [AMDGPU] Remove RegStrictDom variable. NFC. Nov 22 2022, 9:09 AM

foad mentioned this in rGbc0c89d3d853: [AMDGPU] More cleanup after D117544. NFC. Nov 22 2022, 1:15 PM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp	3 lines
llvm/test/CodeGen/AMDGPU/waitcnt.mir	36 lines

Diff 401138

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

[1,376 lines not shown]
bool WaitcntBrackets::merge(const WaitcntBrackets &Other) {		bool WaitcntBrackets::merge(const WaitcntBrackets &Other) {
bool StrictDom = false;		bool StrictDom = false;

VgprUB = std::max(VgprUB, Other.VgprUB);		VgprUB = std::max(VgprUB, Other.VgprUB);
SgprUB = std::max(SgprUB, Other.SgprUB);		SgprUB = std::max(SgprUB, Other.SgprUB);

for (auto T : inst_counter_types()) {		for (auto T : inst_counter_types()) {
// Merge event flags for this counter		// Merge event flags for this counter
const bool OldOutOfOrder = counterOutOfOrder(T);
const unsigned OldEvents = PendingEvents & WaitEventMaskForInst[T];		const unsigned OldEvents = PendingEvents & WaitEventMaskForInst[T];
const unsigned OtherEvents = Other.PendingEvents & WaitEventMaskForInst[T];		const unsigned OtherEvents = Other.PendingEvents & WaitEventMaskForInst[T];
if (OtherEvents & ~OldEvents)		if (OtherEvents & ~OldEvents)
StrictDom = true;		StrictDom = true;
PendingEvents |= OtherEvents;		PendingEvents |= OtherEvents;

// Merge scores for this counter		// Merge scores for this counter
const unsigned MyPending = ScoreUBs[T] - ScoreLBs[T];		const unsigned MyPending = ScoreUBs[T] - ScoreLBs[T];
[26 lines not shown; context: for (auto T : inst_counter_types()) {]
}		}

if (T == LGKM_CNT) {		if (T == LGKM_CNT) {
for (int J = 0; J <= SgprUB; J++) {		for (int J = 0; J <= SgprUB; J++) {
RegStrictDom |= mergeScore(M, SgprScores[J], Other.SgprScores[J]);		RegStrictDom |= mergeScore(M, SgprScores[J], Other.SgprScores[J]);
}		}
}		}

if (RegStrictDom && !OldOutOfOrder)		if (RegStrictDom)
StrictDom = true;		StrictDom = true;
}		}

return StrictDom;		return StrictDom;
}		}

// Generate s_waitcnt instructions where needed.		// Generate s_waitcnt instructions where needed.
bool SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,		bool SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
[292 lines not shown]

llvm/test/CodeGen/AMDGPU/waitcnt.mir

[38 lines not shown; context: --- |]

define amdgpu_kernel void @cross_bundle() {		define amdgpu_kernel void @cross_bundle() {
ret void		ret void
}		}

define amdgpu_kernel void @subregs16bit() {		define amdgpu_kernel void @subregs16bit() {
ret void		ret void
}		}

		define amdgpu_kernel void @waitcnt_backedge() {
		ret void
		}
...		...
---		---

# CHECK-LABEL: name: flat_zero_waitcnt		# CHECK-LABEL: name: flat_zero_waitcnt

# CHECK-LABEL: bb.0:		# CHECK-LABEL: bb.0:
# CHECK: FLAT_LOAD_DWORD		# CHECK: FLAT_LOAD_DWORD
# CHECK: FLAT_LOAD_DWORDX4		# CHECK: FLAT_LOAD_DWORDX4
[272 lines not shown; context: machineFunctionInfo:]
isEntryFunction: true		isEntryFunction: true
body: |		body: |
bb.0:		bb.0:
liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $vgpr4		liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $vgpr4
$vgpr0 = FLAT_LOAD_USHORT killed $vgpr0_vgpr1, 0, 0, implicit $exec, implicit $flat_scr		$vgpr0 = FLAT_LOAD_USHORT killed $vgpr0_vgpr1, 0, 0, implicit $exec, implicit $flat_scr
$vgpr1 = FLAT_LOAD_USHORT killed $vgpr2_vgpr3, 0, 0, implicit $exec, implicit $flat_scr		$vgpr1 = FLAT_LOAD_USHORT killed $vgpr2_vgpr3, 0, 0, implicit $exec, implicit $flat_scr
V_NOP_e32 implicit $exec, implicit $vgpr0_lo16, implicit $vgpr1_lo16		V_NOP_e32 implicit $exec, implicit $vgpr0_lo16, implicit $vgpr1_lo16
...		...

		---
		# Waitcnt required before the use of $sgpr10_sgpr11, as the S_LOAD also writes
		# to $sgpr10_sgpr11, and can occur first in the program running order.

		# CHECK-LABEL: name: waitcnt_backedge
		# CHECK: S_WAITCNT
		# CHECK: $sgpr10_sgpr11 = S_CSELECT_B64
		# CHECK: $sgpr10_sgpr11 = S_LOAD_DWORDX2_IMM


		name: waitcnt_backedge
		body: |
		bb.0:
		renamable $sgpr4_sgpr5_sgpr6_sgpr7 = S_LOAD_DWORDX4_IMM renamable $sgpr2_sgpr3, 32, 0 :: (load (s128) from `i32 addrspace(4)* undef`, addrspace 4)

		bb.4:
		renamable $sgpr10_sgpr11 = S_CSELECT_B64 -1, 0, implicit killed $scc
		renamable $vgpr1 = BUFFER_LOAD_DWORD_OFFEN killed renamable $vgpr5, renamable $sgpr4_sgpr5_sgpr6_sgpr7, 0, 0, 1, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
		renamable $sgpr10_sgpr11 = S_LOAD_DWORDX2_IMM killed renamable $sgpr0_sgpr1, 0, 0 :: (load (s64) from `i32 addrspace(4)* undef`, align 4, addrspace 4)
		S_CBRANCH_SCC0 %bb.9, implicit killed $scc

		bb.9:
		renamable $vgpr1 = V_CNDMASK_B32_e64 0, 0, 0, 1, killed $sgpr14_sgpr15, implicit $exec
		S_CBRANCH_SCC0 %bb.14, implicit killed $scc

		bb.10:
		S_BRANCH %bb.4

		bb.14:
		S_ENDPGM 0
		...