This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable WQM if demotes and softwqm are combined
Abandoned · Public

Authored by critson on May 4 2022, 10:56 PM.

Details

Reviewers

foad
ruiling
piotr

Summary

Demotes may be used to explicitly create helper invocations.
These helper invocations may be intended to have observable effects
in WQM, e.g. in fragment shader subgroup operations.
Facilitate this behaviour by forcing softwqm operations to run
in WQM when demotes are present in a shader.
Conversely, this allows such operations to be marked softwqm so that
helper lanes are only enabled if demotes are present.
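
The intended behaviour can be illustrated with a minimal standalone C++ sketch. This is a simplified model, not the code from the patch: `ShaderScan`, `kStateWQM`, `globalFlags`, `softWQMRunsInWQM` and the `patchApplied` switch are hypothetical names, and the sketch assumes that without the change only hard WQM requirements set the global WQM flag.

```cpp
// Hypothetical summary of what a scan over a shader might record.
struct ShaderScan {
  bool hasHardWQM;     // e.g. an instruction that strictly requires WQM
  unsigned numSoftWQM; // number of softwqm-marked operations
  bool hasDemotes;     // at least one demote in the shader
};

// Illustrative flag bit; the real pass defines its own state bits.
constexpr unsigned kStateWQM = 0x1;

// Model of the global flag computation. With patchApplied == true, the
// combination "demote + softwqm" is treated as a request for WQM.
constexpr unsigned globalFlags(const ShaderScan &s, bool patchApplied) {
  unsigned flags = s.hasHardWQM ? kStateWQM : 0u;
  if (patchApplied && s.hasDemotes && s.numSoftWQM != 0)
    flags |= kStateWQM;
  return flags;
}

// softwqm operations run in WQM only if WQM is enabled somewhere.
constexpr bool softWQMRunsInWQM(const ShaderScan &s, bool patchApplied) {
  return s.numSoftWQM != 0 && (globalFlags(s, patchApplied) & kStateWQM) != 0;
}
```

In this model, a shader with one softwqm operation and a demote but no hard WQM requirement runs the softwqm operation in WQM only when `patchApplied` is true; a shader with demotes but no softwqm operations is unaffected.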

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

critson created this revision. May 4 2022, 10:56 PM

Herald added a project: Restricted Project. May 4 2022, 10:56 PM

Herald added subscribers: jsilvanus, hsmhsm, kerbowa and 9 others.

critson requested review of this revision. May 4 2022, 10:56 PM

Herald added a project: Restricted Project. May 4 2022, 10:56 PM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B162833: Diff 427200. May 4 2022, 11:39 PM

The demote itself does not need to enable wqm. So I think soft_wqm should not depend on demote to enable wqm. If some graphics operation needs to run under wqm, it should directly use the non-soft version.

softwqm is used by our implementation of subgroup operations: if helper invocations are there, then subgroup operations must interact with them.

Helper invocations exist in one of two cases:

There are "hardwqm" instructions (e.g. sampler).
Invocations are explicitly demoted.

In the latter case, we need to make sure that the demoted invocations are still live for the purpose of feeding their values into those subgroup operations. Simply pretending that hardwqm instructions exist seems like a correct way to ensure that. Perhaps you have an alternative in mind, @ruiling?

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll
284–301	Can you pre-submit this test with the incorrect code sequence (ideally, update_llc_test_checks?) and then update this review so that the diff is more clearly visible?

critson mentioned this in rG78ab7adbd39e: [AMDGPU] Pre-commit test for D124981. NFC. May 9 2022, 5:49 PM

In D124981#3501190, @nhaehnle wrote:

softwqm is used by our implementation of subgroup operations: if helper invocations are there, then subgroup operations must interact with them.

Helper invocations exist in one of two cases:

There are "hardwqm" instructions (e.g. sampler).

Invocations are explicitly demoted.

There are two possible sources of helper invocations:

A quad may have pixels not covered by current primitive
An active invocation is demoted into helper invocation.

Normally we don't activate helper invocations, in order to save memory bandwidth and power. But if the graphics specification requires defined behavior that needs helper invocations to participate in certain shader operations, then we have to turn the helper invocations on, and for that we need to place a non-soft wqm. My understanding of the working logic is that it is the observer (image sample, derivative, subgroupQuad operation) that decides whether we should turn on the helper invocations for a quad with at least one active invocation. My understanding of soft_wqm (a kind of weak observer) is that it lets other, strong observers decide whether we should turn on wqm. I guess we are making this change to fix a subgroupQuad*** issue, for which I believe we need a non-soft wqm. I did not follow all the details of how demote was implemented, so please correct me if I misread the change. But I still think that if there is no strong observer (hard-wqm), we should not enable wqm. If an operation really needs to be executed under wqm, then it should use hard-wqm.
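
For readers unfamiliar with what "turning on the helper invocations for a quad with at least one active invocation" means at the mask level, here is a small C++ sketch. It assumes a 64-lane wave in which each quad occupies four consecutive lanes, which is the expansion that `s_wqm_b64` performs on a mask; `wholeQuadMask` and `helperLanes` are illustrative names, not functions from LLVM.

```cpp
#include <cstdint>

// Expand a 64-lane mask to whole quads: if any of the four lanes of a quad
// is set in `live`, all four lanes of that quad are set in the result.
constexpr std::uint64_t wholeQuadMask(std::uint64_t live) {
  // Fold each quad's four bits onto the quad's lowest bit.
  std::uint64_t any = live | (live >> 1) | (live >> 2) | (live >> 3);
  any &= 0x1111111111111111ULL; // keep one bit per quad
  return any * 0xF;             // replicate that bit to all four lanes
}

// Helper lanes are the lanes enabled by the expansion that were not live.
constexpr std::uint64_t helperLanes(std::uint64_t live) {
  return wholeQuadMask(live) & ~live;
}
```

For example, with only lane 5 live, `wholeQuadMask` enables lanes 4 to 7 and `helperLanes` reports lanes 4, 6 and 7.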

Rebase on to pre-committed test

I think there is perhaps another way to see this.

From the specification for SPV_EXT_demote_to_helper_invocation:

Demote fragment shader invocation to a helper invocation. [...]
The implementation may terminate helper invocations before the end of the shader as an optimization, but doing so must not affect derivative calculations and does not make control flow non-uniform.

One way to read this is that demote should always create helper lanes, i.e. demote should always run WQM.
Our implementation doesn't do this and instead executes demote as a kill outside WQM regions.
In a sense, this is actually fixing an issue with the demote implementation, rather than with softwqm.

I agree with the description of softwqm as a "weak observer", but demote is supposed to be a strong source of observable derivatives.
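
The difference between demote inside and outside a WQM region can be modelled with a short C++ sketch, loosely following the mask updates visible in the `test_demote_1` check lines (`s_andn2_b64` on the live mask, then either `s_and_b64 exec, exec, live` or `s_wqm_b64` followed by `s_and_b64`). The names `Masks`, `wholeQuadMask` and `demote` are hypothetical, and the model assumes 64 lanes with four consecutive lanes per quad.

```cpp
#include <cstdint>

struct Masks {
  std::uint64_t exec; // lanes currently executing
  std::uint64_t live; // lanes that are still non-helper invocations
};

// If any lane of a quad is set, set all four lanes of that quad.
constexpr std::uint64_t wholeQuadMask(std::uint64_t m) {
  std::uint64_t any = (m | (m >> 1) | (m >> 2) | (m >> 3)) & 0x1111111111111111ULL;
  return any * 0xF;
}

// Demote the lanes in `demoted`. The live mask always loses them.
// In WQM, a demoted lane keeps executing as a helper while its quad still
// has a live lane; outside WQM, exec shrinks to the live lanes, so the
// demote behaves like a kill.
constexpr Masks demote(Masks m, std::uint64_t demoted, bool inWQM) {
  m.live &= ~demoted;
  m.exec &= inWQM ? wholeQuadMask(m.live) : m.live;
  return m;
}
```

Demoting lane 0 of a fully live quad leaves `exec` at `0xF` in WQM (lane 0 continues as a helper) but reduces it to `0xE` outside WQM, which is the behaviour described above as "demote as a kill".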

Harbormaster completed remote builds in B163671: Diff 428338. May 10 2022, 5:24 AM

Having gone over the argument with @ruiling separately, I think he has a point. Have you double-checked whether this change is still required after the related LLPC changes have been made?

This change is not required if front-end uses explicit WQM.

critson mentioned this in rG698fda0e3ecc: [AMDGPU] Remove pre-committed test for D124981. NFC. May 12 2022, 12:04 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp	9 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll	5 lines

Diff 428338

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp

[483 lines not shown]
 char SIWholeQuadMode::scanInstructions(MachineFunction &MF,
                                        std::vector<WorkItem> &Worklist) {
   char GlobalFlags = 0;
   bool WQMOutputs = MF.getFunction().hasFnAttribute("amdgpu-ps-wqm-outputs");
   SmallVector<MachineInstr *, 4> SetInactiveInstrs;
   SmallVector<MachineInstr *, 4> SoftWQMInstrs;
   bool HasImplicitDerivatives =
       MF.getFunction().getCallingConv() == CallingConv::AMDGPU_PS;
+  bool HasDemotes = false;

   // We need to visit the basic blocks in reverse post-order so that we visit
   // defs before uses, in particular so that we don't accidentally mark an
   // instruction as needing e.g. WQM before visiting it and realizing it needs
   // WQM disabled.
   ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
   for (MachineBasicBlock *MBB : RPOT) {
     BlockInfo &BBI = Blocks[MBB];
[68 lines not shown; enclosing context: for (MachineInstr &MI : *MBB) {]
       } else {
         if (Opcode == AMDGPU::SI_PS_LIVE || Opcode == AMDGPU::SI_LIVE_MASK) {
           LiveMaskQueries.push_back(&MI);
         } else if (Opcode == AMDGPU::SI_KILL_I1_TERMINATOR ||
                    Opcode == AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR ||
                    Opcode == AMDGPU::SI_DEMOTE_I1) {
           KillInstrs.push_back(&MI);
           BBI.NeedsLowering = true;
+          if (Opcode == AMDGPU::SI_DEMOTE_I1)
+            HasDemotes = true;
         } else if (WQMOutputs) {
           // The function is in machine SSA form, which means that physical
           // VGPRs correspond to shader inputs and outputs. Inputs are
           // only used, outputs are only defined.
           // FIXME: is this still valid?
           for (const MachineOperand &MO : MI.defs()) {
             if (!MO.isReg())
               continue;
[12 lines not shown; enclosing context: for (MachineInstr &MI : *MBB) {]
           continue;
       }

       markInstruction(MI, Flags, Worklist);
       GlobalFlags |= Flags;
     }
   }

+  // Demotes may be used to intentionally introduce new helper lanes.
+  // Enable WQM to facilitate this effect if there are operations which
+  // would change behaviour when run in WQM, i.e. SOFT_WQM instructions.
+  if (HasDemotes && !SoftWQMInstrs.empty())
+    GlobalFlags |= StateWQM;
+
   // Mark sure that any SET_INACTIVE instructions are computed in WQM if WQM is
   // ever used anywhere in the function. This implements the corresponding
   // semantics of @llvm.amdgcn.set.inactive.
   // Similarly for SOFT_WQM instructions, implementing @llvm.amdgcn.softwqm.
   if (GlobalFlags & StateWQM) {
     for (MachineInstr *MI : SetInactiveInstrs)
       markInstruction(*MI, StateWQM, Worklist);
     for (MachineInstr *MI : SoftWQMInstrs)
[987 lines not shown]

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll

[275 lines not shown; enclosing context: ELSE:]
   call void @llvm.amdgcn.struct.buffer.store.f32(float %data.sample, <4 x i32> undef, i32 %c, i32 0, i32 0, i32 0)
   br label %END

 END:
   %r = phi float [ %data.if, %IF ], [ %data, %ELSE ]
   ret float %r
 }

 ; Check that WQM is triggered for softwqm with demote.
 ;
 define amdgpu_ps float @test_demote_1(i32 inreg %idx0, i32 inreg %idx1) {
 ; CHECK-LABEL: test_demote_1:
 ; CHECK: ; %bb.0: ; %main_body
 ; CHECK-NEXT: s_mov_b64 s[2:3], exec
+; CHECK-NEXT: s_wqm_b64 exec, exec
 ; CHECK-NEXT: v_mov_b32_e32 v0, s0
 ; CHECK-NEXT: buffer_load_dword v0, v0, s[0:3], 0 idxen
 ; CHECK-NEXT: v_mov_b32_e32 v1, s1
 ; CHECK-NEXT: buffer_load_dword v1, v1, s[0:3], 0 idxen
 ; CHECK-NEXT: s_waitcnt vmcnt(1)
 ; CHECK-NEXT: v_cmp_le_f32_e32 vcc, 0, v0
 ; CHECK-NEXT: s_xor_b64 s[0:1], vcc, exec
 ; CHECK-NEXT: s_andn2_b64 s[2:3], s[2:3], s[0:1]
 ; CHECK-NEXT: s_cbranch_scc0 .LBB8_2
 ; CHECK-NEXT: ; %bb.1: ; %main_body
-; CHECK-NEXT: s_and_b64 exec, exec, s[2:3]
+; CHECK-NEXT: s_wqm_b64 s[0:1], s[2:3]
+; CHECK-NEXT: s_and_b64 exec, exec, s[0:1]
[Inline comment, nhaehnle: Can you pre-submit this test with the incorrect code sequence (ideally, update_llc_test_checks?) and then update this review so that the diff is more clearly visible?]
 ; CHECK-NEXT: s_waitcnt vmcnt(0)
 ; CHECK-NEXT: v_add_f32_e32 v0, v0, v1
 ; CHECK-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $exec killed $exec
+; CHECK-NEXT: s_and_b64 exec, exec, s[2:3]
 ; CHECK-NEXT: s_branch .LBB8_3
 ; CHECK-NEXT: .LBB8_2:
 ; CHECK-NEXT: s_mov_b64 exec, 0
 ; CHECK-NEXT: exp null off, off, off, off done vm
 ; CHECK-NEXT: s_endpgm
 ; CHECK-NEXT: .LBB8_3:
 main_body:
   %src0 = call float @llvm.amdgcn.struct.buffer.load.f32(<4 x i32> undef, i32 %idx0, i32 0, i32 0, i32 0)
[23 lines not shown]