This is an archive of the discontinued LLVM Phabricator instance.

Differential D20839

AMDGPU: Add amdgpu-ps-wqm-outputs function attributes
ClosedPublic

Authored by nhaehnle on May 31 2016, 3:16 PM.


Details

Reviewers

tstellarAMD
arsenm
mareko

Commits

rGc00e03b8f57c: AMDGPU: Add amdgpu-ps-wqm-outputs function attributes
rL272063: AMDGPU: Add amdgpu-ps-wqm-outputs function attributes

Summary

The presence of this attribute indicates that VGPR outputs should be computed
in whole quad mode. This will be used by Mesa for prolog pixel shaders, so
that derivatives can be taken of shader inputs computed by the prolog, fixing
a bug.
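
To illustrate why the prolog outputs need whole quad mode, here is a minimal sketch, not taken from the patch. It models one 2x2 pixel quad in which only the top-left pixel is covered. The names `Quad`, `computePrologOutput`, and `ddx` are hypothetical, and the "multiply by two" stands in for whatever the prolog really computes. In exact mode the helper lanes keep stale values, so a derivative taken later in the main part comes out wrong.

```cpp
#include <array>
#include <cassert>  // only needed for the assertions in the usage note
#include <cstdint>

// One 2x2 pixel quad; lane 0 = top-left, 1 = top-right,
// 2 = bottom-left, 3 = bottom-right.
using Quad = std::array<float, 4>;

// Hypothetical prolog: writes In[lane] * 2 for every lane enabled in
// ExecMask. Disabled lanes keep the stale value that was in the register.
Quad computePrologOutput(const Quad &In, std::uint8_t ExecMask, float Stale) {
  Quad Out;
  Out.fill(Stale);
  for (unsigned Lane = 0; Lane < 4; ++Lane)
    if (ExecMask & (1u << Lane))
      Out[Lane] = In[Lane] * 2.0f;
  return Out;
}

// Horizontal derivative as seen from the top-left pixel: it reads the
// neighbouring lane, so that lane must hold a valid value.
float ddx(const Quad &V) { return V[1] - V[0]; }
```

With inputs `{1, 2, 3, 4}` and a stale value of `0`, running the prolog with mask `0xF` (all four lanes, as in WQM) gives a derivative of `2`, while mask `0x1` (only the covered pixel) gives `-2`, which is the kind of error the attribute is meant to avoid.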

The generated code could certainly be improved: if a prolog pixel shader is
used (which isn't common in modern OpenGL - they're used for gl_Color, polygon
stipples, and forcing per-sample interpolation), Mesa will use this attribute
unconditionally, because it has to be conservative. So WQM may be used in the
prolog when it isn't really needed, and furthermore a silly back-and-forth
switch is likely to happen at the boundary between prolog and main shader
parts.

Fixing this is a bit involved: we'd first have to add a mechanism by which
LLVM writes the WQM-related input requirements to the main shader part binary,
and then Mesa specializes the prolog part accordingly. At that point, we may
as well just compile a monolithic shader...

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=95130

Diff Detail

Repository: rL LLVM

Event Timeline

nhaehnle updated this revision to Diff 59138. May 31 2016, 3:16 PM

nhaehnle retitled this revision to AMDGPU: Add amdgpu-ps-wqm-outputs function attributes.

nhaehnle updated this object.

nhaehnle added reviewers: arsenm, tstellarAMD, mareko.

nhaehnle added a subscriber: llvm-commits.

Herald added subscribers: kzhuravl, arsenm. May 31 2016, 3:16 PM

I don't understand the comment.

If a pixel shader doesn't use DDX/DDY and sample/gather, WQM can be disabled in the prolog. tgsi_scan can get that information.

If a pixel shader uses those, WQM can be enabled at the beginning of the prolog and disabled at the end, so that the main part starts with the original EXEC mask. This adds 2 unnecessary SALU (s_mov_b64, s_wqm_b64) at the prolog->main boundary, but I don't think that matters much. Prologs are usually empty with the GL core profile anyway.
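
The save, widen, and restore sequence discussed here (and checked by the `s_mov_b64` / `s_wqm_b64` / `s_and_b64` lines of `test_prolog_1` in the diff) can be modelled on a plain 64-bit integer. This is only a sketch: `wholeQuadMode`, `PrologMasks`, and `prologExecMasks` are illustrative names, not LLVM or Mesa APIs, and the model assumes the usual WQM semantics where each group of four lanes is fully enabled if any lane in it is live.

```cpp
#include <cassert>  // only needed for the assertions in the usage note
#include <cstdint>

// Model of s_wqm_b64: for every quad (group of 4 lanes), enable all four
// lanes if at least one of them is live.
std::uint64_t wholeQuadMode(std::uint64_t Exec) {
  std::uint64_t Result = 0;
  for (unsigned Quad = 0; Quad < 16; ++Quad) {
    const std::uint64_t QuadMask = UINT64_C(0xF) << (4 * Quad);
    if (Exec & QuadMask)
      Result |= QuadMask;
  }
  return Result;
}

// Exec masks seen by a hypothetical prolog that runs in WQM and hands the
// original mask back to the main shader part.
struct PrologMasks {
  std::uint64_t DuringProlog; // exec while the prolog outputs are computed
  std::uint64_t AtExit;       // exec at the prolog->main boundary
};

PrologMasks prologExecMasks(std::uint64_t Exec) {
  const std::uint64_t Orig = Exec;               // s_mov_b64 ORIG, exec
  const std::uint64_t Wqm = wholeQuadMode(Exec); // s_wqm_b64 exec, exec
  return {Wqm, Wqm & Orig};                      // s_and_b64 exec, exec, ORIG
}
```

For example, with lanes 0 and 5 live (`0x21`), the prolog runs with `0xFF` and exits with `0x21` again, so the main part starts from the original EXEC mask.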

arsenm added inline comments. May 31 2016, 6:19 PM

lib/Target/AMDGPU/SIWholeQuadMode.cpp
172 (On Diff #59138): Looking for specific copies is usually concerning

Instead of looking for copies, look for defs of physical VGPRs. Since this
is before register allocation, the only physical VGPRs should be shader
inputs and outputs, and we never overwrite the inputs, so that should do the
right thing.
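
The rule described above can be sketched with simplified stand-in types. `RegKind`, `Operand`, `Instr`, and `definesShaderOutput` are hypothetical and much simpler than LLVM's `MachineOperand` and `MachineInstr`; the sketch only shows the classification logic, under the stated assumption that before register allocation a defined physical VGPR can only be a shader output.

```cpp
#include <cassert>  // only needed for the assertions in the usage note
#include <vector>

// Hypothetical register classification; the real pass queries the target
// register info instead.
enum class RegKind { VirtualVGPR, VirtualSGPR, PhysicalVGPR, PhysicalSGPR };

// Simplified stand-in for a machine operand.
struct Operand {
  bool IsReg;   // false for immediates and other non-register operands
  RegKind Kind; // only meaningful when IsReg is true
};

// Simplified stand-in for a machine instruction: just its defined operands.
struct Instr {
  std::vector<Operand> Defs;
};

// An instruction seeds a WQM requirement if it defines a physical VGPR.
// Inputs are only ever used, never defined, so any such def is an output.
bool definesShaderOutput(const Instr &MI) {
  for (const Operand &MO : MI.Defs) {
    if (!MO.IsReg)
      continue;
    if (MO.Kind == RegKind::PhysicalVGPR)
      return true;
  }
  return false;
}
```

An instruction whose only def is a virtual VGPR or a physical SGPR is not flagged, so ordinary intermediate computations do not force WQM on their own.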

Marek, the point of the comment is that even when the main part uses
derivatives, it's probably not very common to use derivatives of gl_Color at
least, and in that case the prolog still wouldn't need WQM (per-sample
interpolants are different).

But as you say, PS prologs aren't very common, the additional number of
instructions is small, and memory accesses are a non-issue - that's why I
didn't bother to do anything more involved.

I think we also want to run in WQM when changing the interpolation weights in the prolog. For now, that's just for force_persample_interp, which is pretty rare, but when we start to use bc_optimize, we'll run into WQM issues more often. Sadly I didn't realize this when I was working on the prolog. From mesa - si_shader.h:

/* TODO:
 * - add force_center_interp if MSAA is disabled and centroid or
 *   sample are present
 * - add force_center_interp_bc_optimize to force center interpolation
 *   based on the bc_optimize SGPR bit if MSAA is enabled, centroid is
 *   present and sample isn't present.
 */

Since the decision when to use WQM is up to Mesa, are there any other comments on this LLVM patch?

In D20839#446982, @nhaehnle wrote:

Since the decision when to use WQM is up to Mesa, are there any other comments on this LLVM patch?

Not from me.

Closed by commit rL272063: AMDGPU: Add amdgpu-ps-wqm-outputs function attributes (authored by nha). Jun 7 2016, 2:44 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/SIWholeQuadMode.cpp	29 lines
llvm/trunk/test/CodeGen/AMDGPU/wqm.ll	14 lines

Diff 59955

llvm/trunk/lib/Target/AMDGPU/SIWholeQuadMode.cpp

 FunctionPass *llvm::createSIWholeQuadModePass() {
   return new SIWholeQuadMode;
 }

 // Scan instructions to determine which ones require an Exact execmask and
 // which ones seed WQM requirements.
 char SIWholeQuadMode::scanInstructions(MachineFunction &MF,
                                        std::vector<WorkItem> &Worklist) {
   char GlobalFlags = 0;
+  bool WQMOutputs = MF.getFunction()->hasFnAttribute("amdgpu-ps-wqm-outputs");

   for (auto BI = MF.begin(), BE = MF.end(); BI != BE; ++BI) {
     MachineBasicBlock &MBB = *BI;

     for (auto II = MBB.begin(), IE = MBB.end(); II != IE; ++II) {
       MachineInstr &MI = *II;
       unsigned Opcode = MI.getOpcode();
-      char Flags;
+      char Flags = 0;

       if (TII->isWQM(Opcode) || TII->isDS(Opcode)) {
         Flags = StateWQM;
       } else if (TII->get(Opcode).mayStore() &&
                  (MI.getDesc().TSFlags & SIInstrFlags::VM_CNT)) {
         Flags = StateExact;
       } else {
         // Handle export instructions with the exec mask valid flag set
         if (Opcode == AMDGPU::EXP) {
           if (MI.getOperand(4).getImm() != 0)
             ExecExports.push_back(&MI);
         } else if (Opcode == AMDGPU::SI_PS_LIVE) {
           LiveMaskQueries.push_back(&MI);
+        } else if (WQMOutputs) {
+          // The function is in machine SSA form, which means that physical
+          // VGPRs correspond to shader inputs and outputs. Inputs are
+          // only used, outputs are only defined.
+          for (const MachineOperand &MO : MI.defs()) {
+            if (!MO.isReg())
+              continue;
+
+            unsigned Reg = MO.getReg();
+
+            if (!TRI->isVirtualRegister(Reg) &&
+                TRI->hasVGPRs(TRI->getPhysRegClass(Reg))) {
+              Flags = StateWQM;
+              break;
+            }
+          }
         }

-        continue;
+        if (!Flags)
+          continue;
       }

       Instructions[&MI].Needs = Flags;
       Worklist.push_back(&MI);
       GlobalFlags |= Flags;
     }
+
+    if (WQMOutputs && MBB.succ_empty()) {
+      // This is a prolog shader. Make sure we go back to exact mode at the end.
+      Blocks[&MBB].OutNeeds = StateExact;
+      Worklist.push_back(&MBB);
+      GlobalFlags |= StateExact;
+    }
   }

   return GlobalFlags;
 }

 void SIWholeQuadMode::propagateInstruction(const MachineInstr &MI,
                                            std::vector<WorkItem>& Worklist) {
   const MachineBasicBlock &MBB = *MI.getParent();

llvm/trunk/test/CodeGen/AMDGPU/wqm.ll

 main_body:
   %gep = getelementptr float, float addrspace(1)* %ptr, i32 %idx
   store float %data, float addrspace(1)* %gep

   call void @llvm.AMDGPU.kill(float %z)

   ret <4 x float> %tex
 }

+; Check prolog shaders.
+;
+; CHECK-LABEL: {{^}}test_prolog_1:
+; CHECK: s_mov_b64 [[ORIG:s\[[0-9]+:[0-9]+\]]], exec
+; CHECK: s_wqm_b64 exec, exec
+; CHECK: v_add_f32_e32 v0,
+; CHECK: s_and_b64 exec, exec, [[ORIG]]
+define amdgpu_ps float @test_prolog_1(float %a, float %b) #4 {
+main_body:
+  %s = fadd float %a, %b
+  ret float %s
+}
+
 declare void @llvm.amdgcn.image.store.v4i32(<4 x float>, <4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #1

 declare <4 x float> @llvm.amdgcn.image.load.v4i32(<4 x i32>, <8 x i32>, i32, i1, i1, i1, i1) #2

 declare <4 x float> @llvm.SI.image.sample.i32(i32, <8 x i32>, <4 x i32>, i32, i32, i32, i32, i32, i32, i32, i32) #3
 declare <4 x float> @llvm.SI.image.sample.v4i32(<4 x i32>, <8 x i32>, <4 x i32>, i32, i32, i32, i32, i32, i32, i32, i32) #3

 declare void @llvm.AMDGPU.kill(float)
 declare void @llvm.SI.export(i32, i32, i32, i32, i32, float, float, float, float)

 attributes #1 = { nounwind }
 attributes #2 = { nounwind readonly }
 attributes #3 = { nounwind readnone }
+attributes #4 = { "amdgpu-ps-wqm-outputs" }