This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Turn on the DPP combiner by default
Closed, Public

Authored by vpykhtin on Dec 5 2018, 4:13 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

vpykhtin created this revision. Dec 5 2018, 4:13 AM
vpykhtin updated this revision to Diff 176811. Dec 5 2018, 6:59 AM
arsenm accepted this revision. Dec 5 2018, 7:13 AM

LGTM. Probably should remove the flag from the dpp_combine test

This revision is now accepted and ready to land. Dec 5 2018, 7:13 AM
This revision was automatically updated to reflect the committed changes.
hakzsam added a subscriber: hakzsam. Dec 6 2018, 3:09 AM

Hi,

This change breaks most of the subgroups tests with RADV (i.e. dEQP-VK.subgroups.arithmetic.*).

Any reason why you enabled it by default? It looks like it now triggers a new bug in the AMDGPU backend.

Thanks!

Hi,

this is an awaited feature, and another reason for enabling it by default was to detect situations such as yours. I think it's better to turn it off for now and reproduce the failures: how can I do that?

You'll need to build Mesa with RADV enabled against your LLVM, then run the Vulkan CTS against it, specifically dEQP-VK.subgroups.arithmetic.*. AMDVLK might also work, since it will emit similar code. I would suggest using the new atomic optimizer code in LLVM, but it seems it's already broken: it uses llvm.amdgcn.mov.dpp instead of llvm.amdgcn.update.dpp, which means the lanes that aren't written are undefined, so if you get unlucky in register allocation it won't work, regardless of whether your pass runs or not.
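
For reference, here is a minimal LLVM IR sketch of the difference just mentioned (the function names and the row_shr:1 control value are made up for illustration): llvm.amdgcn.mov.dpp leaves the lanes that are not written undefined, while llvm.amdgcn.update.dpp takes an explicit old value for them.

declare i32 @llvm.amdgcn.mov.dpp.i32(i32, i32, i32, i32, i1)
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1)

; Lanes that the DPP move does not write end up with an undefined value.
define i32 @mov_dpp_example(i32 %a) {
  %v = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %a, i32 273, i32 15, i32 15, i1 false)
  ret i32 %v
}

; Lanes that the DPP move does not write keep %old instead.
define i32 @update_dpp_example(i32 %old, i32 %a) {
  %v = call i32 @llvm.amdgcn.update.dpp.i32(i32 %old, i32 %a, i32 273, i32 15, i32 15, i1 false)
  ret i32 %v
}

Here 273 = 0x111 encodes row_shr:1, 15 = 0xf is the full row_mask/bank_mask, and the final i1 is bound_ctrl.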

I looked at your pass, and there are a few problems with it. I think we should revert this commit until the pass has been properly rewritten.

  • It doesn't check for the case where the DPP mov and its use are in different basic blocks with potentially different EXEC. This should be easy to fix.
  • It doesn't handle the non-full row_mask or bank_mask case. If either one doesn't have the default value of 0xf, then some lanes are not written, regardless of what bound_ctrl is set to. So if such a move feeds into, say, an addition, then the old value needs to be added for the disabled lanes, which blindly copying over the row_mask and bank_mask and setting old to undef won't do.
  • This is fundamentally not doing what frontends implementing reductions actually want. In practice, when you're using DPP to implement a wavefront reduction correctly, every DPP operation looks like this:
%tmp = V_MOV_B32_dpp identity, %a, ... ; from llvm.amdgcn.update.dpp
%out = op %tmp, %b

where "identity" is the identity for the operation (0 for add, -1 for unsigned max, etc.), which is the old value for the move. This can be optimized to

%out = op_dpp %b, %a, %b, ...

regardless of what the DPP flags are, as long as bound_ctrl:0 is not set. This is what we should actually be optimizing.
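
To make that pattern concrete at the IR level, here is a hedged sketch (hypothetical function name; the row_shr:1 step is arbitrary) of one step of a wavefront add reduction as a frontend would emit it: the old operand of llvm.amdgcn.update.dpp is the operation's identity (0 for add), and the whole sequence is what should be combined into a single DPP add whose old/tied destination operand is %b.

declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1)

define i32 @add_reduction_step(i32 %a, i32 %b) {
  ; old = 0 (identity for add), dpp_ctrl = 0x111 (row_shr:1),
  ; row_mask = bank_mask = 0xf, bound_ctrl off.
  %tmp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %a, i32 273, i32 15, i32 15, i1 false)
  %out = add i32 %tmp, %b
  ret i32 %out
}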

Ok, I'll disable it. I'm not sure about the 3rd point: are you saying the pass doesn't actually perform the optimization, or that it's fundamentally wrong? It is implemented to handle the "identity" cases for add, mul, and min/max.

cwabbott added a comment (edited). Dec 6 2018, 6:28 AM

Ah right, you are handling it in foldOldOpnd. But it seems you're missing and, or, and add. We can't rely on bound_ctrl:0, even if OldOpndValue is zero, so we should always try to fold the old operand first if possible.
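
To illustrate why folding the old operand matters for those operations, here is a hedged sketch (hypothetical function name) of an AND reduction step: the identity is -1 rather than 0, so the zero that bound_ctrl:0 supplies for reads from invalid lanes is not a substitute for folding the old operand into the combined op.

declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1)

define i32 @and_reduction_step(i32 %a, i32 %b) {
  ; old = -1 (identity for and). Combining this with the and below is only
  ; correct if %b becomes the old/tied destination operand of the DPP and,
  ; not if bound_ctrl:0 is relied on, since that supplies 0 rather than -1.
  %tmp = call i32 @llvm.amdgcn.update.dpp.i32(i32 -1, i32 %a, i32 273, i32 15, i32 15, i1 false)
  %out = and i32 %tmp, %b
  ret i32 %out
}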

Ok, thank you very much for the review and the explanation; I'll try to address it shortly.