I never completed the work on the patches referenced by
f8bf7d7f42f28fa18144091022236208e199f331, but the restriction was
intended to avoid folding immediate writes into m0, which the
coalescer doesn't understand very well. Relax this to allow simple
SGPR immediates to fold directly into VGPR copies. This pattern shows
up routinely in current GlobalISel code, since nothing is smart enough
to emit VGPR constants yet.
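The pattern being relaxed can be sketched in MIR. This is an illustrative fragment, not taken from the patch; register classes and opcodes are the usual AMDGPU ones:

```
; Before folding: GlobalISel materializes the constant in an SGPR,
; then copies it into a VGPR.
%0:sreg_32 = S_MOV_B32 42
%1:vgpr_32 = COPY %0:sreg_32

; After folding the immediate directly into the VGPR copy, the copy
; becomes a VGPR move of the constant, and %0 may become dead:
%1:vgpr_32 = V_MOV_B32_e32 42
```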
I do not think this can work with physregs given the CFG.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

Line 654: Source must be strictly scalar. You can have a copy of vector to vector (say a to v) and it can be initialized with a different mask. In fact, how can we be sure even an SGPR was not initialized in a different CFG path? This shall never happen to a virtual register in SSA, but can easily happen to a physreg.

Line 658: This does fit into a single line.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

Line 654: This is folding a constant. The mask doesn't matter. This is also not analyzing physreg defs; it's searching the uses of an SSA virtual register.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

Line 654: Ah, right! It is the destination that is physical, not the source.
It can probably create worse code sometimes. Assume you have a constant which cannot be directly assigned to a register (v_accvgpr_write_b32 only accepts inline literals). If the source def does not become dead, folding will create an extra copy. You probably need some legality checks here.
But that is really orthogonal to what you are doing and is a separate change. It applies equally to virtual registers.
LGTM, just fix formatting at line 657.
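The worse-code scenario raised above can be sketched in MIR. This is an illustrative fragment under the stated assumption that the literal is not an inline constant and the SGPR def has another use:

```
; %0 has a second use, so it does not become dead after folding:
%0:sreg_32 = S_MOV_B32 305419896         ; 0x12345678, not an inline literal
%1:agpr_32 = COPY %0                     ; AGPR writes only take inline literals
%2:sreg_32 = S_ADD_U32 %0, %0

; Folding the literal into the AGPR write still needs an intermediate
; VGPR move, while the S_MOV remains live, i.e. one extra instruction:
%3:vgpr_32 = V_MOV_B32_e32 305419896
%1:agpr_32 = V_ACCVGPR_WRITE_B32 %3
```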