This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Lower CopyToReg into SGPR explicitly to avoid illegal vgpr to sgpr copy
AbandonedPublic

Authored by JonChesterfield on Dec 12 2022, 9:06 AM.

Details

Summary

Works around regression in D131246 to unblock LDS lowering in D139433.

The bug there was ISel using a single i32 constant node as the argument both
to a node that wants it in a vgpr and to another that wants it in a sgpr. If
the lowering puts it in a vgpr and SIFixSGPRCopies fails to handle it, as is
presently the case, we get an error and a miscompile.

This is very much a point fix. If we need to copy into a sgpr, it's going to
need an instruction to do so (unless the sgpr happens to have the right value
in it already, which we could catch as a peephole somewhere in MIR if we wish)
so selecting the s_mov_b32 immediately doesn't cost anything. It only looks
for i32 values as I believe that is the type we use for sgpr copies.
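The decision described above can be sketched as a small standalone model. This is a hypothetical illustration, not the real LLVM API: the point is that when the destination is a sgpr and the source is an i32 constant, selecting an s_mov_b32 with the immediate up front sidesteps the illegal vgpr-to-sgpr copy entirely.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical model of the point fix (names are illustrative, not LLVM's):
// a CopyToReg into an SGPR whose source is an i32 constant selects an
// S_MOV_B32 with the immediate, instead of reusing a value that a shared
// (CSE'd) node may already have materialized in a VGPR.
enum class RegBank { SGPR, VGPR };

struct SourceValue {
  bool IsConstant;          // true if the node is an i32 constant
  std::optional<int> Imm;   // the constant value, when IsConstant
  RegBank MaterializedIn;   // where the shared copy already lives
};

// Returns the instruction we would select for a copy into DestBank.
std::string selectCopy(const SourceValue &Src, RegBank DestBank) {
  if (DestBank == RegBank::SGPR && Src.IsConstant)
    return "s_mov_b32 sN, " + std::to_string(*Src.Imm); // fresh scalar mov
  if (DestBank == RegBank::SGPR && Src.MaterializedIn == RegBank::VGPR)
    return "<illegal vgpr-to-sgpr copy>"; // the case the bug hits today
  return "copy"; // same-bank copies are always legal
}
```

With this model, a constant that was materialized in a vgpr still gets a fresh scalar mov when a sgpr needs it, while a non-constant vgpr value would still hit the illegal-copy path the review discusses.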

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptDec 12 2022, 9:06 AM
JonChesterfield requested review of this revision.Dec 12 2022, 9:06 AM

Codegen for that test case without this patch applied emits the diagnostic error: <unknown>:0:0: in function kern void (ptr): illegal SGPR to VGPR copy, which is also recorded as a comment in the asm.

kern:                                   ; @kern
; %bb.0:
	s_mov_b32 s32, 0
	s_add_u32 s12, s12, s17
	s_addc_u32 s13, s13, 0
	s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s12
	s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s13
	s_add_u32 s0, s0, s17
	s_addc_u32 s1, s1, 0
	v_writelane_b32 v40, s16, 0
	s_mov_b32 s13, s15
	s_mov_b32 s12, s14
	v_readlane_b32 s14, v40, 0
	s_mov_b64 s[16:17], s[8:9]
	v_mov_b32_e32 v3, v2
	v_mov_b32_e32 v2, v1
	v_mov_b32_e32 v1, v0
	s_load_dwordx2 s[8:9], s[16:17], 0x0
	v_mov_b32_e32 v0, 42
	s_waitcnt lgkmcnt(0)
	v_mov_b32_e32 v4, s8
	v_mov_b32_e32 v5, s9
	flat_store_dword v[4:5], v0
	s_mov_b64 s[18:19], 8
	s_mov_b32 s8, s16
	s_mov_b32 s9, s17
	s_mov_b32 s16, s18
	s_mov_b32 s15, s19
	s_add_u32 s8, s8, s16
	s_addc_u32 s15, s9, s15
        ; kill: def $sgpr8 killed $sgpr8 def $sgpr8_sgpr9
	s_mov_b32 s9, s15
	s_getpc_b64 s[16:17]
	s_add_u32 s16, s16, unknown_call@gotpcrel32@lo+4
	s_addc_u32 s17, s17, unknown_call@gotpcrel32@hi+12
	s_load_dwordx2 s[16:17], s[16:17], 0x0
	s_mov_b64 s[22:23], s[2:3]
	s_mov_b64 s[20:21], s[0:1]
	s_mov_b32 s15, 20
	v_lshlrev_b32_e64 v3, s15, v3
	s_mov_b32 s15, 10
	v_lshlrev_b32_e64 v2, s15, v2
	v_or3_b32 v31, v1, v2, v3
	 ; illegal copy v0 to s15
	s_mov_b64 s[0:1], s[20:21]
	s_mov_b64 s[2:3], s[22:23]
	s_waitcnt lgkmcnt(0)
	s_swappc_b64 s[30:31], s[16:17]
	s_endpgm
arsenm added inline comments.Dec 12 2022, 9:22 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11916–11918

Can you teach isVGPRImm to deal with this?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11916–11918

I don't think so.

The failing case here is where a single constant node is needed in a vgpr and in a sgpr. isVGPRImm would need to return true for one use and false for the other, but it is passed the single node that was CSE'd together for both.

isVGPRImm will currently return false on isSGPRClass. Guessing slightly, the idea behind isVGPRImm is to prefer to materialise in a vgpr directly? Seems reasonable if so, but interacts badly with having no general means of copying from vgpr to sgpr.

I would guess the general case involves a patch to SIFixSGPRCopies.
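The CSE constraint described above can be modelled in a few lines. This is a hypothetical sketch, not the LLVM API: after CSE there is one constant node, both users point at it, and any node-level predicate like isVGPRImm can only give one answer for both.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model (not LLVM's types) of a CSE'd constant node shared by
// users that want it in different register banks.
enum class RegBank { SGPR, VGPR };

struct ConstantNode {
  int Value;
  std::vector<RegBank> UserBanks; // register bank each user wants
};

// A node-level predicate gives one answer for the whole node, so it must
// be conservative: any SGPR user forces false. Per the review, false is
// always a correct answer; the value can be rematerialized in a VGPR later.
bool isVGPRImm(const ConstantNode &N) {
  for (RegBank B : N.UserBanks)
    if (B == RegBank::SGPR)
      return false;
  return true;
}
```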

JonChesterfield marked an inline comment as done.Dec 12 2022, 9:56 AM
arsenm added inline comments.Dec 12 2022, 10:15 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11916–11918

Correct. False is always a correct answer. Copy and rematerialize in VGPR later is easy

Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

> Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

What ensures this copy is uniform?

Even if it is uniform, is there a reason to expect readfirstlane to be better than a mov immediate?

> Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

> What ensures this copy is uniform?

The divergence-driven ISel does.
The call argument would be VGPR if it is divergent.

> Even if it is uniform, is there a reason to expect readfirstlane to be better than a mov immediate?

Nevertheless, I would agree with you that we only need to allow a VGPR to physical SGPR copy if the VGPR is defined by a constant copy.
For now, it is the easiest way to fix the issue which blocks you. The common case (whether it is okay to "readfirstlane" a general VGPR to a physical SGPR) needs further discussion.
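The readfirstlane idea in this exchange can be illustrated with a toy wave model (assuming a 4-lane wave for brevity; real waves are 32 or 64 lanes). v_readfirstlane_b32 copies lane 0 of a vector register into a scalar register, which is only a faithful vgpr-to-sgpr copy when every lane holds the same value, hence the question about uniformity.

```cpp
#include <array>
#include <cassert>

// Toy model: a VGPR holds one value per lane of the wave.
constexpr int Lanes = 4;
using VGPR = std::array<int, Lanes>;

// v_readfirstlane_b32 reads lane 0 into a scalar register.
int v_readfirstlane_b32(const VGPR &V) { return V[0]; }

// A VGPR value is uniform when all active lanes agree; only then does
// readfirstlane recover the value every lane holds.
bool isUniform(const VGPR &V) {
  for (int L : V)
    if (L != V[0])
      return false;
  return true;
}
```

For a uniform value, readfirstlane yields the shared scalar; for a divergent value it silently returns lane 0's value, discarding the rest, which is why it is only valid where ISel guarantees uniformity.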

> Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

> What ensures this copy is uniform?

> The divergence-driven ISel does.
> The call argument would be VGPR if it is divergent.

The call argument here isn't a vgpr - it's part of the calling convention for passing implicit parameters around, so it's explicitly a sgpr. We could have defined it to be a vgpr containing a uniform value instead, but as far as I can tell that would be strictly worse.

What we have at present is that a constant node gets lowered to an instruction that leaves the result in a vgpr, and then a miscompilation. We could indeed lower the constant to a vgpr, notice that we've done so, and insert a readfirstlane to fix it up instead of erroring; that mostly means updating something downstream of ISel to know that a given vector register is uniform based on the instruction it came from.

I'm not totally convinced by our compiler emitting invalid code and then trying to fix it up later. This is a case where we can emit something correct up front, without divergence analysis or regressions. It's also associated with function calls, which are quite high complexity as a baseline.

I think the right fix here is

  • have globalisel model vector and scalar register files as separate things (if it doesn't already)
  • teach regalloc about vector and scalar register files as most of the fixup pass seems to be trying to undo things regalloc did wrong
  • delete sdag

The patch in D139874 fixes up the invalid copy downstream, but is likely to become more complicated than this one in the process. One approach would be to land this and then, once D139874 is completed and committed, revert this ISel change and update the test if necessary.

JonChesterfield abandoned this revision.Dec 16 2022, 1:44 PM

Alex's patch landed; abandoning this one.