This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Lower CopyToReg into SGPR explicitly to avoid illegal vgpr to sgpr copy
AbandonedPublic

Authored by JonChesterfield on Dec 12 2022, 9:06 AM.

Details

Summary

Works around regression in D131246 to unblock LDS lowering in D139433.

The bug there was ISel using a single i32 constant node as the argument both
to a node that wants it in a vgpr and to another that wants it in a sgpr. If
the lowering puts it in a vgpr and SIFixSGPRCopies fails to handle it, as is
presently the case, we get an error and a miscompile.

This is very much a point fix. If we need to copy into a sgpr, it's going to
need an instruction to do so (unless the sgpr happens to have the right value
in it already, which we could catch as a peephole somewhere in MIR if we wish)
so selecting the s_mov_b32 immediately doesn't cost anything. It only looks
for i32 values as I believe that is the type we use for sgpr copies.
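The decision described above can be sketched as a small standalone model. This is a hypothetical illustration, not the real LLVM API: the point is that when the destination is a sgpr and the source is an i32 constant, selecting an s_mov_b32 with the immediate up front sidesteps the illegal vgpr-to-sgpr copy entirely.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical model of the point fix (names are illustrative, not LLVM's):
// a CopyToReg into an SGPR whose source is an i32 constant selects an
// S_MOV_B32 with the immediate, instead of reusing a value that a shared
// (CSE'd) node may already have materialized in a VGPR.
enum class RegBank { SGPR, VGPR };

struct SourceValue {
  bool IsConstant;          // true if the node is an i32 constant
  std::optional<int> Imm;   // the constant value, when IsConstant
  RegBank MaterializedIn;   // where the shared copy already lives
};

// Returns the instruction we would select for a copy into DestBank.
std::string selectCopy(const SourceValue &Src, RegBank DestBank) {
  if (DestBank == RegBank::SGPR && Src.IsConstant)
    return "s_mov_b32 sN, " + std::to_string(*Src.Imm); // fresh scalar mov
  if (DestBank == RegBank::SGPR && Src.MaterializedIn == RegBank::VGPR)
    return "<illegal vgpr-to-sgpr copy>"; // the case the bug hits today
  return "copy"; // same-bank copies are always legal
}
```

With this model, a constant that was materialized in a vgpr still gets a fresh scalar mov when a sgpr needs it, while a non-constant vgpr value would still hit the illegal-copy path the review discusses.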

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptDec 12 2022, 9:06 AM
JonChesterfield requested review of this revision.Dec 12 2022, 9:06 AM

Codegen for that test case without this patch applied emits the diagnostic error: <unknown>:0:0: in function kern void (ptr): illegal SGPR to VGPR copy, which is also recorded as a comment in the asm.

kern:                                   ; @kern
; %bb.0:
	s_mov_b32 s32, 0
	s_add_u32 s12, s12, s17
	s_addc_u32 s13, s13, 0
	s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s12
	s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s13
	s_add_u32 s0, s0, s17
	s_addc_u32 s1, s1, 0
	v_writelane_b32 v40, s16, 0
	s_mov_b32 s13, s15
	s_mov_b32 s12, s14
	v_readlane_b32 s14, v40, 0
	s_mov_b64 s[16:17], s[8:9]
	v_mov_b32_e32 v3, v2
	v_mov_b32_e32 v2, v1
	v_mov_b32_e32 v1, v0
	s_load_dwordx2 s[8:9], s[16:17], 0x0
	v_mov_b32_e32 v0, 42
	s_waitcnt lgkmcnt(0)
	v_mov_b32_e32 v4, s8
	v_mov_b32_e32 v5, s9
	flat_store_dword v[4:5], v0
	s_mov_b64 s[18:19], 8
	s_mov_b32 s8, s16
	s_mov_b32 s9, s17
	s_mov_b32 s16, s18
	s_mov_b32 s15, s19
	s_add_u32 s8, s8, s16
	s_addc_u32 s15, s9, s15
        ; kill: def $sgpr8 killed $sgpr8 def $sgpr8_sgpr9
	s_mov_b32 s9, s15
	s_getpc_b64 s[16:17]
	s_add_u32 s16, s16, unknown_call@gotpcrel32@lo+4
	s_addc_u32 s17, s17, unknown_call@gotpcrel32@hi+12
	s_load_dwordx2 s[16:17], s[16:17], 0x0
	s_mov_b64 s[22:23], s[2:3]
	s_mov_b64 s[20:21], s[0:1]
	s_mov_b32 s15, 20
	v_lshlrev_b32_e64 v3, s15, v3
	s_mov_b32 s15, 10
	v_lshlrev_b32_e64 v2, s15, v2
	v_or3_b32 v31, v1, v2, v3
	 ; illegal copy v0 to s15
	s_mov_b64 s[0:1], s[20:21]
	s_mov_b64 s[2:3], s[22:23]
	s_waitcnt lgkmcnt(0)
	s_swappc_b64 s[30:31], s[16:17]
	s_endpgm
arsenm added inline comments.Dec 12 2022, 9:22 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11916–11918

Can you teach isVGPRImm to deal with this?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11916–11918

I don't think so.

The failing case here is where a single constant node is needed in a vgpr and in a sgpr. isVGPRImm would need to return true for one use and false for the other, but it is passed the single node that was CSE'd together for both.

isVGPRImm will currently return false on isSGPRClass. Guessing slightly, the idea behind isVGPRImm is to prefer to materialise in a vgpr directly? Seems reasonable if so, but interacts badly with having no general means of copying from vgpr to sgpr.

I would guess the general case involves a patch to SIFixSGPRCopies.
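The CSE constraint described above can be modelled in a few lines. This is a hypothetical sketch, not the LLVM API: after CSE there is one constant node, both users point at it, and any node-level predicate like isVGPRImm can only give one answer for both.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model (not LLVM's types) of a CSE'd constant node shared by
// users that want it in different register banks.
enum class RegBank { SGPR, VGPR };

struct ConstantNode {
  int Value;
  std::vector<RegBank> UserBanks; // register bank each user wants
};

// A node-level predicate gives one answer for the whole node, so it must
// be conservative: any SGPR user forces false. Per the review, false is
// always a correct answer; the value can be rematerialized in a VGPR later.
bool isVGPRImm(const ConstantNode &N) {
  for (RegBank B : N.UserBanks)
    if (B == RegBank::SGPR)
      return false;
  return true;
}
```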

JonChesterfield marked an inline comment as done.Dec 12 2022, 9:56 AM
arsenm added inline comments.Dec 12 2022, 10:15 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11916–11918

Correct. False is always a correct answer. Copy and rematerialize in VGPR later is easy

Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

> Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

What ensures this copy is uniform?

Even if it is uniform, is there a reason to expect readfirstlane to be better than a mov immediate?

> Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

> What ensures this copy is uniform?

The divergence-driven ISel does.
The call argument would be VGPR if it is divergent.

> Even if it is uniform, is there a reason to expect readfirstlane to be better than a mov immediate?

Nevertheless, I would agree with you that we only need to allow a VGPR to physical SGPR copy if the VGPR is defined by a constant copy.
For now, it is the easiest way to fix the issue which blocks you. The common case (whether it is okay to "readfirstlane" a general VGPR to a physical SGPR) needs further discussion.
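The readfirstlane idea in this exchange can be illustrated with a toy wave model (assuming a 4-lane wave for brevity; real waves are 32 or 64 lanes). v_readfirstlane_b32 copies lane 0 of a vector register into a scalar register, which is only a faithful vgpr-to-sgpr copy when every lane holds the same value, hence the question about uniformity.

```cpp
#include <array>
#include <cassert>

// Toy model: a VGPR holds one value per lane of the wave.
constexpr int Lanes = 4;
using VGPR = std::array<int, Lanes>;

// v_readfirstlane_b32 reads lane 0 into a scalar register.
int v_readfirstlane_b32(const VGPR &V) { return V[0]; }

// A VGPR value is uniform when all active lanes agree; only then does
// readfirstlane recover the value every lane holds.
bool isUniform(const VGPR &V) {
  for (int L : V)
    if (L != V[0])
      return false;
  return true;
}
```

For a uniform value, readfirstlane yields the shared scalar; for a divergent value it silently returns lane 0's value, discarding the rest, which is why it is only valid where ISel guarantees uniformity.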

> Since we ensure all the VGPR to SGPR copies are uniform, we just need to V_READFIRSTLANE_B32 here.

> What ensures this copy is uniform?

> The divergence-driven ISel does.
> The call argument would be VGPR if it is divergent.

The call argument here isn't a vgpr - it's part of the calling convention for passing implicit parameters around, so it's explicitly a sgpr. We could have defined it to be a vgpr containing a uniform value instead, but as far as I can tell that would be strictly worse.

What we have at present is that a constant node gets lowered to an instruction that leaves the result in a vgpr, and then a miscompilation. We could indeed lower the constant to a vgpr, notice that we've done so, and insert a readfirstlane to fix it up instead of erroring; that mostly means updating something downstream of ISel to know that a given vector register is uniform based on the instruction it came from.

I'm not totally convinced by our compiler emitting invalid code and then trying to fix it up later. This is a case where we can emit something correct up front, without divergence analysis or regressions. It's also associated with function calls, which are quite high complexity as a baseline.

I think the right fix here is

  • have globalisel model vector and scalar register files as separate things (if it doesn't already)
  • teach regalloc about vector and scalar register files as most of the fixup pass seems to be trying to undo things regalloc did wrong
  • delete sdag

The patch in D139874 fixes up the invalid copy downstream, but is likely to become more complicated than this one in the process. One approach would be to land this and then, once D139874 is completed and committed, revert this ISel change and update the test if necessary.

JonChesterfield abandoned this revision.Dec 16 2022, 1:44 PM

Alex's patch landed; abandoning this one.