This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Remove hasOneUse check from scalar select pattern
AbandonedPublic

Authored by foad on Apr 26 2022, 5:35 AM.

Details

Diff Detail

Event Timeline

foad created this revision.Apr 26 2022, 5:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2022, 5:35 AM
foad requested review of this revision.Apr 26 2022, 5:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2022, 5:35 AM
foad added inline comments.Apr 26 2022, 5:36 AM
llvm/lib/Target/AMDGPU/SOPInstructions.td
512

The hasOneUse check has always been there, since D81925 introduced SelectPat, but I don't see any reason for it.

arsenm accepted this revision.Apr 26 2022, 5:49 AM
This revision is now accepted and ready to land.Apr 26 2022, 5:49 AM
piotr added a comment.Apr 26 2022, 9:19 AM

I'd love this to go in, but when I added the hasOneUse() check it was certainly needed. If my old notes serve me well there was a crash in ctlz.ll test and I concluded this check was needed to avoid some shenanigans in the si-fix-sgpr-copies. I need to double check if the issue has been fixed or is just hidden.

foad added a comment.Apr 26 2022, 9:38 AM

I'd love this to go in, but when I added the hasOneUse() check it was certainly needed. If my old notes serve me well there was a crash in ctlz.ll test and I concluded this check was needed to avoid some shenanigans in the si-fix-sgpr-copies. I need to double check if the issue has been fixed or is just hidden.

Since posting the patch I've found one out-of-tree case that fails to compile:

; RUN: llc -march=amdgcn -mcpu=gfx1030 < %s
define amdgpu_cs void @main() {
bb:
  %i = load i32, i32 addrspace(3)* null, align 16
  br label %bb1
bb1:
  %i2 = phi i32 [ 0, %bb ], [ %i9, %bb5 ]
  br label %bb3
bb3:
  %i4 = icmp eq i32 %i2, 0
  br i1 %i4, label %bb5, label %bb3
bb5:
  %i6 = icmp ult i32 0, %i
  %i7 = sext i1 %i6 to i32
  %i8 = add i32 %i7, 1
  %i9 = and i32 %i8, %i7
  br label %bb1
}

The DAG just before selection looks like this, with two uses of t23:

t0: ch = EntryToken
  t3: i32,ch = load<(load (s32) from `i32 addrspace(3)* null`, align 16, addrspace 3)> t0, Constant:i32<0>, undef:i32
t23: i1 = setcc t3, Constant:i32<0>, setne:ch
      t17: i32,i1 = subcarry Constant:i32<1>, Constant:i32<0>, t23
    t25: i32 = select t23, t17, Constant:i32<0>
  t11: ch = CopyToReg t0, Register:i32 %0, t25
  t13: ch = CopyToReg t0, Register:i32 %4, Constant:i32<0>
t14: ch = TokenFactor t11, t13

After selection it has introduced a CopyToReg from $scc which I guess we don't support, because it will use SCC_CLASS which is not allocatable:

t0: ch = EntryToken
t1: i32 = S_MOV_B32 TargetConstant:i32<0>
    t29: i32 = V_MOV_B32_e32 TargetConstant:i32<0>
  t3: i32,ch = DS_READ_B32_gfx9<Mem:(load (s32) from `i32 addrspace(3)* null`, align 16, addrspace 3)> t29, TargetConstant:i16<0>, TargetConstant:i1<0>, t0
t23: i1 = S_CMP_LG_U32 t3, t1
        t7: i32 = S_MOV_B32 TargetConstant:i32<1>
      t17: i32,i1 = S_SUB_CO_PSEUDO t7, t1, t23
      t27: ch,glue = CopyToReg t0, Register:i1 $scc, t23
    t25: i32 = S_CSELECT_B32 t17, t1, t27:1
  t11: ch = CopyToReg t0, Register:i32 %0, t25
  t13: ch = CopyToReg t0, Register:i32 %4, t1
t14: ch = TokenFactor t11, t13

Shortly after this it crashes in ScheduleDAGSDNodes::EmitPhysRegCopy with: MachineRegisterInfo.cpp:160: llvm::Register llvm::MachineRegisterInfo::createVirtualRegister(const llvm::TargetRegisterClass *, llvm::StringRef): Assertion 'RegClass->isAllocatable() && "Virtual register RegClass must be allocatable."' failed.

I'd love this to go in, but when I added the hasOneUse() check it was certainly needed. If my old notes serve me well there was a crash in ctlz.ll test and I concluded this check was needed to avoid some shenanigans in the si-fix-sgpr-copies. I need to double check if the issue has been fixed or is just hidden.

Since posting the patch I've found one out-of-tree case that fails to compile:

; RUN: llc -march=amdgcn -mcpu=gfx1030 < %s
define amdgpu_cs void @main() {
bb:
  %i = load i32, i32 addrspace(3)* null, align 16
  br label %bb1
bb1:
  %i2 = phi i32 [ 0, %bb ], [ %i9, %bb5 ]
  br label %bb3
bb3:
  %i4 = icmp eq i32 %i2, 0
  br i1 %i4, label %bb5, label %bb3
bb5:
  %i6 = icmp ult i32 0, %i
  %i7 = sext i1 %i6 to i32
  %i8 = add i32 %i7, 1
  %i9 = and i32 %i8, %i7
  br label %bb1
}

Can you recommit this test

Shortly after this it crashes in ScheduleDAGSDNodes::EmitPhysRegCopy with: MachineRegisterInfo.cpp:160: llvm::Register llvm::MachineRegisterInfo::createVirtualRegister(const llvm::TargetRegisterClass *, llvm::StringRef): Assertion 'RegClass->isAllocatable() && "Virtual register RegClass must be allocatable."' failed.

The should be able to use SReg_32 which we do handle

foad added a comment.Apr 29 2022, 7:15 AM

The should be able to use SReg_32 which we do handle

I can get that to work with a patch like this: https://reviews.llvm.org/differential/diff/426045/

I'm not too happy that I had to change InstrEmitter::EmitCopyFromReg. It makes me wonder if we should be handling uniform compare+select patterns much more like a flags-based CPU does, either by gluing the s_cmp to the s_cselect, or using ISD::SELECT_CC instead of ISD::SELECT in the first place, so that it is all in one DAG node.

Also the codegen is not particularly pretty, but maybe it can be cleaned up by tweaking SIFixSGPRCopies, which has for some reason converted s_cmp to v_cmp but not converted the following s_cselect to v_cndmask:

diff --git a/llvm/test/CodeGen/AMDGPU/setcc-multiple-use.ll b/llvm/test/CodeGen/AMDGPU/setcc-multiple-use.ll
index bff4c0c1533a..6c99d04f6410 100644
--- a/llvm/test/CodeGen/AMDGPU/setcc-multiple-use.ll
+++ b/llvm/test/CodeGen/AMDGPU/setcc-multiple-use.ll
@@ -18,7 +18,9 @@ define i32 @f() {
 ; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0, v0
 ; CHECK-NEXT:    s_cmpk_lg_u32 vcc_lo, 0x0
 ; CHECK-NEXT:    s_subb_u32 s4, 1, 0
-; CHECK-NEXT:    v_cndmask_b32_e64 v0, 0, s4, vcc_lo
+; CHECK-NEXT:    s_and_b32 s5, vcc_lo, exec_lo
+; CHECK-NEXT:    s_cselect_b32 s4, s4, 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s4
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 bb:
   %i = load i32, i32 addrspace(3)* null, align 16
piotr added a comment.EditedApr 29 2022, 8:40 AM

The biggest headache comes from the fact that during moveToVALU when S_CMP gets converted to V_CMP the users of SCC need to be handled properly otherwise you end up with a weird copy from SCC. I think this is handled right now by adding an extra copy from VCC to SCC to make the connection between V_CMP and S_CSELECT until it is time for the handling of S_CSELECT. This gets more tricky when there are more uses of SCC I guess.

Shortly after this it crashes in ScheduleDAGSDNodes::EmitPhysRegCopy with: MachineRegisterInfo.cpp:160: llvm::Register llvm::MachineRegisterInfo::createVirtualRegister(const llvm::TargetRegisterClass *, llvm::StringRef): Assertion 'RegClass->isAllocatable() && "Virtual register RegClass must be allocatable."' failed.

The should be able to use SReg_32 which we do handle

The should be able to use SReg_32 which we do handle

I can get that to work with a patch like this: https://reviews.llvm.org/differential/diff/426045/

I'm not too happy that I had to change InstrEmitter::EmitCopyFromReg. It makes me wonder if we should be handling uniform compare+select patterns much more like a flags-based CPU does, either by gluing the s_cmp to the s_cselect, or using ISD::SELECT_CC instead of ISD::SELECT in the first place, so that it is all in one DAG node.

I was thinking we would add an SReg_32_PlusSCC class to use for for conditions (or I guess we could just add SCC directly to SReg_32). GlobalISel directly emits copies between SReg_32 and SCC but I guess InstrEmitter's magic requires an allocatable class

foad updated this revision to Diff 426675.May 3 2022, 6:00 AM

Check all uses of the condition are selects.

foad added a comment.May 12 2022, 1:30 AM

Ping for the new version, which only allows multiple uses if they are all selects. I realise this is not ideal but it is strictly more permissive than the current code, which only allows a single use (which must be a select).

llvm/lib/Target/AMDGPU/SOPInstructions.td
515

This is to cope with other uses that have already been selected to CopyToReg, S_CSELECT_B32. I don't know if there's a better way to handle this.

piotr added a comment.EditedMay 13 2022, 1:00 AM

Looks good to me - the extra s_cselect's generated are worth the complexity arising from this patch.

Can you add a test (or point me to an existing one), where there are multiple uses of scc and some selects, but not all, are used by vector instructions so they would normally be transformed to v_cndmask during moveToValu?

foad added a comment.Aug 2 2022, 8:43 AM

Ping for the new version, which only allows multiple uses if they are all selects.

I think this restriction might not be necessary after D128681.

arsenm requested changes to this revision.Sep 16 2022, 9:32 AM

Ping for the new version, which only allows multiple uses if they are all selects.

I think this restriction might not be necessary after D128681.

Should try with D128681

This revision now requires changes to proceed.Sep 16 2022, 9:32 AM
foad abandoned this revision.Sep 20 2022, 3:00 AM

Superseded by D133593.