Page MenuHomePhabricator

[AMDGPU] Fix SDST operand of V_DIV_SCALE to always be VCC
Changes PlannedPublic

Authored by Pierre-vh on Aug 16 2022, 5:47 AM.

Details

Reviewers
None
Summary

We previously allowed any scalar register but that was incorrect, the only correct operand is VCC.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Non-working draft/refactor

Adding isAsmParserOnly solves it, but now it's InstrInfo that fails:

[build] error: Multiple matches found for `V_DIV_SCALE_F32_e64', for the relation `getMCOpcodeGen', row fields ["v_div_scale_f32_e64"], column `["0"]'
[build] CurInstr: V_DIV_SCALE_F32_w64_gfx6_gfx7
[build] MatchInstr: V_DIV_SCALE_F32_w32_gfx6_gfx7

I'm going to update the diff with my latest (non-working) draft. It's a bit messy so I'll look into cleaning it up further.

You should only define w32/w64 for subtargets which have it (i.e. gfx10+).

Pierre-vh updated this revision to Diff 455121.Aug 24 2022, 2:53 AM

Non-working Draft.

Adding isAsmParserOnly solves it, but now it's InstrInfo that fails:

[build] error: Multiple matches found for `V_DIV_SCALE_F32_e64', for the relation `getMCOpcodeGen', row fields ["v_div_scale_f32_e64"], column `["0"]'
[build] CurInstr: V_DIV_SCALE_F32_w64_gfx6_gfx7
[build] MatchInstr: V_DIV_SCALE_F32_w32_gfx6_gfx7

I'm going to update the diff with my latest (non-working) draft. It's a bit messy so I'll look into cleaning it up further.

You should only define w32/w64 for subtargets which have it (i.e. gfx10+).

Unfortunately I keep having the issue even after I removed the <gfx10 w32/w64 variants.

[build] error: Multiple matches found for `V_DIV_SCALE_F32_e64', for the relation `getMCOpcodeGen', row fields ["v_div_scale_f32_e64"], column `["6"]'
[build] V_DIV_SCALE_F32_w64_gfx10
[build] V_DIV_SCALE_F32_w32_gfx10

Maybe I need to create an InstAlias for wave32/wave64? Not sure how it works

Pierre-vh updated this revision to Diff 455514.Aug 25 2022, 2:35 AM

It compiles now, but it still does not print vcc_lo for wave32.ll and I'm not sure what else to try.
I spent a while comparing the detailed records of TableGen and found no major difference. I also tried
looking for example instructions that have this w32 variant (like v_cmp_lt) and found no special-casing anywhere in the codebase, so
i'm not sure what to do

Pierre-vh updated this revision to Diff 455519.Aug 25 2022, 3:09 AM

Removing VOP_Real from the w32/w64 variants.
Still not working

Pierre-vh updated this revision to Diff 455521.Aug 25 2022, 3:33 AM

Various cleanups
vcc_lo/w32 variant is still not taken into account.

So far, if I understand correctly, my changes only affect the asm parser, but not yet the printer, correct? That's why I'm not seeing vcc_lo in the output.
It seems like some changes are needed on that front but I'm not sure where/how as it looks like this is the first VOP3 instruction to need special casing to print vcc/vcc_lo?

Maybe in needsImpliedVcc?

I also get an assertion failure currently

[build] llvm-mc: ../include/llvm/MC/MCInst.h:81: int64_t llvm::MCOperand::getImm() const: Assertion `isImm() && "This is not an immediate"' failed.
[build] PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
[build] Stack dump:
[build] 0.	Program arguments: /home/pvanhout/work/trunk/llvm-project/llvm/build/bin/llvm-mc -arch=amdgcn -mcpu=gfx900 -show-encoding /home/pvanhout/work/trunk/llvm-project/llvm/test/MC/AMDGPU/gfx9_asm_vop3_e64.s
[build]  #0 0x0000555ed26ecdf3 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Support/Unix/Signals.inc:569:13
[build]  #1 0x0000555ed26eb310 llvm::sys::RunSignalHandlers() /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Support/Signals.cpp:104:18
[build]  #2 0x0000555ed26ed40f SignalHandler(int) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Support/Unix/Signals.inc:407:1
[build]  #3 0x00007f9cdcf8d420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
[build]  #4 0x00007f9cdca2000b raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1
[build]  #5 0x00007f9cdc9ff859 abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81:7
[build]  #6 0x00007f9cdc9ff729 get_sysdep_segment_value /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:509:8
[build]  #7 0x00007f9cdc9ff729 _nl_load_domain /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:970:34
[build]  #8 0x00007f9cdca10fd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6)
[build]  #9 0x0000555ed240aae5 (/home/pvanhout/work/trunk/llvm-project/llvm/build/bin/llvm-mc+0x812ae5)
[build] #10 0x0000555ed24079f9 llvm::AMDGPUInstPrinter::printInstruction(llvm::MCInst const*, unsigned long, llvm::MCSubtargetInfo const&, llvm::raw_ostream&) /home/pvanhout/work/trunk/llvm-project/llvm/build/lib/Target/AMDGPU/AMDGPUGenAsmWriter.inc:0:0
[build] #11 0x0000555ed2405659 llvm::AMDGPUInstPrinter::printInst(llvm::MCInst const*, unsigned long, llvm::StringRef, llvm::MCSubtargetInfo const&, llvm::raw_ostream&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp:54:3
[build] #12 0x0000555ed2618ab4 llvm::MCTargetStreamer::prettyPrintAsm(llvm::MCInstPrinter&, unsigned long, llvm::MCInst const&, llvm::MCSubtargetInfo const&, llvm::raw_ostream&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCStreamer.cpp:1065:1
[build] #13 0x0000555ed25d66cf llvm::SmallVectorBase<unsigned long>::size() const /home/pvanhout/work/trunk/llvm-project/llvm/build/../include/llvm/ADT/SmallVector.h:77:32
[build] #14 0x0000555ed25d66cf llvm::SmallString<128u>::str() const /home/pvanhout/work/trunk/llvm-project/llvm/build/../include/llvm/ADT/SmallString.h:260:64
[build] #15 0x0000555ed25d66cf llvm::SmallString<128u>::operator llvm::StringRef() const /home/pvanhout/work/trunk/llvm-project/llvm/build/../include/llvm/ADT/SmallString.h:270:39
[build] #16 0x0000555ed25d66cf (anonymous namespace)::MCAsmStreamer::emitInstruction(llvm::MCInst const&, llvm::MCSubtargetInfo const&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCAsmStreamer.cpp:2297:24
[build] #17 0x0000555ed227b8fc (anonymous namespace)::AMDGPUAsmParser::MatchAndEmitInstruction(llvm::SMLoc, unsigned int&, llvm::SmallVectorImpl<std::unique_ptr<llvm::MCParsedAsmOperand, std::default_delete<llvm::MCParsedAsmOperand>>>&, llvm::MCStreamer&, unsigned long&, bool) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp:0:9
[build] #18 0x0000555ed266d294 (anonymous namespace)::AsmParser::parseAndMatchAndEmitTargetInstruction((anonymous namespace)::ParseStatementInfo&, llvm::StringRef, llvm::AsmToken, llvm::SMLoc) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCParser/AsmParser.cpp:2387:27
[build] #19 0x0000555ed2661e60 (anonymous namespace)::AsmParser::parseStatement((anonymous namespace)::ParseStatementInfo&, llvm::MCAsmParserSemaCallback*) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCParser/AsmParser.cpp:2320:10
[build] #20 0x0000555ed265bbee (anonymous namespace)::AsmParser::Run(bool, bool) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCParser/AsmParser.cpp:1004:16
[build] #21 0x0000555ed2245247 AssembleInput(char const*, llvm::Target const*, llvm::SourceMgr&, llvm::MCContext&, llvm::MCStreamer&, llvm::MCAsmInfo&, llvm::MCSubtargetInfo&, llvm::MCInstrInfo&, llvm::MCTargetOptions const&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../tools/llvm-mc/llvm-mc.cpp:344:13
[build] #22 0x0000555ed224402c main /home/pvanhout/work/trunk/llvm-project/llvm/build/../tools/llvm-mc/llvm-mc.cpp:0:11
[build] #23 0x00007f9cdca01083 __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:342:3
[build] #24 0x0000555ed224214e _start (/home/pvanhout/work/trunk/llvm-project/llvm/build/bin/llvm-mc+0x64a14e)

So far, if I understand correctly, my changes only affect the asm parser, but not yet the printer, correct? That's why I'm not seeing vcc_lo in the output.
It seems like some changes are needed on that front but I'm not sure where/how as it looks like this is the first VOP3 instruction to need special casing to print vcc/vcc_lo?

Maybe in needsImpliedVcc?

Yes, something along these lines and call printDefaultVccOperand. There are similar cases handled there but not 100% the same.

Pierre-vh updated this revision to Diff 456362.Aug 29 2022, 8:54 AM
  • Fix Parser/Printer

There is still a test (gfx10_asm_vop3.s) that I'm not sure how to update (so I deleted the checks for now)
I'd say the encoding needs to be updated everywhere, but should I duplicate the tests for w32/w64 then?
Or just use GFX10 check instead of W32/W64?
Moreover, there is no other test in that file that has a wavesize-dependent vcc/vcc_lo register usage so it'd
seem strange to leave div_scale there unless there's a pattern I'm not seeing.

Do I just not check for the "operands are not valid for this GPU or mode" errors and use GFX10 check, then update
all encodings as needed?

I feel like since another test already checks "operands are not valid for this GPU or mode", leaving the instructions
to just use VCC (or do half vcc_lo/half vcc) + using GFX10 check should be enough. What do you think?

rampitec added inline comments.Aug 29 2022, 1:47 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1010 ↗(On Diff #456362)

TRI->getVCC().

1012 ↗(On Diff #456362)

Is this really needed?

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
936–950

Tabs.

961

MIB->getNextNode() looks suspicious.

962

TRI.getVCC().

llvm/test/CodeGen/AMDGPU/frem.ll
2302

What happened here?

llvm/test/MC/AMDGPU/gfx10_asm_vop3.s
6796

Replace s[0:1]/s0 with vcc/vcc_lo

llvm/test/MC/AMDGPU/gfx7_asm_vop3_e64.s
8815 ↗(On Diff #456362)

Encoding is wrong.

llvm/test/MC/AMDGPU/vop3.s
411

Ditto.

llvm/test/MC/AMDGPU/wave32.s
386–388

Please keep checks order so it is easy to see changes.

llvm/test/MC/AMDGPU/wave_any.s
201–202

Ditto.

Pierre-vh updated this revision to Diff 456614.EditedAug 30 2022, 4:47 AM
Pierre-vh marked 10 inline comments as done.

Fixing remaining issues.
I had to hack a bit the AsmParser to make it all fit together. I think InstAlias could be used to do the w32/w64 -> normal instruction translation automatically but I don't know how to use it properly.

Pierre-vh added inline comments.Aug 30 2022, 4:47 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1012 ↗(On Diff #456362)

Yes, otherwise a lot of tests fail with "Using an undefined physical register".

llvm/test/CodeGen/AMDGPU/frem.ll
2302

I think something is wrong with the way I replace SDValue(N, 1) in SelectDIV_SCALE but I'm not sure how to fix it.

The addLiveIn is suspicious (I can't find any ISelDAGtoDAG impl that also uses it) but without it, it crashes.

llvm/test/MC/AMDGPU/wave32.s
386–388

I didn't change the order. Do you mean the -ERR check should always be at the bottom? (e.g. change GFX1064 to GFX1032 instead of adding the -ERR suffix & checking the error message)

llvm/test/MC/AMDGPU/wave_any.s
201–202

Is the check order the issue or the encoding? Encoding looks good to me

llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s
147 ↗(On Diff #456614)

Not sure why this popped up, I don't really understand the message

Pierre-vh updated this revision to Diff 456623.Aug 30 2022, 5:19 AM

Rebase

Waiting for feedback on:

  • AsmParser: Do I need to move to a InstAlias? Would it work?
  • "Truncated display due to cycle limit": Not sure where to start looking with that one.
  • DAGISel impl: Seems like it makes codegen worse in somee cases but I haven't found a cleaner way to adress the problem yet.
foad added inline comments.Aug 30 2022, 5:54 AM
llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s
147 ↗(On Diff #456614)

Add --timeline-max-cycles=0 to the RUN line? All the other files in this directory have it.

Pierre-vh updated this revision to Diff 456644.Aug 30 2022, 6:16 AM

Adding --timeline-max-cycles=0

Pierre-vh marked an inline comment as done.Aug 30 2022, 6:16 AM

So for the weird codegen in frem.ll, the issue is from the addLiveIn call I think.
In fact my changes to the DAGISel aren't needed I believe, but in practice there are segfaults in the following case: When a DIV_FMAS uses the result of DIV_SCALE.
In such cases, the following happens:

t92: f32,i1 = DIV_SCALE nofpexcept t59, t57, t59

ISEL: Starting selection on root node: t102: f32 = DIV_FMAS nofpexcept t101, t97, t100, t92:1
ISEL: Starting pattern match
  Initial Opcode index to 456907
  TypeSwitch[f32] from 456914 to 456917
  Skipped scope entry (due to false predicate) at index 456919, continuing at 456953
Creating new node: t127: ch,glue = CopyToReg t0, Register:i1 $vcc_lo, t92:1
  Morphed node: t102: f32 = V_DIV_FMAS_F32_e64 nofpexcept TargetConstant:i32<0>, t101, TargetConstant:i32<0>, t97, TargetConstant:i32<0>, t100, TargetConstant:i1<0>, TargetConstant:i32<0>, t127:1
ISEL: Match complete!

Then the following MIR is emitted.

%23:vgpr_32 = nofpexcept V_DIV_SCALE_F32_e64 0, %16:vgpr_32, 0, %17:vgpr_32, 0, %16:vgpr_32, 0, 0, implicit-def $vcc, implicit $mode, implicit $exec
%24:sreg_32 = COPY $vcc
$vcc_lo = COPY %24:sreg_32
%29:vgpr_32 = nofpexcept V_DIV_FMAS_F32_e64 0, killed %28:vgpr_32, 0, %22:vgpr_32, 0, %27:vgpr_32, 0, 0, implicit $mode, implicit $vcc, implicit $exec

And SIInstrInfo.cpp asserts when lowering the copy:

real copy:   renamable $vcc_lo = COPY $vcc
llc: ../lib/Target/AMDGPU/SIInstrInfo.cpp:762: virtual void llvm::SIInstrInfo::copyPhysReg(llvm::MachineBasicBlock &, MachineBasicBlock::iterator, const llvm::DebugLoc &, llvm::MCRegister, llvm::MCRegister, bool) const: Assertion `AMDGPU::VGPR_32RegClass.contains(SrcReg)' failed.

I feel like that copy shouldn't be there in the first place, but it seems like the DAGISel is adding it automatically.
The pattern used to select the FMAS is:

class DivFmasPat<ValueType vt, Instruction inst, Register CondReg> : GCNPat<
  (AMDGPUdiv_fmas (vt (VOP3Mods vt:$src0, i32:$src0_modifiers)),
                  (vt (VOP3Mods vt:$src1, i32:$src1_modifiers)),
                  (vt (VOP3Mods vt:$src2, i32:$src2_modifiers)),
                  (i1 CondReg)),
  (inst $src0_modifiers, $src0, $src1_modifiers, $src1, $src2_modifiers, $src2)
>;

let WaveSizePredicate = isWave64 in {
def : DivFmasPat<f32, V_DIV_FMAS_F32_e64, VCC>;
def : DivFmasPat<f64, V_DIV_FMAS_F64_e64, VCC>;
}

let WaveSizePredicate = isWave32 in {
def : DivFmasPat<f32, V_DIV_FMAS_F32_e64, VCC_LO>;
def : DivFmasPat<f64, V_DIV_FMAS_F64_e64, VCC_LO>;
}

Does anyone know where the issue lies? I'm a bit stuck there at the moment.
Is it the pattern that shouldn't be using VCC_LO like that? Or a missing case in SIInstrInfo.cpp ?
Maybe the register allocator should be more careful and transform the COPY into an EXTRACT_SUBREG when allocating %24 to vcc?

So for the weird codegen in frem.ll, the issue is from the addLiveIn call I think.

I believe so too. You are defining the vcc, you should not need it live to define it. Much less live at a function's entry.

rampitec added inline comments.Aug 30 2022, 12:14 PM
llvm/test/CodeGen/AMDGPU/GlobalISel/constant-bus-restriction.ll
255

What is this and why is it needed?

llvm/test/MC/AMDGPU/wave32.s
386–388

It was wave32 test. Now it is wave64 test and wave32 test is below. They are swapped.

llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s
147 ↗(On Diff #456614)

Why did it change at all?

@nhaehnle, @foad sorry for the ping, but I see you've also contributed to DAGISel so I was wondering if you had any input regarding the issue highlighted above.
Do you think there's a problem in the DIV_FMAS pattern, SIInstrInfo or the DIV_SCALE lowering?

llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s
147 ↗(On Diff #456614)

I don't think it changed, just content was added to it. I think the previous person that updated it didn't add the --timeline-max-cycles=0 and instead removed the check line with the cycles limit.

I think I can revert all these and the test should still pass, it'll just cover less instructions?

Pierre-vh updated this revision to Diff 456968.Aug 31 2022, 8:17 AM
Pierre-vh marked 2 inline comments as done.

Fixing wave32.s, removing useless prefixes from constant-bus-restriction.ll

rampitec added inline comments.Aug 31 2022, 10:12 AM
llvm/test/MC/AMDGPU/wave_any.s
207

It seems the intent of the test was to check that both wave32 and wave64 versions are accepted with both attributes set. This is lost now.

llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s
147 ↗(On Diff #456614)

Then it does not belong to this patch.

Pierre-vh marked 2 inline comments as done.

Revert llvm-mca test changes + fix wave_any test

So I think I found a solution that may work. The idea is just to:

  • Remove the SDag changes
    • After digging deep, I found that the InstrEmitter is smart enough to infer that the returned i1 is the implicit def of the instruction. No changes needed so.
    • However, it incorrectly assigns VReg_1 for the register class of the VCC copy it emits in InstrEmitter. This can be fixed in a pass.
  • Add a small pass that runs before FixSGPRCopies that fixes the COPY of the implicit def for V_DIV_SCALE. Something like "FixVCCImpDefCopy". The pass itself is minimal and the logic boils down to:
// Iterate over all instructions, check opcode
      case AMDGPU::V_DIV_SCALE_F32_e64:
      case AMDGPU::V_DIV_SCALE_F64_e64: {
        TII->fixImplicitOperands(MI);

        // Check for COPY of VCC right after it and fix it too.
        auto NextI = std::next(I);
        if(NextI == MBB->end() || NextI->getOpcode() != AMDGPU::COPY ||
            NextI->getOperand(1).getReg() != AMDGPU::VCC) {
          break;
        }

        if(TII->isWave32()) {
          NextI->getOperand(1).setReg(AMDGPU::VCC_LO);
          MRI->setRegClass(NextI->getOperand(0).getReg(), &AMDGPU::SReg_32_XM0_XEXECRegClass);
        } else {
          MRI->setRegClass(NextI->getOperand(0).getReg(), &AMDGPU::SReg_64_XEXECRegClass);
        }
      }

@foad, @rampitec do you think it's an acceptable solution? I can get all tests to pass with a seemingly normal codegen with that.

Pierre-vh updated this revision to Diff 457250.Sep 1 2022, 6:28 AM

Add FixVCCImpDef pass as explained in previous comment

foad added a comment.Sep 1 2022, 6:28 AM

However, it incorrectly assigns VReg_1 for the register class of the VCC copy it emits in InstrEmitter.

Why is VReg_1 wrong? In some places that is the class we use for an SGPR (or SGPR pair) that represents an independent 1-bit value per lane. Why doesn't that work here?

However, it incorrectly assigns VReg_1 for the register class of the VCC copy it emits in InstrEmitter.

Why is VReg_1 wrong? In some places that is the class we use for an SGPR (or SGPR pair) that represents an independent 1-bit value per lane. Why doesn't that work here?

It indeed seems to be used to represent a i1 when the value is divergent. Though I'm not sure why, but if I don't correct the regclass to an SReg, the LowerI1Copies pass cannot handle it.

Perhaps the issue lies in LowerI1Copies? I don't understand that pass well enough to know if the assertion is legitimate, or a sign of an unhandled case.
It seems like, to me, it shouldn't be lowering the VCC copy to a specific instruction but rather only lower it when it's actually needed (when the value needs to be preserved).

[build] llc: llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp:713: bool (anonymous namespace)::SILowerI1Copies::lowerCopiesToI1(): Assertion `TII->getRegisterInfo().getRegSizeInBits(SrcReg, *MRI) == 32' failed.

[build]   LLVM :: CodeGen/AMDGPU/dpp64_combine.ll
[build]   LLVM :: CodeGen/AMDGPU/fcanonicalize-elimination.ll
[build]   LLVM :: CodeGen/AMDGPU/fdiv-nofpexcept.ll
[build]   LLVM :: CodeGen/AMDGPU/fdiv.f16.ll
[build]   LLVM :: CodeGen/AMDGPU/llvm.powi.ll
[build]   LLVM :: CodeGen/AMDGPU/rsq.ll
[build]   LLVM :: CodeGen/AMDGPU/sgpr-copy.ll
[build]   LLVM :: CodeGen/AMDGPU/si-sgpr-spill.ll
[build]   LLVM :: CodeGen/AMDGPU/wave32.ll
`
foad added a comment.Sep 1 2022, 6:38 AM

I think if you fix the COPY to refer to $vcc_lo instead of $vcc (for wave 32) then SILowerI1Copies will be happy, because $vcc_lo will satisfy the isLaneMaskReg test. But there should be no need to change the regclass of the COPY.

I think if you fix the COPY to refer to $vcc_lo instead of $vcc (for wave 32) then SILowerI1Copies will be happy, because $vcc_lo will satisfy the isLaneMaskReg test. But there should be no need to change the regclass of the COPY.

It's what I was initially doing, but then it crashes on Wave64:

Process 397709 stopped
* thread #1, name = 'llc', stop reason = hit program assert
    frame #4: 0x0000555556f066c5 llc`(anonymous namespace)::SILowerI1Copies::lowerCopiesToI1(this=0x000055555f3a0ae0) at SILowerI1Copies.cpp:713:9
   710        assert(!MI.getOperand(1).getSubReg());
   711 
   712        if (!SrcReg.isVirtual() || (!isLaneMaskReg(SrcReg) && !isVreg1(SrcReg))) {
-> 713          assert(TII->getRegisterInfo().getRegSizeInBits(SrcReg, *MRI) == 32);
   714          unsigned TmpReg = createLaneMaskReg(*MF);
   715          BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_CMP_NE_U32_e64), TmpReg)
   716              .addReg(SrcReg)
(lldb) expr MI->dump()
  %7:sreg_64 = COPY $vcc
  Fix-it applied, fixed expression was: 
    MI.dump()
(lldb) expr TII->isWave32()
(bool) $0 = false

I checked with:

  • Not forcing the regbank to SReg
  • Adding v_cmp_ne_u64_e64` & supporting 64 bits sources in LowerI1Copies

It doesn't break codegen, and it no longer crashes, though we get weirdness like v_cmp_ne_u64_e64 vcc, vcc, 0 in the output because:

%14:sreg_32 = V_CMP_NE_U32_e64 0, $vcc_lo, implicit $exec
$vcc_lo = COPY killed %14:sreg_32

> $vcc_lo = COPY killed renamable $vcc_lo
Identity copy: $vcc_lo = COPY killed renamable $vcc_lo
  deleted.
renamable $vcc_lo = V_CMP_NE_U32_e64 0, $vcc_lo, implicit $exec
foad added a comment.Sep 1 2022, 7:08 AM

Well, I don't fully understand the condition on line 712:

if (!SrcReg.isVirtual() || (!isLaneMaskReg(SrcReg) && !isVreg1(SrcReg)))

I'm not sure what kind of copies from physical SGPRs this pass is (was) expecting to see here. I suppose an instruction like this could have two possible interpretations:

%0:vreg_1 = COPY $sgpr0
  1. Each bit of sgpr0 supplies an i1 value for the corresponding lane of the result.
  2. The whole of sgpr0 is either 0 or 1, and that value should be broadcast to every lane of the result.

Stepping back a bit, it feels like you are running into problems because you have a def and a use of vcc, where one of them is implicit and the other is explicit. If def and use were both implicit or both explicit then I think things might work better. Is there a way you can change v_div_scale so that the vcc operand is explicit?

Well, I don't fully understand the condition on line 712:

if (!SrcReg.isVirtual() || (!isLaneMaskReg(SrcReg) && !isVreg1(SrcReg)))

I'm not sure what kind of copies from physical SGPRs this pass is (was) expecting to see here. I suppose an instruction like this could have two possible interpretations:

%0:vreg_1 = COPY $sgpr0
  1. Each bit of sgpr0 supplies an i1 value for the corresponding lane of the result.
  2. The whole of sgpr0 is either 0 or 1, and that value should be broadcast to every lane of the result.

Stepping back a bit, it feels like you are running into problems because you have a def and a use of vcc, where one of them is implicit and the other is explicit. If def and use were both implicit or both explicit then I think things might work better. Is there a way you can change v_div_scale so that the vcc operand is explicit?

I initially wanted to do just that, but IIRC @arsenm told me that it's not a good idea to force a output operand to be a specific register by constraining it with a RegClass, instead, implicit definitions should be used. I also remember starting out this patch just like that but I ran into a lot more issues, especially with the register allocator.

I quickly checked and check-llvm-codegen-amdgpu passes if I skip lowering Copies to i1 for VCC/VCC_LO. Should we just do that? Not sure if it makes much sense to lower such copies.
Then the pass would be very minimal, just fixing up the copy, but I think it might even be possible to remove it entirely in that case.

foad added a comment.Sep 1 2022, 7:42 AM

I quickly checked and check-llvm-codegen-amdgpu passes if I skip lowering Copies to i1 for VCC/VCC_LO. Should we just do that?

My gut feeling is that that's not safe because VCC/VCC_LO should be treated the same as any other SGPRs. But really I'm out of my depth here and I don't know what the correct solution is.

I quickly checked and check-llvm-codegen-amdgpu passes if I skip lowering Copies to i1 for VCC/VCC_LO. Should we just do that?

My gut feeling is that that's not safe because VCC/VCC_LO should be treated the same as any other SGPRs. But really I'm out of my depth here and I don't know what the correct solution is.

I guess it depends on what the pass is trying to achieve. With my current understanding of how this all works, VCC is already a lane mask + it's uniform across the wave since it's a SGPR under the hood, so it shouldn't need special treatment to be passed around, no?

The pass uses V_CMP_NE_U32 to do the lowering/copy, which does:

D.u64[threadId] = (S0 <> S1).

So for v_cmp_ne_u64_e64 vcc, vcc, 0, this would just set the mask to all ones or zeroes depending on whether VCC is all zeroes, no? Then it doesn't make sense because it doesn't COPY VCC but instead changes it.

I'm also out of my depth here, I'm trying to piece this together but my current understanding would be that it doesn't make sense to do this lowering for copies of physical SGPRs to i1

This is really an overkill to have a separate pass to fix vcc copy of just 2 instructions. What if you change the lowering instead? I believe it shall be possible to w/a it by not reading the second result of the lowered DIV_SCALE and glue a CopyFromReg to it with a correct register, i.e. VCC or VCC_LO.

foad added a subscriber: jrbyrnes.Sep 9 2022, 1:34 AM

Stepping back a bit, it feels like you are running into problems because you have a def and a use of vcc, where one of them is implicit and the other is explicit. If def and use were both implicit or both explicit then I think things might work better. Is there a way you can change v_div_scale so that the vcc operand is explicit?

+ @jrbyrnes who did some work on implicit vs explicit uses of $scc in D128681.

Pierre-vh updated this revision to Diff 458989.Sep 9 2022, 2:14 AM

REmove pass

I got rid of the pass. I tried using a glued node

CurDAG->SelectNodeTo(N, Opc, CurDAG->getVTList(VT, MVT::i1, MVT::Glue), Ops);
SDValue CopyFromReg = CurDAG->getCopyFromReg(CurDAG->getEntryNode(), SL, TRI->getVCC(), MVT::i1, SDValue(N, 2));
CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 1), CopyFromReg);

and that doesn't work, it doesn't crash but it messes up scheduling tool much and we end up with things like this:

	v_div_scale_f32 v2, vcc, v0, v1, v0
	s_mov_b64 s[0:1], vcc
	v_div_scale_f32 v3, vcc, v1, v1, v0
	s_mov_b64 vcc, s[0:1]

Instead of just swapping the div_scales.

If I fix it later in finalizeLowering when we change the implicit VCC to a VCC_LO, then it works fine.

I got rid of the pass. I tried using a glued node

CurDAG->SelectNodeTo(N, Opc, CurDAG->getVTList(VT, MVT::i1, MVT::Glue), Ops);
SDValue CopyFromReg = CurDAG->getCopyFromReg(CurDAG->getEntryNode(), SL, TRI->getVCC(), MVT::i1, SDValue(N, 2));
CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 1), CopyFromReg);

and that doesn't work, it doesn't crash but it messes up scheduling tool much and we end up with things like this:

	v_div_scale_f32 v2, vcc, v0, v1, v0
	s_mov_b64 s[0:1], vcc
	v_div_scale_f32 v3, vcc, v1, v1, v0
	s_mov_b64 vcc, s[0:1]

Instead of just swapping the div_scales.

If I fix it later in finalizeLowering when we change the implicit VCC to a VCC_LO, then it works fine.

Did you change AMDGPUdiv_scale to produce glue? So that you have something to glue this copy to?

+ @jrbyrnes who did some work on implicit vs explicit uses of $scc in D128681.

Haven't looked fully into this review, but the work in D128681 seems orthogonal to the issue here -- the issue in that ticket was involving not honoring live ranges. Glued copy nodes that wrote to a PhysReg didn't check if the copy would clobber a live range.

Anyway, are the vcc COPYs needed here? Maybe I don't understand something, but it would be nice to see isel not emit the COPY and just rely on the implicit def/use of $vcc.

Check D133593, it tries to address a similar problem, but with SCC.

Check D133593, it tries to address a similar problem, but with SCC.

Thanks Stas -- I played with D133593 a bit. Looks like if we extend that to support VCC, then we don't need to emit those COPYs which I think will clear things up a bit for this.

Check D133593, it tries to address a similar problem, but with SCC.

Thanks Stas -- I played with D133593 a bit. Looks like if we extend that to support VCC, then we don't need to emit those COPYs which I think will clear things up a bit for this.

What is the status on this diff then? Do I just need to rebase on top of D133593, extending it with VCC support, and it'll fix the issues?

I got rid of the pass. I tried using a glued node

CurDAG->SelectNodeTo(N, Opc, CurDAG->getVTList(VT, MVT::i1, MVT::Glue), Ops);
SDValue CopyFromReg = CurDAG->getCopyFromReg(CurDAG->getEntryNode(), SL, TRI->getVCC(), MVT::i1, SDValue(N, 2));
CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 1), CopyFromReg);

and that doesn't work, it doesn't crash but it messes up scheduling tool much and we end up with things like this:

	v_div_scale_f32 v2, vcc, v0, v1, v0
	s_mov_b64 s[0:1], vcc
	v_div_scale_f32 v3, vcc, v1, v1, v0
	s_mov_b64 vcc, s[0:1]

Instead of just swapping the div_scales.

If I fix it later in finalizeLowering when we change the implicit VCC to a VCC_LO, then it works fine.

Did you change AMDGPUdiv_scale to produce glue? So that you have something to glue this copy to?

Isn't CurDAG->SelectNodeTo(N, Opc, CurDAG->getVTList(VT, MVT::i1, MVT::Glue), Ops); enough for that? Do I need to change the DAG Op definition itself?

Did you change AMDGPUdiv_scale to produce glue? So that you have something to glue this copy to?

Isn't CurDAG->SelectNodeTo(N, Opc, CurDAG->getVTList(VT, MVT::i1, MVT::Glue), Ops); enough for that? Do I need to change the DAG Op definition itself?

I think so, but I believe we have reached the conclusion we do not need this w/a. We are pretty confident it works as is and the w/a tends to be invasive. So I'd suggest to shelve this just in case, abandon the patch and reject the ticket.

Pierre-vh planned changes to this revision.Sep 12 2022, 12:07 PM
Pierre-vh removed reviewers: rampitec, andreadb, alex-t.

shelving for now