We previously allowed any scalar register but that was incorrect, the only correct operand is VCC.
Details
- Reviewers
- None
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
- VCC encoding
- wave32 test is failing because the V_DIV_SCALE instruction has no wave32 variant, so I can't print VCC_LO. Ideally I'd print the implicit def (so it's always correct and not hardcoded like that) but I'm not sure it's possible?
I'll need some help in review. I think the VCC encoding is correct, but it should be expressed differently (perhaps by using the HWEncoding first 7 bits?)
For the wave32 test/having different asm depending on wave32/64 I'm not sure how to do that, this is where I'm stuck currently.
If we prohibit and SDTS except VCC it should be also prohibited in asm/disasm.
This is not a first instruction which can only have VCC as carry, see any VOP2be instructions, for example V_ADD_CO_U32.
It shall not have SDST at all and instead impdef VCC. Asm string operand is replaced in the real instruction depending on a wave size, see for example:
def _dpp_w32_gfx10 :
  Base_VOP2_DPP16<op, !cast<VOP2_DPP_Pseudo>(opName#"_dpp"), asmName> {
    string AsmDPP = !cast<VOP2_Pseudo>(opName#"_e32").Pfl.AsmDPP16;
    let AsmString = asmName # !subst("vcc", "vcc_lo", AsmDPP);
    let isAsmParserOnly = 1;
    let WaveSizePredicate = isWave32;
  }Plus you will need to enforce SDST encoding field to VCC for these instructions only, no need in a new SDstIsAlwaysVCC.
Will I need to create 2 variants of the instruction, a wave32/wave64 variant (e32/e64 if I understand correctly) for this to work?
Also for the asm/disasm, do I need to change the asmparser/add tests to verify everything other than VCC is rejected in the dst?
enforce SDST encoding field to VCC for these instructions only
How can I do this? Do I also follow the example of V_ADD_CO_U32?
You need a single pseudo, but 2 different real instructions with different WaveSizePredicate. V_ADD_CO_U32 is not a good example after all, it does that for sdwa/dpp, which it does not have. You can pick a similar example from VOPCInstructions.td looking for _w32/_w64 variants.
Note, this instruction only has _e64 variant.
Also for the asm/disasm, do I need to change the asmparser/add tests to verify everything other than VCC is rejected in the dst?
enforce SDST encoding field to VCC for these instructions only
How can I do this? Do I also follow the example of V_ADD_CO_U32?
Just enforce sdst field to !cast<int>(VCC.HWEncoding) (i.e. 0x6a) in the Real class.
Also note, existing mc tests shall fail when you do it. You will have to create w32/w64 tests for gfx10/11 and negative tests for non-vcc.
I've written the following for one of the  real instructions, and I get a encoding conflict (VCC/VCC_LO have the same encoding).
Am I doing this wrong?
Also, do I need to create a new "Real" class specifically for this, or should I just add a flag to the VOP_PROFILE & change the existing classes to create _w32/64 variants?
Thanks
multiclass VOP3be_VCCSDST_Real_vi<bits<10> op> {
  let sdst = !cast<int>(VCC_LO.HWEncoding), WaveSizePredicate = isWave32 in
  def _vi_w32 : VOP3_Real<!cast<VOP_Pseudo>(NAME#"_e64"), SIEncodingFamily.VI>,
                VOP3be_vi <op, !cast<VOP_Pseudo>(NAME#"_e64").Pfl>;
  let sdst = !cast<int>(VCC.HWEncoding), WaveSizePredicate = isWave64 in
  def _vi_w64 : VOP3_Real<!cast<VOP_Pseudo>(NAME#"_e64"), SIEncodingFamily.VI>,
                VOP3be_vi <op, !cast<VOP_Pseudo>(NAME#"_e64").Pfl>;
}[build] Decoding Conflict: [build] ................................1101000111100000.1101010........ [build] ................................1101000111100000................ [build] ................................110100.......................... [build] ................................................................ [build] V_DIV_SCALE_F32_vi_w32 ________________________________1101000111100000_1101010________ [build] V_DIV_SCALE_F32_vi_w64 ________________________________1101000111100000_1101010________ [build] Decoding Conflict: [build] ................................1101000111100001.1101010........ [build] ................................1101000111100001................ [build] ................................110100.......................... [build] ................................................................ [build] V_DIV_SCALE_F64_vi_w32 ________________________________1101000111100001_1101010________ [build] V_DIV_SCALE_F64_vi_w64 ________________________________1101000111100001_1101010________
Adding isAsmParserOnly solves it, but now it's InstrInfo that fails:
[build] error: Multiple matches found for `V_DIV_SCALE_F32_e64', for the relation `getMCOpcodeGen', row fields ["v_div_scale_f32_e64"], column `["0"]' [build] CurInstr: V_DIV_SCALE_F32_w64_gfx6_gfx7 [build] MatchInstr: V_DIV_SCALE_F32_w32_gfx6_gfx7
I'm going to update the diff with my latest (non-working) draft. It's a bit messy so I'll look into cleaning it up further.
Unfortunately I keep having the issue even after I removed the <gfx10 w32/w64 variants.
[build] error: Multiple matches found for `V_DIV_SCALE_F32_e64', for the relation `getMCOpcodeGen', row fields ["v_div_scale_f32_e64"], column `["6"]' [build] V_DIV_SCALE_F32_w64_gfx10 [build] V_DIV_SCALE_F32_w32_gfx10
Maybe I need to create an InstAlias for wave32/wave64? Not sure how it works
It compiles now, but it still does not print vcc_lo for wave32.ll and I'm not sure what else to try.
I spent a while comparing the detailed records of TableGen and found no major difference. I also tried
looking for example instructions that have this w32 variant (like v_cmp_lt) and found no special-casing anywhere in the codebase, so
i'm not sure what to do
So far, if I understand correctly, my changes only affect the asm parser, but not yet the printer, correct? That's why I'm not seeing vcc_lo in the output.
It seems like some changes are needed on that front but I'm not sure where/how as it looks like this is the first VOP3 instruction to need special casing to print vcc/vcc_lo?
Maybe in needsImpliedVcc?
I also get an assertion failure currently
[build] llvm-mc: ../include/llvm/MC/MCInst.h:81: int64_t llvm::MCOperand::getImm() const: Assertion `isImm() && "This is not an immediate"' failed. [build] PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace. [build] Stack dump: [build] 0. Program arguments: /home/pvanhout/work/trunk/llvm-project/llvm/build/bin/llvm-mc -arch=amdgcn -mcpu=gfx900 -show-encoding /home/pvanhout/work/trunk/llvm-project/llvm/test/MC/AMDGPU/gfx9_asm_vop3_e64.s [build] #0 0x0000555ed26ecdf3 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Support/Unix/Signals.inc:569:13 [build] #1 0x0000555ed26eb310 llvm::sys::RunSignalHandlers() /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Support/Signals.cpp:104:18 [build] #2 0x0000555ed26ed40f SignalHandler(int) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Support/Unix/Signals.inc:407:1 [build] #3 0x00007f9cdcf8d420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420) [build] #4 0x00007f9cdca2000b raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1 [build] #5 0x00007f9cdc9ff859 abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81:7 [build] #6 0x00007f9cdc9ff729 get_sysdep_segment_value /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:509:8 [build] #7 0x00007f9cdc9ff729 _nl_load_domain /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:970:34 [build] #8 0x00007f9cdca10fd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6) [build] #9 0x0000555ed240aae5 (/home/pvanhout/work/trunk/llvm-project/llvm/build/bin/llvm-mc+0x812ae5) [build] #10 0x0000555ed24079f9 llvm::AMDGPUInstPrinter::printInstruction(llvm::MCInst const*, unsigned long, llvm::MCSubtargetInfo const&, llvm::raw_ostream&) /home/pvanhout/work/trunk/llvm-project/llvm/build/lib/Target/AMDGPU/AMDGPUGenAsmWriter.inc:0:0 [build] #11 0x0000555ed2405659 llvm::AMDGPUInstPrinter::printInst(llvm::MCInst const*, unsigned long, llvm::StringRef, llvm::MCSubtargetInfo const&, llvm::raw_ostream&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp:54:3 [build] #12 0x0000555ed2618ab4 llvm::MCTargetStreamer::prettyPrintAsm(llvm::MCInstPrinter&, unsigned long, llvm::MCInst const&, llvm::MCSubtargetInfo const&, llvm::raw_ostream&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCStreamer.cpp:1065:1 [build] #13 0x0000555ed25d66cf llvm::SmallVectorBase<unsigned long>::size() const /home/pvanhout/work/trunk/llvm-project/llvm/build/../include/llvm/ADT/SmallVector.h:77:32 [build] #14 0x0000555ed25d66cf llvm::SmallString<128u>::str() const /home/pvanhout/work/trunk/llvm-project/llvm/build/../include/llvm/ADT/SmallString.h:260:64 [build] #15 0x0000555ed25d66cf llvm::SmallString<128u>::operator llvm::StringRef() const /home/pvanhout/work/trunk/llvm-project/llvm/build/../include/llvm/ADT/SmallString.h:270:39 [build] #16 0x0000555ed25d66cf (anonymous namespace)::MCAsmStreamer::emitInstruction(llvm::MCInst const&, llvm::MCSubtargetInfo const&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCAsmStreamer.cpp:2297:24 [build] #17 0x0000555ed227b8fc (anonymous namespace)::AMDGPUAsmParser::MatchAndEmitInstruction(llvm::SMLoc, unsigned int&, llvm::SmallVectorImpl<std::unique_ptr<llvm::MCParsedAsmOperand, std::default_delete<llvm::MCParsedAsmOperand>>>&, llvm::MCStreamer&, unsigned long&, bool) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp:0:9 [build] #18 0x0000555ed266d294 (anonymous namespace)::AsmParser::parseAndMatchAndEmitTargetInstruction((anonymous namespace)::ParseStatementInfo&, llvm::StringRef, llvm::AsmToken, llvm::SMLoc) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCParser/AsmParser.cpp:2387:27 [build] #19 0x0000555ed2661e60 (anonymous namespace)::AsmParser::parseStatement((anonymous namespace)::ParseStatementInfo&, llvm::MCAsmParserSemaCallback*) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCParser/AsmParser.cpp:2320:10 [build] #20 0x0000555ed265bbee (anonymous namespace)::AsmParser::Run(bool, bool) /home/pvanhout/work/trunk/llvm-project/llvm/build/../lib/MC/MCParser/AsmParser.cpp:1004:16 [build] #21 0x0000555ed2245247 AssembleInput(char const*, llvm::Target const*, llvm::SourceMgr&, llvm::MCContext&, llvm::MCStreamer&, llvm::MCAsmInfo&, llvm::MCSubtargetInfo&, llvm::MCInstrInfo&, llvm::MCTargetOptions const&) /home/pvanhout/work/trunk/llvm-project/llvm/build/../tools/llvm-mc/llvm-mc.cpp:344:13 [build] #22 0x0000555ed224402c main /home/pvanhout/work/trunk/llvm-project/llvm/build/../tools/llvm-mc/llvm-mc.cpp:0:11 [build] #23 0x00007f9cdca01083 __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:342:3 [build] #24 0x0000555ed224214e _start (/home/pvanhout/work/trunk/llvm-project/llvm/build/bin/llvm-mc+0x64a14e)
Yes, something along these lines and call printDefaultVccOperand. There are similar cases handled there but not 100% the same.
- Fix Parser/Printer
There is still a test (gfx10_asm_vop3.s) that I'm not sure how to update (so I deleted the checks for now)
I'd say the encoding needs to be updated everywhere, but should I duplicate the tests for w32/w64 then?
Or just use GFX10 check instead of W32/W64?
Moreover, there is no other test in that file that has a wavesize-dependent vcc/vcc_lo register usage so it'd
seem strange to leave div_scale there unless there's a pattern I'm not seeing.
Do I just not check for the "operands are not valid for this GPU or mode" errors and use GFX10 check, then update
all encodings as needed?
I feel like since another test already checks "operands are not valid for this GPU or mode", leaving the instructions
to just use VCC (or do half vcc_lo/half vcc) + using GFX10 check should be enough. What do you think?
| llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
|---|---|---|
| 1023 | TRI->getVCC(). | |
| 1025 | Is this really needed? | |
| llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp | ||
| 936–950 | Tabs. | |
| 961 | MIB->getNextNode() looks suspicious. | |
| 962 | TRI.getVCC(). | |
| llvm/test/CodeGen/AMDGPU/frem.ll | ||
| 2341–2345 | What happened here? | |
| llvm/test/MC/AMDGPU/gfx10_asm_vop3.s | ||
| 6796 | Replace s[0:1]/s0 with vcc/vcc_lo | |
| llvm/test/MC/AMDGPU/gfx7_asm_vop3_e64.s | ||
| 8815 ↗ | (On Diff #456362) | Encoding is wrong. | 
| llvm/test/MC/AMDGPU/vop3.s | ||
| 411 | Ditto. | |
| llvm/test/MC/AMDGPU/wave32.s | ||
| 386–388 | Please keep checks order so it is easy to see changes. | |
| llvm/test/MC/AMDGPU/wave_any.s | ||
| 201–202 | Ditto. | |
Fixing remaining issues.
I had to hack a bit the AsmParser to make it all fit together. I think InstAlias could be used to do the w32/w64 -> normal instruction translation automatically but I don't know how to use it properly.
| llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
|---|---|---|
| 1025 | Yes, otherwise a lot of tests fail with "Using an undefined physical register". | |
| llvm/test/CodeGen/AMDGPU/frem.ll | ||
| 2341–2345 | I think something is wrong with the way I replace SDValue(N, 1) in SelectDIV_SCALE but I'm not sure how to fix it. The addLiveIn is suspicious (I can't find any ISelDAGtoDAG impl that also uses it) but without it, it crashes. | |
| llvm/test/MC/AMDGPU/wave32.s | ||
| 386–388 | I didn't change the order. Do you mean the -ERR check should always be at the bottom? (e.g. change GFX1064 to GFX1032 instead of adding the -ERR suffix & checking the error message) | |
| llvm/test/MC/AMDGPU/wave_any.s | ||
| 201–202 | Is the check order the issue or the encoding? Encoding looks good to me | |
| llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s | ||
| 136 | Not sure why this popped up, I don't really understand the message | |
Rebase
Waiting for feedback on:
- AsmParser: Do I need to move to a InstAlias? Would it work?
- "Truncated display due to cycle limit": Not sure where to start looking with that one.
- DAGISel impl: Seems like it makes codegen worse in somee cases but I haven't found a cleaner way to adress the problem yet.
| llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s | ||
|---|---|---|
| 136 | Add --timeline-max-cycles=0 to the RUN line? All the other files in this directory have it. | |
So for the weird codegen in frem.ll, the issue is from the addLiveIn call I think.
In fact my changes to the DAGISel aren't needed I believe, but in practice there are segfaults in the following case: When a DIV_FMAS uses the result of DIV_SCALE.
In such cases, the following happens:
t92: f32,i1 = DIV_SCALE nofpexcept t59, t57, t59 ISEL: Starting selection on root node: t102: f32 = DIV_FMAS nofpexcept t101, t97, t100, t92:1 ISEL: Starting pattern match Initial Opcode index to 456907 TypeSwitch[f32] from 456914 to 456917 Skipped scope entry (due to false predicate) at index 456919, continuing at 456953 Creating new node: t127: ch,glue = CopyToReg t0, Register:i1 $vcc_lo, t92:1 Morphed node: t102: f32 = V_DIV_FMAS_F32_e64 nofpexcept TargetConstant:i32<0>, t101, TargetConstant:i32<0>, t97, TargetConstant:i32<0>, t100, TargetConstant:i1<0>, TargetConstant:i32<0>, t127:1 ISEL: Match complete!
Then the following MIR is emitted.
%23:vgpr_32 = nofpexcept V_DIV_SCALE_F32_e64 0, %16:vgpr_32, 0, %17:vgpr_32, 0, %16:vgpr_32, 0, 0, implicit-def $vcc, implicit $mode, implicit $exec %24:sreg_32 = COPY $vcc $vcc_lo = COPY %24:sreg_32 %29:vgpr_32 = nofpexcept V_DIV_FMAS_F32_e64 0, killed %28:vgpr_32, 0, %22:vgpr_32, 0, %27:vgpr_32, 0, 0, implicit $mode, implicit $vcc, implicit $exec
And SIInstrInfo.cpp asserts when lowering the copy:
real copy: renamable $vcc_lo = COPY $vcc llc: ../lib/Target/AMDGPU/SIInstrInfo.cpp:762: virtual void llvm::SIInstrInfo::copyPhysReg(llvm::MachineBasicBlock &, MachineBasicBlock::iterator, const llvm::DebugLoc &, llvm::MCRegister, llvm::MCRegister, bool) const: Assertion `AMDGPU::VGPR_32RegClass.contains(SrcReg)' failed.
I feel like that copy shouldn't be there in the first place, but it seems like the DAGISel is adding it automatically.
The pattern used to select the FMAS is:
class DivFmasPat<ValueType vt, Instruction inst, Register CondReg> : GCNPat<
  (AMDGPUdiv_fmas (vt (VOP3Mods vt:$src0, i32:$src0_modifiers)),
                  (vt (VOP3Mods vt:$src1, i32:$src1_modifiers)),
                  (vt (VOP3Mods vt:$src2, i32:$src2_modifiers)),
                  (i1 CondReg)),
  (inst $src0_modifiers, $src0, $src1_modifiers, $src1, $src2_modifiers, $src2)
>;
let WaveSizePredicate = isWave64 in {
def : DivFmasPat<f32, V_DIV_FMAS_F32_e64, VCC>;
def : DivFmasPat<f64, V_DIV_FMAS_F64_e64, VCC>;
}
let WaveSizePredicate = isWave32 in {
def : DivFmasPat<f32, V_DIV_FMAS_F32_e64, VCC_LO>;
def : DivFmasPat<f64, V_DIV_FMAS_F64_e64, VCC_LO>;
}Does anyone know where the issue lies? I'm a bit stuck there at the moment.
Is it the pattern that shouldn't be using VCC_LO like that? Or a missing case in SIInstrInfo.cpp ?
Maybe the register allocator should be more careful and transform the COPY into an EXTRACT_SUBREG when allocating %24 to vcc?
I believe so too. You are defining the vcc, you should not need it live to define it. Much less live at a function's entry.
| llvm/test/tools/llvm-mca/AMDGPU/gfx11-double.s | ||
|---|---|---|
| 136 | I don't think it changed, just content was added to it. I think the previous person that updated it didn't add the --timeline-max-cycles=0 and instead removed the check line with the cycles limit. I think I can revert all these and the test should still pass, it'll just cover less instructions? | |
So I think I found a solution that may work. The idea is just to:
- Remove the SDag changes- After digging deep, I found that the InstrEmitter is smart enough to infer that the returned i1 is the implicit def of the instruction. No changes needed so.
- However, it incorrectly assigns VReg_1 for the register class of the VCC copy it emits in InstrEmitter. This can be fixed in a pass.
 
- Add a small pass that runs before FixSGPRCopies that fixes the COPY of the implicit def for V_DIV_SCALE. Something like "FixVCCImpDefCopy". The pass itself is minimal and the logic boils down to:
// Iterate over all instructions, check opcode case AMDGPU::V_DIV_SCALE_F32_e64: case AMDGPU::V_DIV_SCALE_F64_e64: { TII->fixImplicitOperands(MI); // Check for COPY of VCC right after it and fix it too. auto NextI = std::next(I); if(NextI == MBB->end() || NextI->getOpcode() != AMDGPU::COPY || NextI->getOperand(1).getReg() != AMDGPU::VCC) { break; } if(TII->isWave32()) { NextI->getOperand(1).setReg(AMDGPU::VCC_LO); MRI->setRegClass(NextI->getOperand(0).getReg(), &AMDGPU::SReg_32_XM0_XEXECRegClass); } else { MRI->setRegClass(NextI->getOperand(0).getReg(), &AMDGPU::SReg_64_XEXECRegClass); } }
@foad, @rampitec do you think it's an acceptable solution? I can get all tests to pass with a seemingly normal codegen with that.
However, it incorrectly assigns VReg_1 for the register class of the VCC copy it emits in InstrEmitter.
Why is VReg_1 wrong? In some places that is the class we use for an SGPR (or SGPR pair) that represents an independent 1-bit value per lane. Why doesn't that work here?
It indeed seems to be used to represent a i1 when the value is divergent. Though I'm not sure why, but if I don't correct the regclass to an SReg, the LowerI1Copies pass cannot handle it.
Perhaps the issue lies in LowerI1Copies? I don't understand that pass well enough to know if the assertion is legitimate, or a sign of an unhandled case.
It seems like, to me, it shouldn't be lowering the VCC copy to a specific instruction but rather only lower it when it's actually needed (when the value needs to be preserved).
[build] llc: llvm/lib/Target/AMDGPU/SILowerI1Copies.cpp:713: bool (anonymous namespace)::SILowerI1Copies::lowerCopiesToI1(): Assertion `TII->getRegisterInfo().getRegSizeInBits(SrcReg, *MRI) == 32' failed. [build] LLVM :: CodeGen/AMDGPU/dpp64_combine.ll [build] LLVM :: CodeGen/AMDGPU/fcanonicalize-elimination.ll [build] LLVM :: CodeGen/AMDGPU/fdiv-nofpexcept.ll [build] LLVM :: CodeGen/AMDGPU/fdiv.f16.ll [build] LLVM :: CodeGen/AMDGPU/llvm.powi.ll [build] LLVM :: CodeGen/AMDGPU/rsq.ll [build] LLVM :: CodeGen/AMDGPU/sgpr-copy.ll [build] LLVM :: CodeGen/AMDGPU/si-sgpr-spill.ll [build] LLVM :: CodeGen/AMDGPU/wave32.ll `
I think if you fix the COPY to refer to $vcc_lo instead of $vcc (for wave 32) then SILowerI1Copies will be happy, because $vcc_lo will satisfy the isLaneMaskReg test. But there should be no need to change the regclass of the COPY.
It's what I was initially doing, but then it crashes on Wave64:
Process 397709 stopped
* thread #1, name = 'llc', stop reason = hit program assert
    frame #4: 0x0000555556f066c5 llc`(anonymous namespace)::SILowerI1Copies::lowerCopiesToI1(this=0x000055555f3a0ae0) at SILowerI1Copies.cpp:713:9
   710        assert(!MI.getOperand(1).getSubReg());
   711 
   712        if (!SrcReg.isVirtual() || (!isLaneMaskReg(SrcReg) && !isVreg1(SrcReg))) {
-> 713          assert(TII->getRegisterInfo().getRegSizeInBits(SrcReg, *MRI) == 32);
   714          unsigned TmpReg = createLaneMaskReg(*MF);
   715          BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_CMP_NE_U32_e64), TmpReg)
   716              .addReg(SrcReg)
(lldb) expr MI->dump()
  %7:sreg_64 = COPY $vcc
  Fix-it applied, fixed expression was: 
    MI.dump()
(lldb) expr TII->isWave32()
(bool) $0 = falseI checked with:
- Not forcing the regbank to SReg
- Adding v_cmp_ne_u64_e64` & supporting 64 bits sources in LowerI1Copies
It doesn't break codegen, and it no longer crashes, though we get weirdness like v_cmp_ne_u64_e64 vcc, vcc, 0 in the output because:
%14:sreg_32 = V_CMP_NE_U32_e64 0, $vcc_lo, implicit $exec $vcc_lo = COPY killed %14:sreg_32 > $vcc_lo = COPY killed renamable $vcc_lo Identity copy: $vcc_lo = COPY killed renamable $vcc_lo deleted. renamable $vcc_lo = V_CMP_NE_U32_e64 0, $vcc_lo, implicit $exec
Well, I don't fully understand the condition on line 712:
if (!SrcReg.isVirtual() || (!isLaneMaskReg(SrcReg) && !isVreg1(SrcReg)))
I'm not sure what kind of copies from physical SGPRs this pass is (was) expecting to see here. I suppose an instruction like this could have two possible interpretations:
%0:vreg_1 = COPY $sgpr0
- Each bit of sgpr0 supplies an i1 value for the corresponding lane of the result.
- The whole of sgpr0 is either 0 or 1, and that value should be broadcast to every lane of the result.
Stepping back a bit, it feels like you are running into problems because you have a def and a use of vcc, where one of them is implicit and the other is explicit. If def and use were both implicit or both explicit then I think things might work better. Is there a way you can change v_div_scale so that the vcc operand is explicit?
I initially wanted to do just that, but IIRC @arsenm told me that it's not a good idea to force a output operand to be a specific register by constraining it with a RegClass, instead, implicit definitions should be used. I also remember starting out this patch just like that but I ran into a lot more issues, especially with the register allocator.
I quickly checked and check-llvm-codegen-amdgpu passes if I skip lowering Copies to i1 for VCC/VCC_LO. Should we just do that? Not sure if it makes much sense to lower such copies.
Then the pass would be very minimal, just fixing up the copy, but I think it might even be possible to remove it entirely in that case.
I quickly checked and check-llvm-codegen-amdgpu passes if I skip lowering Copies to i1 for VCC/VCC_LO. Should we just do that?
My gut feeling is that that's not safe because VCC/VCC_LO should be treated the same as any other SGPRs. But really I'm out of my depth here and I don't know what the correct solution is.
I guess it depends on what the pass is trying to achieve. With my current understanding of how this all works, VCC is already a lane mask + it's uniform across the wave since it's a SGPR under the hood, so it shouldn't need special treatment to be passed around, no?
The pass uses V_CMP_NE_U32 to do the lowering/copy, which does:
D.u64[threadId] = (S0 <> S1).
So for v_cmp_ne_u64_e64 vcc, vcc, 0, this would just set the mask to all ones or zeroes depending on whether VCC is all zeroes, no? Then it doesn't make sense because it doesn't COPY VCC but instead changes it.
I'm also out of my depth here, I'm trying to piece this together but my current understanding would be that it doesn't make sense to do this lowering for copies of physical SGPRs to i1
This is really an overkill to have a separate pass to fix vcc copy of just 2 instructions. What if you change the lowering instead? I believe it shall be possible to w/a it by not reading the second result of the lowered DIV_SCALE and glue a CopyFromReg to it with a correct register, i.e. VCC or VCC_LO.
Stepping back a bit, it feels like you are running into problems because you have a def and a use of vcc, where one of them is implicit and the other is explicit. If def and use were both implicit or both explicit then I think things might work better. Is there a way you can change v_div_scale so that the vcc operand is explicit?
+ @jrbyrnes who did some work on implicit vs explicit uses of $scc in D128681.
I got rid of the pass. I tried using a glued node
CurDAG->SelectNodeTo(N, Opc, CurDAG->getVTList(VT, MVT::i1, MVT::Glue), Ops); SDValue CopyFromReg = CurDAG->getCopyFromReg(CurDAG->getEntryNode(), SL, TRI->getVCC(), MVT::i1, SDValue(N, 2)); CurDAG->ReplaceAllUsesOfValueWith(SDValue(N, 1), CopyFromReg);
and that doesn't work, it doesn't crash but it messes up scheduling tool much and we end up with things like this:
v_div_scale_f32 v2, vcc, v0, v1, v0 s_mov_b64 s[0:1], vcc v_div_scale_f32 v3, vcc, v1, v1, v0 s_mov_b64 vcc, s[0:1]
Instead of just swapping the div_scales.
If I fix it later in finalizeLowering when we change the implicit VCC to a VCC_LO, then it works fine.
Did you change AMDGPUdiv_scale to produce glue? So that you have something to glue this copy to?
+ @jrbyrnes who did some work on implicit vs explicit uses of $scc in D128681.
Haven't looked fully into this review, but the work in D128681 seems orthogonal to the issue here -- the issue in that ticket was involving not honoring live ranges. Glued copy nodes that wrote to a PhysReg didn't check if the copy would clobber a live range.
Anyway, are the vcc COPYs needed here? Maybe I don't understand something, but it would be nice to see isel not emit the COPY and just rely on the implicit def/use of $vcc.
What is the status on this diff then? Do I just need to rebase on top of D133593, extending it with VCC support, and it'll fix the issues?
Isn't CurDAG->SelectNodeTo(N, Opc, CurDAG->getVTList(VT, MVT::i1, MVT::Glue), Ops); enough for that? Do I need to change the DAG Op definition itself?
I think so, but I believe we have reached the conclusion we do not need this w/a. We are pretty confident it works as is and the w/a tends to be invasive. So I'd suggest to shelve this just in case, abandon the patch and reject the ticket.
TRI->getVCC().