This change enables all kind of carry out ISD opcodes to be selected according to the node divergence.
Details
Diff Detail
Event Timeline
You need to add tests for selection and moveToVALU, including immediates and wave32.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
---|---|---|
1087 | Should it look thru PHI and COPY? It may need a helper function though. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
3805 | It can me immediate, maybe even a FI. | |
3870 | Same here, it can be an immediate, right? | |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
5202 | Formatting is off. | |
5236 | Formatting is off. | |
5975 | VCC_LO for wave32? |
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
---|---|---|
1086 | I assume that the one user selected to VALU is enough to make selecting carryout to SALU impractical. Also, how would you suggest to decide about VALU/SALU basing on the quantity of VALU users? Ratio VALU/SALU? Other heuristic? | |
1087 | It works on the selection DAG on per block basis so no PHI nodes ever exist here. There are CopyToReg nodes representing cross block values. Those in order only have MVTs like i32, f32 etc... | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
3623–3625 | I did it this way because I noticed that the code in SIInstrInfo::CopyPhysReg restricted to 32bit destination. if (RC == &AMDGPU::SReg_32_XM0RegClass || RC == &AMDGPU::SReg_32RegClass) { if (SrcReg == AMDGPU::SCC) { BuildMI(MBB, MI, DL, get(AMDGPU::S_CSELECT_B32), DestReg) .addImm(1) .addImm(0); return; } So, I expected this remark and postpone the decision for review :) |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
3804 | Either operand can be immediate. | |
3807 | VReg_64? Since it did not fail anywhere this case must be not covered by any tests. | |
3866 | Again it can be an immediate. | |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
5193 | These are not necessarily registers too. | |
5220 | Same here. | |
5967 | use_nodbg_instructions() |
Carry outs - UADDO/USUBO are already covered by the existing uaddo.ll and usubo.ll. There exist examples with both divergent and uniform ISD::UADDO/ISD::USUBO nodes to select.
That test are already updated.
SIFixSGPRCopies::moveToVALU part is covered as well by the udiv64.ll, urem64.ll, sdiv64.ll, srem64.ll. All that tests contains identical tests one of which declared as kernel and therefore has uniform arguments and another one as function and has divergent arguments.
The former one contains uniform UADDO and ADDCARRY that are initially selected to S_ADD_CO_PSEUDO but later on converted in moveToVALU to V_ADD/V_ADDC.
So, I am only planning to add the pure selection MIR tests for ISD opcodes.
Wave64/32 tests are probably needed as wellas the immediate operands tests.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
918–923 | As I have seen debugging, FinalizaISel gets invoked from the TargetPassConfig base class just after a bundle of passes defined as InstrSelector by the Target. So in our case - after createSIAddIMGInitPass. That's why I had to add this. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
3807 | I maybe misunderstand the documentation, but it says that the size we only can have 32bit immediate aa operand. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
3807 | Two lines below you are asking for sub0 of that RC. VGPR_32 does not have sub0. |
Both operands can be immediate - handled. Some other changes to follow up the discussion.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp | ||
---|---|---|
1086 | Not really. isVGPRImm walks along the use list and check if each use can accept SGPR or not. It also tries to commute operands if possible to make SGPR acceptable. The main goal is to select SGPR instead of VGPR to keep immediate whenever possible. This works in pair with the SIFoldOperands. Here we try to workaround the following sub-optimal selection pattern: s_sub_i32 s0, s0, s1 s_cselect_b64 s[0:1], 1, 0 v_cndmask_b32_e64 v0, 0, 1, s[0:1] that appears instead of the: v_sub_co_u32_e32 v4, vcc, s0, v4 v_cndmask_b32_e64 v5, 0, 1, vcc We need s_cselect to copy SCC to SReg_64. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
3807 | Yes. And it is exactly what is expected :) if (Op.isImm()) { if (SubIdx == AMDGPU::sub0) return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm())); if (SubIdx == AMDGPU::sub1) return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm() >> 32)); llvm_unreachable("Unhandled register index for immediate"); } Once again, I maybe don't understand what your objection is about. For the simple i64 immediate addition like this: %add = add i64 20015998343286, %a we generate carry out: s_add_u32 s0, s2, 0x56789876 s_addc_u32 s1, s3, 0x1234 for uniform or v_add_co_u32_e32 v0, vcc, 0x56789876, v0 v_mov_b32_e32 v1, 0x1234 v_addc_co_u32_e32 v1, vcc, 0, v1, vcc for divergent. So, why do we need VReg_64? |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
3616 | It can be immediate. | |
3797 | You can probably use SIInstrInfo::getAddNoCarry() and extend it to produce sub as well or create e new helper. You are always using I32 version even if a no-carry U32 version is available. | |
3807 | You are calling TRI->getSubRegClass(Src1RC, AMDGPU::sub1); on this RC. You want to have VGPR_32 as an answer. Even though getSubRegClass() may return RC itself if it does not have a requested subreg this sounds like a bug. It would be more natural for it to assert. To be on a safe side pass there VReg_64 to get the same VGPR_32 or just do not call it if it is an immediate. |
a few more corrections
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
3797 | Not sure which line this comment belong... Anyway, the exact opcode is selected later on by the SIInstrInfo::pseudoToMCOpcode renamable $vgpr0 = V_ADD_I32_e32 1450743926, killed $vgpr0, implicit-def $vcc, implicit $exec renamable $vgpr1 = V_MOV_B32_e32 4660, implicit $exec renamable $vgpr1 = V_ADDC_U32_e32 0, killed $vgpr1, implicit-def $vcc, implicit killed $vcc, implicit $exec turns to the v_add_co_u32_e32 v0, vcc, 0x56789876, v0 v_mov_b32_e32 v1, 0x1234 v_addc_co_u32_e32 v1, vcc, 0, v1, vcc for gfx9 but to the v_add_i32_e32 v0, vcc, 0x56789876, v0 v_mov_b32_e32 v1, 0x1234 v_addc_u32_e32 v1, vcc, 0, v1, vcc for gfx6 | |
3807 | Fine, we have one weird piece - SIRegisterInfo::getSubRegClass that returns input argument back if does not succeed. |
I think the code is fine now except couple formatting comments.
What is missing specific selection tests.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
3797 | Right, that is add which has these forms. If you need addc you have a dead carry out. So this is OK. |
Can you add tests with both immediates? You will probably need to -start-before=amdgpu-isel or -O0 if it will be constant folded.
Also fix formatting issues.
There is a reason for not adding such a test.
SelectionDAG::getNode() performs trivial constant folding irrelative of the optLevel.
See SelectionDAG.cpp::5469 for details
// Perform trivial constant folding. if (SDValue SV = FoldConstantArithmetic(Opcode, DL, VT, {N1, N2})) return SV;
The result of the visiting ISD::ADD with 2 immediates looks like:
So the add is selected to 2 moves.
%add = add i64 20015998343286, 46117495621 to the %5:sreg_32 = S_MOV_B32 4671 %6:sreg_32 = S_MOV_B32 323599291
Shouldn't need to scan all uses?