This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable carry out ADD/SUB operations divergence driven instruction selection.
ClosedPublic

Authored by alex-t on Apr 14 2020, 2:29 AM.

Details

Summary

This change enables all kind of carry out ISD opcodes to be selected according to the node divergence.

Diff Detail

Event Timeline

alex-t created this revision.Apr 14 2020, 2:29 AM
arsenm added inline comments.Apr 14 2020, 6:55 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1064

Shouldn't need to scan all uses?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3619–3622

This ignores other things it could be. You can just use .add()

3624–3626

You can just do COPY from SCC

You need to add tests for selection and moveToVALU, including immediates and wave32.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1065

Should it look thru PHI and COPY? It may need a helper function though.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3806

It can me immediate, maybe even a FI.

3871

Same here, it can be an immediate, right?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
5198

Formatting is off.

5232

Formatting is off.

5971

VCC_LO for wave32?

alex-t updated this revision to Diff 257667.Apr 15 2020, 4:11 AM
alex-t marked 3 inline comments as done.

Src1 immediate case handled. Formatting improved.

alex-t marked 5 inline comments as done.Apr 15 2020, 4:13 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1064

I assume that the one user selected to VALU is enough to make selecting carryout to SALU impractical. Also, how would you suggest to decide about VALU/SALU basing on the quantity of VALU users? Ratio VALU/SALU? Other heuristic?

1065

It works on the selection DAG on per block basis so no PHI nodes ever exist here. There are CopyToReg nodes representing cross block values. Those in order only have MVTs like i32, f32 etc...
Although, I could try to address their divergence using reg2value FunctionLoweringInfo map, similar to cross block input regs in isSDNodeSourceOfDivergence function. I'd consider this as a kind of further improvement.
Also, nobody guaranty the BBs selection order so the cross block scan may appear too complicated.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3624–3626

I did it this way because I noticed that the code in SIInstrInfo::CopyPhysReg restricted to 32bit destination.

if (RC == &AMDGPU::SReg_32_XM0RegClass ||
      RC == &AMDGPU::SReg_32RegClass) {
    if (SrcReg == AMDGPU::SCC) {
      BuildMI(MBB, MI, DL, get(AMDGPU::S_CSELECT_B32), DestReg)
          .addImm(1)
          .addImm(0);
      return;
    }

So, I expected this remark and postpone the decision for review :)
Would you mind me extending this to 64bit?

rampitec added inline comments.Apr 15 2020, 10:52 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3805

Either operand can be immediate.

3808

VReg_64? Since it did not fail anywhere this case must be not covered by any tests.

3867

Again it can be an immediate.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
5189

These are not necessarily registers too.

5216

Same here.

5963

use_nodbg_instructions()

arsenm added inline comments.Apr 15 2020, 11:12 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1064

There's already a isVGPRImm function, which should be functionally the same

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
919–924

This is just broken. We already run it, and there shouldn't' be a reason to involve SIFixupVectorISel

alex-t marked 2 inline comments as done.Apr 16 2020, 9:53 AM

You need to add tests for selection and moveToVALU, including immediates and wave32.

Carry outs - UADDO/USUBO are already covered by the existing uaddo.ll and usubo.ll. There exist examples with both divergent and uniform ISD::UADDO/ISD::USUBO nodes to select.
That test are already updated.
SIFixSGPRCopies::moveToVALU part is covered as well by the udiv64.ll, urem64.ll, sdiv64.ll, srem64.ll. All that tests contains identical tests one of which declared as kernel and therefore has uniform arguments and another one as function and has divergent arguments.
The former one contains uniform UADDO and ADDCARRY that are initially selected to S_ADD_CO_PSEUDO but later on converted in moveToVALU to V_ADD/V_ADDC.

So, I am only planning to add the pure selection MIR tests for ISD opcodes.
Wave64/32 tests are probably needed as wellas the immediate operands tests.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
919–924

As I have seen debugging, FinalizaISel gets invoked from the TargetPassConfig base class just after a bundle of passes defined as InstrSelector by the Target. So in our case - after createSIAddIMGInitPass. That's why I had to add this.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3808

I maybe misunderstand the documentation, but it says that the size we only can have 32bit immediate aa operand.
I also did some experiments with different targets (gfx600,900,1010) and always have seen that 64bit size constant was split into 2 32bit parts for addition.
Please correct me if I understand it in a wrong way.

rampitec added inline comments.Apr 16 2020, 10:40 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3808

Two lines below you are asking for sub0 of that RC. VGPR_32 does not have sub0.

alex-t updated this revision to Diff 259633.Apr 23 2020, 11:02 AM
alex-t marked 2 inline comments as done.

Both operands can be immediate - handled. Some other changes to follow up the discussion.

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1064

Not really. isVGPRImm walks along the use list and check if each use can accept SGPR or not. It also tries to commute operands if possible to make SGPR acceptable. The main goal is to select SGPR instead of VGPR to keep immediate whenever possible. This works in pair with the SIFoldOperands.

Here we try to workaround the following sub-optimal selection pattern:

	s_sub_i32 s0, s0, s1
	s_cselect_b64 s[0:1], 1, 0
	v_cndmask_b32_e64 v0, 0, 1, s[0:1]

that appears instead of the:

	v_sub_co_u32_e32 v4, vcc, s0, v4
	v_cndmask_b32_e64 v5, 0, 1, vcc

We need s_cselect to copy SCC to SReg_64.
In this sense, any explicit use of the conditional register pair leads to one extra instruction in case of uniform UADDO/USUBO.
Same time it saves one VGPR.
The only case that does not produce extra instruction is true carry out operations where the only user of the UADDO/USUBO value 2 is the ADD/SUBCARRY node.
So, I decide to change this for explicit check. This is over-conservative but it is okay for now.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3808

Yes. And it is exactly what is expected :)
getSubRegClass returns VGPR_32RegClass itself in this case. In fact id does not matter what it is.
buildExtractSubRegOrImm does not use SubRC argument if operand is immediate.

if (Op.isImm()) {
   if (SubIdx == AMDGPU::sub0)
     return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm()));
   if (SubIdx == AMDGPU::sub1)
     return MachineOperand::CreateImm(static_cast<int32_t>(Op.getImm() >> 32));

   llvm_unreachable("Unhandled register index for immediate");
 }

Once again, I maybe don't understand what your objection is about.

For the simple i64 immediate addition like this:

%add = add i64 20015998343286, %a

we generate carry out:

	s_add_u32 s0, s2, 0x56789876
	s_addc_u32 s1, s3, 0x1234

for uniform

or

	v_add_co_u32_e32 v0, vcc, 0x56789876, v0
	v_mov_b32_e32 v1, 0x1234
	v_addc_co_u32_e32 v1, vcc, 0, v1, vcc

for divergent.

So, why do we need VReg_64?

rampitec added inline comments.Apr 23 2020, 11:45 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3617

It can be immediate.

3798

You can probably use SIInstrInfo::getAddNoCarry() and extend it to produce sub as well or create e new helper. You are always using I32 version even if a no-carry U32 version is available.

3808

You are calling TRI->getSubRegClass(Src1RC, AMDGPU::sub1); on this RC. You want to have VGPR_32 as an answer. Even though getSubRegClass() may return RC itself if it does not have a requested subreg this sounds like a bug. It would be more natural for it to assert. To be on a safe side pass there VReg_64 to get the same VGPR_32 or just do not call it if it is an immediate.

alex-t updated this revision to Diff 259867.Apr 24 2020, 5:22 AM
alex-t marked 2 inline comments as done.

a few more corrections

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3798

Not sure which line this comment belong...
For LoOpc I indeed need carry out opcode.
getAddNoCarry returnc the addition that does not write the carry flag.

Anyway, the exact opcode is selected later on by the SIInstrInfo::pseudoToMCOpcode

renamable $vgpr0 = V_ADD_I32_e32 1450743926, killed $vgpr0, implicit-def $vcc, implicit $exec
renamable $vgpr1 = V_MOV_B32_e32 4660, implicit $exec
renamable $vgpr1 = V_ADDC_U32_e32 0, killed $vgpr1, implicit-def $vcc, implicit killed $vcc, implicit $exec

turns to the

	v_add_co_u32_e32 v0, vcc, 0x56789876, v0
	v_mov_b32_e32 v1, 0x1234
	v_addc_co_u32_e32 v1, vcc, 0, v1, vcc

for gfx9

but to the

	v_add_i32_e32 v0, vcc, 0x56789876, v0
	v_mov_b32_e32 v1, 0x1234
	v_addc_u32_e32 v1, vcc, 0, v1, vcc

for gfx6

3808

Fine, we have one weird piece - SIRegisterInfo::getSubRegClass that returns input argument back if does not succeed.
Then we have one more weird piece - SIInstrInfo::buildExtractSubRegOrImm that is in fact 2 separate functions. It does completely different things for register and immediate, and just ignores register class arguments for immediate. Unfortunately all arguments required. So, I have to pass to it register class and sub-register class despite of the fact they're not used.
Refactoring these 2 pieces should be separate change.
If you insist that using VGPR_32RegClass is misleading I have no choice but VReg_64RegClass, that is misleading either IMHO.
I'll have to add a FIXME comment to explain why we use VGPR_32RegClass for immediate that can be 32bit only.

I think the code is fine now except couple formatting comments.
What is missing specific selection tests.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
3798

Right, that is add which has these forms. If you need addc you have a dead carry out. So this is OK.

alex-t updated this revision to Diff 260996.Apr 29 2020, 12:53 PM

Selection specific test (carryout-selection.ll) added

Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2020, 12:53 PM

Can you add tests with both immediates? You will probably need to -start-before=amdgpu-isel or -O0 if it will be constant folded.
Also fix formatting issues.

Can you add tests with both immediates? You will probably need to -start-before=amdgpu-isel or -O0 if it will be constant folded.
Also fix formatting issues.

There is a reason for not adding such a test.
SelectionDAG::getNode() performs trivial constant folding irrelative of the optLevel.

See SelectionDAG.cpp::5469 for details

// Perform trivial constant folding.
if (SDValue SV = FoldConstantArithmetic(Opcode, DL, VT, {N1, N2}))
  return SV;

The result of the visiting ISD::ADD with 2 immediates looks like:

So the add is selected to 2 moves.

%add = add i64 20015998343286, 46117495621    to the 

%5:sreg_32 = S_MOV_B32 4671
%6:sreg_32 = S_MOV_B32 323599291
alex-t updated this revision to Diff 261189.Apr 30 2020, 5:18 AM

changed code passed through the clang-format

This revision is now accepted and ready to land.Apr 30 2020, 11:07 AM
This revision was automatically updated to reflect the committed changes.