This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] DPP combiner: recognize identities for more opcodes
Closed, Public

Authored by foad on Jul 4 2019, 7:03 AM.

Details

Summary

This allows the DPP combiner to kick in more often. For example, the
exclusive scan generated by the atomic optimizer for a divergent atomic
add used to look like this:

v_mov_b32_e32 v3, v1
v_mov_b32_e32 v5, v1
v_mov_b32_e32 v6, v1
v_mov_b32_dpp v3, v2  wave_shr:1 row_mask:0xf bank_mask:0xf
s_nop 1
v_add_u32_dpp v4, v3, v3  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
v_mov_b32_dpp v5, v3  row_shr:2 row_mask:0xf bank_mask:0xf
v_mov_b32_dpp v6, v3  row_shr:3 row_mask:0xf bank_mask:0xf
v_add3_u32 v3, v4, v5, v6
v_mov_b32_e32 v4, v1
s_nop 1
v_mov_b32_dpp v4, v3  row_shr:4 row_mask:0xf bank_mask:0xe
v_add_u32_e32 v3, v3, v4
v_mov_b32_e32 v4, v1
s_nop 1
v_mov_b32_dpp v4, v3  row_shr:8 row_mask:0xf bank_mask:0xc
v_add_u32_e32 v3, v3, v4
v_mov_b32_e32 v4, v1
s_nop 1
v_mov_b32_dpp v4, v3  row_bcast:15 row_mask:0xa bank_mask:0xf
v_add_u32_e32 v3, v3, v4
s_nop 1
v_mov_b32_dpp v1, v3  row_bcast:31 row_mask:0xc bank_mask:0xf
v_add_u32_e32 v1, v3, v1
v_add_u32_e32 v1, v2, v1
v_readlane_b32 s0, v1, 63

But now most of the dpp movs are combined into adds:

v_mov_b32_e32 v3, v1
v_mov_b32_e32 v5, v1
s_nop 0
v_mov_b32_dpp v3, v2  wave_shr:1 row_mask:0xf bank_mask:0xf
s_nop 1
v_add_u32_dpp v4, v3, v3  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
v_mov_b32_dpp v5, v3  row_shr:2 row_mask:0xf bank_mask:0xf
v_mov_b32_dpp v1, v3  row_shr:3 row_mask:0xf bank_mask:0xf
v_add3_u32 v1, v4, v5, v1
s_nop 1
v_add_u32_dpp v1, v1, v1  row_shr:4 row_mask:0xf bank_mask:0xe
s_nop 1
v_add_u32_dpp v1, v1, v1  row_shr:8 row_mask:0xf bank_mask:0xc
s_nop 1
v_add_u32_dpp v1, v1, v1  row_bcast:15 row_mask:0xa bank_mask:0xf
s_nop 1
v_add_u32_dpp v1, v1, v1  row_bcast:31 row_mask:0xc bank_mask:0xf
v_add_u32_e32 v1, v2, v1
v_readlane_b32 s0, v1, 63
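
To illustrate why folding a dpp mov into the following add is sound, here is a minimal sketch that models a single 16-lane row in plain C++. It is not the combiner's code: all names (movDpp, addDpp, combineIsEquivalent) are hypothetical, only row_shr and bank_mask are modelled, and row_mask, bound_ctrl, wave_shr and row_bcast are ignored. The assumption is that lanes not written by a DPP instruction keep the destination's previous ("old") value.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kRowSize = 16;  // lanes per DPP row in this simplified model
using Row = std::array<uint32_t, kRowSize>;

// A lane is written only if its bank is enabled in bank_mask and the
// row_shr source lane lies inside the row (bound_ctrl assumed off).
bool laneIsWritten(int lane, int shift, unsigned bankMask) {
  const bool bankEnabled = ((bankMask >> (lane / 4)) & 1u) != 0;
  const bool sourceInRow = lane - shift >= 0;
  return bankEnabled && sourceInRow;
}

// Model of "v_mov_b32_dpp dst, src row_shr:shift" with dst preloaded with old.
Row movDpp(const Row &old, const Row &src, int shift, unsigned bankMask) {
  assert(shift > 0 && shift < kRowSize);
  Row dst = old;
  for (int lane = 0; lane < kRowSize; ++lane)
    if (laneIsWritten(lane, shift, bankMask))
      dst[lane] = src[lane - shift];
  return dst;
}

// Model of "v_add_u32_dpp dst, src0, src1 row_shr:shift" with dst preloaded
// with old; src0 is the operand read through DPP.
Row addDpp(const Row &old, const Row &src0, const Row &src1, int shift,
           unsigned bankMask) {
  assert(shift > 0 && shift < kRowSize);
  Row dst = old;
  for (int lane = 0; lane < kRowSize; ++lane)
    if (laneIsWritten(lane, shift, bankMask))
      dst[lane] = src0[lane - shift] + src1[lane];
  return dst;
}

// Model of a plain per-lane "v_add_u32_e32 dst, a, b".
Row add(const Row &a, const Row &b) {
  Row dst{};
  for (int lane = 0; lane < kRowSize; ++lane)
    dst[lane] = a[lane] + b[lane];
  return dst;
}

// Compares "mov_dpp tmp (old = oldValue); add dst, src, tmp" against the
// combined "add_dpp dst (old = src), src, src".
bool combineIsEquivalent(const Row &src, uint32_t oldValue, int shift,
                         unsigned bankMask) {
  Row oldRow;
  oldRow.fill(oldValue);
  const Row before = add(src, movDpp(oldRow, src, shift, bankMask));
  const Row after = addDpp(src, src, src, shift, bankMask);
  return before == after;
}
```

In this model the two sequences agree when the mov's old value is 0, the identity for add, because unwritten lanes then contribute src + 0. With a nonzero old value the unwritten lanes differ, so the fold would not be valid.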

Also fix some typos in comments and debug output.
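
As a rough illustration of what "recognizing identities" means for different operations, the sketch below lists the arithmetic identity value of several 32-bit operations and checks that op(x, identity) == x. The Op enum and function names are hypothetical, and the set of operations is an assumption chosen for illustration; the summary does not say which opcodes the patch actually covers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical operation kinds; not LLVM opcode names.
enum class Op { Add, Or, Xor, And, UMin, UMax, SMin, SMax, Mul };

const Op kAllOps[] = {Op::Add,  Op::Or,   Op::Xor,  Op::And, Op::UMin,
                      Op::UMax, Op::SMin, Op::SMax, Op::Mul};

// Applies the operation to two 32-bit values.
uint32_t applyOp(Op op, uint32_t a, uint32_t b) {
  const int32_t sa = static_cast<int32_t>(a);
  const int32_t sb = static_cast<int32_t>(b);
  switch (op) {
  case Op::Add:  return a + b;
  case Op::Or:   return a | b;
  case Op::Xor:  return a ^ b;
  case Op::And:  return a & b;
  case Op::UMin: return std::min(a, b);
  case Op::UMax: return std::max(a, b);
  case Op::SMin: return static_cast<uint32_t>(std::min(sa, sb));
  case Op::SMax: return static_cast<uint32_t>(std::max(sa, sb));
  case Op::Mul:  return a * b;
  }
  return 0;
}

// Returns the value e such that applyOp(op, x, e) == x for every x.
uint32_t identityValue(Op op) {
  switch (op) {
  case Op::Add:
  case Op::Or:
  case Op::Xor:
  case Op::UMax:
    return 0u;
  case Op::And:
  case Op::UMin:
    return std::numeric_limits<uint32_t>::max();
  case Op::SMin:
    return static_cast<uint32_t>(std::numeric_limits<int32_t>::max());
  case Op::SMax:
    return static_cast<uint32_t>(std::numeric_limits<int32_t>::min());
  case Op::Mul:
    return 1u;
  }
  return 0u;
}

// True if value could serve as the "old" operand of a foldable dpp mov
// feeding this operation, in the simplified sense used here.
bool isIdentityValue(Op op, uint32_t value) {
  return value == identityValue(op);
}
```

The point of such a table is that a dpp mov whose old value is the identity of the consuming operation leaves unwritten lanes neutral, which is the precondition for folding the mov into that operation.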

Diff Detail

Repository
rL LLVM

Event Timeline

foad created this revision. Jul 4 2019, 7:03 AM

I'm not sure whether e64 instructions have modifiers that cannot be encoded in the DPP version; I need to check. Otherwise this looks good, though I would split the typo corrections into a separate patch and submit it without review.

I think modifiers are checked correctly by the existing code, but can you add a test for e64 encodings into dpp_combine.mir similar to what is under "check for floating point modifiers" comment?

foad updated this revision to Diff 208048. Jul 4 2019, 8:09 AM

Typo fixes have been committed separately.

foad added a comment. Jul 4 2019, 8:59 AM

I think modifiers are checked correctly by the existing code, but can you add a test for e64 encodings into dpp_combine.mir similar to what is under "check for floating point modifiers" comment?

I'm trying but I don't know enough about Machine IR. A typical e64 instruction from my dumps is:

%20:vgpr_32 = V_ADD_U32_e64 %18:vgpr_32, killed %19:vgpr_32, 0, implicit $exec

but I don't know what the third operand (the 0) means or what other values it can take.

Right, this is hard to follow even for me :). The 3rd operand is src1_modifiers; you can use a junk value for it to check that the DPP combiner doesn't crash and doesn't combine it.

foad added a comment. Jul 5 2019, 1:21 AM

Right, this is hard to follow even for me :). The 3rd operand is src1_modifiers; you can use a junk value for it to check that the DPP combiner doesn't crash and doesn't combine it.

I still can't understand the instruction tables but according to my debugger, the third operand is clamp:

(gdb) call OrigMI.dump()
  %10:vgpr_32 = V_ADD_U32_e64 %9:vgpr_32, %1:vgpr_32, 99, implicit $exec
(gdb) p TII->getNamedOperand(OrigMI, AMDGPU::OpName::src0_modifiers)
$5 = (llvm::MachineOperand *) 0x0
(gdb) p TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1_modifiers)
$6 = (llvm::MachineOperand *) 0x0
(gdb) p TII->getNamedOperand(OrigMI, AMDGPU::OpName::clamp)
$7 = (llvm::MachineOperand *) 0x6d1eef0
(gdb) p TII->getNamedOperand(OrigMI, AMDGPU::OpName::omod)
$8 = (llvm::MachineOperand *) 0x0
(gdb) call TII->getNamedOperand(OrigMI, AMDGPU::OpName::clamp)->dump()
99

foad updated this revision to Diff 208125. Jul 5 2019, 1:37 AM

Add a test case, and change the tests to run on gfx9 because the new
test case needs add-no-carry instructions.

vpykhtin accepted this revision. Jul 5 2019, 4:13 AM

LGTM. Thank you!

This revision is now accepted and ready to land. Jul 5 2019, 4:13 AM
This revision was automatically updated to reflect the committed changes.