AMDGPU: Combine DPP mov with use instuctions (VOP1/2/3)
AcceptedPublic

Authored by vpykhtin on Fri, Oct 26, 7:44 AM.

Details

Reviewers
rampitec
arsenm
tpr
b-sumner
kzhuravl
Group Reviewers
Restricted Project
Summary

The change adds DPP instruction pseudos and the pass combining V_MOV_B32_dpp instruction with its VALU uses as a DPP src0 operand. If any of the use instruction cannot be combined with the mov the whole sequence is reverted.

$old = ...
$dpp_value = V_MOV_B32_dpp $old, $vgpr_to_be_read_from_other_lane, dpp_controls..., $bound_ctrl
$res = VALU $dpp_value, ...

to

$res = VALU_DPP $folded_old, $dpp_value, ..., dpp_controls..., $folded_bound_ctrl

Combining rules:

$bound_ctrl is DPP_BOUND_ZERO, $old is any
$bound_ctrl is DPP_BOUND_OFF, $old is 0

-> $folded_old = undef, $folded_bound_ctrl = DPP_BOUND_ZERO

$bound_ctrl is DPP_BOUND_OFF, $old is undef

-> $folded_old = undef, $folded_bound_ctrl = DPP_BOUND_OFF

$bound_ctrl is DPP_BOUND_OFF, $old is foldable

-> $folded_old = folded value, $folded_bound_ctrl = DPP_BOUND_OFF

FMAC and MAC instructions doesn't have $old operand, as it already have tied $src2.

Combining into VOPC instructions ins't supported with this patch (to be added later)

Diff Detail

vpykhtin created this revision.Fri, Oct 26, 7:44 AM
rampitec added inline comments.Fri, Oct 26, 10:06 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
287

AMDGPU::NoRegister

295

AMDGPU::NoRegister

317

use_nodbg_operands()

322

There must be a check somewhere that use is not a SDWA or non-trivial OpSel, as these are incompatible with DPP. Also there must be a negative test for these.

lib/Target/AMDGPU/SIInstrInfo.cpp
5459

static?

5466

I think it should be a part of SIInstrInfo, not llvm namespace.

5503

Also a part of SIInstrInfo.

lib/Target/AMDGPU/SIInstrInfo.h
916

Doesn't it and getRegSubRegPair belong to SIRegisterInfo?

vpykhtin added inline comments.Fri, Oct 26, 10:13 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
287

I wish I knew that before :-)

322

this is handled by by non-dpp -> dpp instuction map, see getDPPOp at line 131.

lib/Target/AMDGPU/SIInstrInfo.h
916

You're right but it rather belongs to MachineRegisterInfo. I din't put it into SIRegisterInfo so that there is no need to cast to it every time.

rampitec added inline comments.Fri, Oct 26, 10:23 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
322

SDWA maybe, but you can still use it with OpSel when it is trivial (selects [lo, hi] for all operands).

In anyway negative tests are needed for that.

lib/Target/AMDGPU/SIInstrInfo.h
916

They are global helpers, I do not see what to cast ;)
I also agree they better fit common MRI.

vpykhtin added inline comments.Fri, Oct 26, 10:45 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
322

I agree with tests, but I'm not sure I understood. There is no way to convert SDWA to DPP, right?

lib/Target/AMDGPU/SIInstrInfo.h
916

I mean MRI should be casted to SIRegiterInfo before use

rampitec added inline comments.Fri, Oct 26, 11:53 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
322

SDWA not, but can be possible with opsel if it does not select a partial operand or shuffle, i.e. if it is only used as opsel because encoding requires it.

lib/Target/AMDGPU/SIInstrInfo.h
916

Ok.

vpykhtin updated this revision to Diff 172353.Fri, Nov 2, 7:31 AM
vpykhtin retitled this revision from AMDGPU: Combine DPP mov with use instuctions to AMDGPU: Combine DPP mov with use instuctions (VOP1/2/3).
vpykhtin edited the summary of this revision. (Show Details)

Fixed per review issues.

Added checks on non-default VOP3 modifiers with negative mir test on these. Src operand modifiers are checked to have only ABS or NEG (allowed by DPP) which effectively prevents other modifiers such non-default OpSel modifiers from being combined. OpSel tests aren't included as there're no opsel pseudos yet.

vpykhtin marked 17 inline comments as done.Fri, Nov 2, 7:34 AM

You do not need a special pseudo for opsel, just use VOP3 and immediate value for modifier. This is much like you did above for omod.

lib/Target/AMDGPU/GCNDPPCombine.cpp
351

AFAIR proper default value for opsel has 1 bit set (which means use high bits for high part).

arsenm added inline comments.Fri, Nov 2, 10:09 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
10–11

The comments in your commit message would be more useful here

229–230

Why do you need to handle these cases? I would also use uint32_t instead of unsigned

324–325

New lines

409

Why do you need to collect a separate list of the moves in the whole function? Can you just use the dfs iterator to avoid this?

rampitec added inline comments.Fri, Nov 2, 10:26 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
351

After discussion that is not the case here. You cannot process VOP3P, so you do not have op_sel_hi for which default is 1. I.e. this mask is ok, just add a test with non-zero opsel.

vpykhtin added inline comments.Fri, Nov 2, 10:40 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
229–230

These cases are for the situation when bound_ctrl = 0 (result write disable), meaning that mov result would be the value of 'old' for inactive lanes. If we know the immediate for old we can calculate (is some cases) old value for the VALU operation which isn't the same as for the mov. Otherwise the combining would fail.

For example:

v1 = ...
v10 = ... // other lane reg

v0 = v_mov_b32 1
v0 = v_mov_b32_dpp v10, ..., 0 // bound_ctrl == write disable
v2 = v_mul_u32_u24_e32 v0, v1

in this case we know v0 for inactive lanes would be 1 (identity for mul). This makes possible to use v1 value as the result of the mul for inactive lanes:

v1 = v_mul_u32_u24_dpp v10, v1, ..., 0 // bound_ctrl == write disable
v2 = v_mov_b32 v1

Othervise the combining isn't possible for bound_ctrl == write disable.

409

ok

vpykhtin updated this revision to Diff 173169.Thu, Nov 8, 8:09 AM

fixed per review issues:

  • Added test on non-abs|neg modifiers,
  • replaced accumulation vector for DPP moves with reverse iteration on BB.
vpykhtin marked 7 inline comments as done.Thu, Nov 8, 8:11 AM
This revision is now accepted and ready to land.Thu, Nov 8, 11:49 AM