This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] SDWA: add support for PRESERVE into SDWA peephole. Add new merge SDWA preserve pass
ClosedPublic

Authored by SamWot on Sep 13 2017, 10:42 AM.

Details

Summary

SDWA instructions support several values of dst_unused operand. One of this is UNUSED_PRESERVE. This value means that parts of destination register that are not wrote by SDWA instruction would not be changed. Currently SDWA peephole pass doesn't generate UNUSED_PRESERVE. It only generates UNUSED_PAD value that means that unused parts of dst register would be set to 0.
Big problem with UNUSED_PRESERVE is that by its nature it can't be represented in SSA form. PRESERVE assumes that register that it writes into was already wrote by some other instruction and our SDWA instruction keeps this value intact.
Another problem is that in AMDGPU backend smallest sub-reg is 32-bit wide and SDWA needs smaller so support for PRESERVE can't be done with subregs.
For those reasons support for PRESERVE for split into 2 major parts. First - changes in SDWA peephole pass that allows it to recognize patterns for PRESERVE and generate according instruction. This pass works on SSA machine code and generates SSA compatible code. Second part - new pass that works on non-SSA code and converts code generated by SDWA peephole into correct code.

  1. Changes in SDWA peephole:

There were several changes in SDWA peephole.
a. First of all there was added new pattern to match for PRESERVE operand. This patterns looks for V_OR_B32 instruction with one of operands that is result of SDWA instruction. Second operand of V_OR_B32 should be instruction that is compatible bit-wise with SDWA instruction (there destination don't overlap). E.g. match:

v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src1_sel:WORD_1 src2_sel:WORD_1
v_add_f16_e32 v3, v1, v2
v_or_b32_e32 v4, v0, v3

Into: SDWA preserve dst:v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE preserve:v3

Then this mathced SDWA preserve pattern is converted into SDWA with preserve. During conversion V_OR_B#@ instruction is replaced by SDWA instruction with dst_unused set to UNUSED_PRESERVE. Original instruction is removed. And new instruction gets additional implicit use-operand which is destination of second operand of V_OR_B32 (register that should be preserved):

v_add_f16_e32 v3, v1, v2
v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3

Problem with this match process is that currently it only works if both instructions were SDWA instructions. Reason is that to be able to match to instructions we should check that those two instructions are compatible to match - meaning that they write different parts of destination register. But currently there is no way to determine if regular instruction writes not whole destination register. E.g. we can't understand that V_ADD_F16 only writes low 16-bit of destination and high 16-bit are irrelevant. This can be determined only for SDWA instruction by looking into dst_sel operand. So for now this pattern only metch 2 SDWA instructions. Ability to match regular instructions would be added later.

b. Second big change in SDWA peephole pass is that now it tries to apply match patterns several times until it can't convert any new instruction. This is needed because (as said earlier) PRESERVE pattern need to match SDWA instructions but SDWA instruction apear (in most cases) only after SDWA peephole. So to be able to match PRESERVE pattern we first apply all other patterns that generate regular SDWA instructions and then on second try we apply PRESERVE pattern to SDWA instructions generated on first try.

  1. New pass - merge SDWA preserve pass:

This pass is needed to convert SSA code generated by SDWA peephole pass into non-SSA correct code. It works after PHI-elimination pass where it is possible to generate non-SSA code.
This pass looks for SDWA instructions with dst_unused set to UNUSED_PRESERVE. In those instructions it looks for implicit register operand (which is added by SDWA peephole pass). This register is the one that should be reserved. Id such instruction is found then this pass changes destination register of this SDWA instruction to implicit register and creates copy from implicit register to original destination of SDWA instruction. E.g. instruction generated by SDWA peephole:

v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3

Would be converted into:

v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3
v_mov_b32 v4, v3

Putting it all together original sequence of instructions:

v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src1_sel:WORD_1 src2_sel:WORD_1
v_add_f16_e32 v3, v1, v2
v_or_b32_e32 v4, v0, v3

Would be converted into:

v_add_f16_e32 v3, v1, v2
v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3
v_mov_b32 v4, v3

Diff Detail

Repository
rL LLVM

Event Timeline

SamWot created this revision.Sep 13 2017, 10:42 AM
arsenm edited edge metadata.Sep 13 2017, 11:25 AM

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

rampitec edited edge metadata.Sep 13 2017, 3:25 PM

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

Do not we want to add bits to the instruction describing it preserves low or high half?
Adding new subregs would be quite painful, as we already have too much registers for RA and LIS to work fast and optimal.

rampitec added inline comments.Sep 13 2017, 3:40 PM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
821 ↗(On Diff #115070)

I do not think we need it under fast RA.

lib/Target/AMDGPU/SIMergeSDWAPreserve.cpp
115 ↗(On Diff #115070)

We can have more than EXEC here. Check it is a virtual register instead, or even a VGPR?

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
283 ↗(On Diff #115070)

Such things shall be functions, function templates, anything but defines. In particular that is very hard to debug.

853 ↗(On Diff #115070)

Needs cast from unsigned or use unsigned for SDWAOpcode/Opcode.

1052 ↗(On Diff #115070)

This loop itself probably deserves a separate change.

test/CodeGen/AMDGPU/sdwa-merge-preserve.mir
1 ↗(On Diff #115070)

Add -verify-machineinstrs to run lines.

test/CodeGen/AMDGPU/sdwa-preserve.mir
1 ↗(On Diff #115070)

Add -verify-machineinstrs

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

Do not we want to add bits to the instruction describing it preserves low or high half?
Adding new subregs would be quite painful, as we already have too much registers for RA and LIS to work fast and optimal.

That's another partial option, but won't solve the suboptimal RA. I think we need to try and see what the impact actually ends up being. These are more constrained since you sort of can't directly address the high component usually (i.e. the high component isn't actually separately allocatable).

SamWot updated this revision to Diff 115219.Sep 14 2017, 7:35 AM
SamWot marked 3 inline comments as done.

Resolved some issues

In any case independent of sub register questions, I think this would be better off done in the existing pass by adding variants with tied operands. This is how I am handling this problem currently in D38070/D38071 for mad_mix

In any case independent of sub register questions, I think this would be better off done in the existing pass by adding variants with tied operands. This is how I am handling this problem currently in D38070/D38071 for mad_mix

This is actually a better idea than using an implicit operand and having IR potentially broken in between of two passes.

SamWot updated this revision to Diff 117831.Oct 5 2017, 8:53 AM

Removed SIMergeSDWAPreserve pass.
Use tied registers to achieve same results

arsenm added inline comments.Oct 27 2017, 2:30 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
132 ↗(On Diff #117831)

Probably should use LLVM_ENABLE_DUMP instead of NDEBUG. Also should add the matching dump() with LLVM_DUMP_METHOD.

292 ↗(On Diff #117831)

use_nodbg_operands? Also space before :

293 ↗(On Diff #117831)

C++ style comments

294–295 ↗(On Diff #117831)

These various checks shouldn't be necessary in SSA. You can't have a def of a specific subregister (unless maybe there is a physical register which should probably be skipped anyway).

313 ↗(On Diff #117831)

This seems to b be re-inventing MRI.getVRegDef/MRI.getUniqueVRegDef?

467 ↗(On Diff #117831)

This whole loop is just MRI.clearKillFlags()

487–488 ↗(On Diff #117831)

This is still manually tying the result operand. I was expecting another set of _sdwa opcodes with the preserve behavior with the tied operand statically known in the instruction definition. Manually tying this way is potentially hazardous because the verifier won't check it, and it makes it easier for another pass to accidentally drop the tied operand. I would expect there to be an InstrMapping table between the SDWA opcode and the SWDA with preserve set versions.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1317–1318 ↗(On Diff #117831)

How are these different from the various MCRegisterInfo functions for checking if registers alias or have subreg relationships?

SamWot updated this revision to Diff 121268.Nov 2 2017, 3:32 AM
SamWot marked 7 inline comments as done.

Fixed latests issues from arsenm

SamWot added inline comments.Nov 2 2017, 3:40 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
487–488 ↗(On Diff #117831)

I thought that having separate instruction definition for just preserve case would be overkill. I didn't want to introduce another new kind of SDWA instruction that would only bloat already huge set of instruction definitions.
But if you think this would be better I will introduce new instructions definition.

Ping.
Matt, what do you think about latest changes in reivew?

At the very least there needs to be a verifier check for the tied operands if this is set. With the separate tied opcodes you get that for free

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
999 ↗(On Diff #121268)

Extra newline

SamWot updated this revision to Diff 124747.Nov 29 2017, 7:12 AM

Added verification for tied register for UNUSED_PRESERVE

arsenm added inline comments.Nov 29 2017, 3:05 PM
lib/Target/AMDGPU/SIInstrInfo.cpp
2702–2705 ↗(On Diff #124747)

Could use some of the stricter checks the normal verifier has, i.e.

else if (TargetRegisterInfo::isPhysicalRegister(MOTied.getReg()) &&
         MO->getReg() != MOTied.getReg())
SamWot updated this revision to Diff 124919.Nov 30 2017, 5:13 AM
Stronger verification for UNUSED_PRESERVE
arsenm accepted this revision.Dec 1 2017, 1:51 PM

LGTM

This revision is now accepted and ready to land.Dec 1 2017, 1:51 PM
This revision was automatically updated to reflect the committed changes.