SDWA instructions support several values of dst_unused operand. One of this is UNUSED_PRESERVE. This value means that parts of destination register that are not wrote by SDWA instruction would not be changed. Currently SDWA peephole pass doesn't generate UNUSED_PRESERVE. It only generates UNUSED_PAD value that means that unused parts of dst register would be set to 0.
Big problem with UNUSED_PRESERVE is that by its nature it can't be represented in SSA form. PRESERVE assumes that register that it writes into was already wrote by some other instruction and our SDWA instruction keeps this value intact.
Another problem is that in AMDGPU backend smallest sub-reg is 32-bit wide and SDWA needs smaller so support for PRESERVE can't be done with subregs.
For those reasons support for PRESERVE for split into 2 major parts. First - changes in SDWA peephole pass that allows it to recognize patterns for PRESERVE and generate according instruction. This pass works on SSA machine code and generates SSA compatible code. Second part - new pass that works on non-SSA code and converts code generated by SDWA peephole into correct code.
- Changes in SDWA peephole:
There were several changes in SDWA peephole.
a. First of all there was added new pattern to match for PRESERVE operand. This patterns looks for V_OR_B32 instruction with one of operands that is result of SDWA instruction. Second operand of V_OR_B32 should be instruction that is compatible bit-wise with SDWA instruction (there destination don't overlap). E.g. match:
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src1_sel:WORD_1 src2_sel:WORD_1 v_add_f16_e32 v3, v1, v2 v_or_b32_e32 v4, v0, v3
Into: SDWA preserve dst:v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE preserve:v3
Then this mathced SDWA preserve pattern is converted into SDWA with preserve. During conversion V_OR_B#@ instruction is replaced by SDWA instruction with dst_unused set to UNUSED_PRESERVE. Original instruction is removed. And new instruction gets additional implicit use-operand which is destination of second operand of V_OR_B32 (register that should be preserved):
v_add_f16_e32 v3, v1, v2 v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3
Problem with this match process is that currently it only works if both instructions were SDWA instructions. Reason is that to be able to match to instructions we should check that those two instructions are compatible to match - meaning that they write different parts of destination register. But currently there is no way to determine if regular instruction writes not whole destination register. E.g. we can't understand that V_ADD_F16 only writes low 16-bit of destination and high 16-bit are irrelevant. This can be determined only for SDWA instruction by looking into dst_sel operand. So for now this pattern only metch 2 SDWA instructions. Ability to match regular instructions would be added later.
b. Second big change in SDWA peephole pass is that now it tries to apply match patterns several times until it can't convert any new instruction. This is needed because (as said earlier) PRESERVE pattern need to match SDWA instructions but SDWA instruction apear (in most cases) only after SDWA peephole. So to be able to match PRESERVE pattern we first apply all other patterns that generate regular SDWA instructions and then on second try we apply PRESERVE pattern to SDWA instructions generated on first try.
- New pass - merge SDWA preserve pass:
This pass is needed to convert SSA code generated by SDWA peephole pass into non-SSA correct code. It works after PHI-elimination pass where it is possible to generate non-SSA code.
This pass looks for SDWA instructions with dst_unused set to UNUSED_PRESERVE. In those instructions it looks for implicit register operand (which is added by SDWA peephole pass). This register is the one that should be reserved. Id such instruction is found then this pass changes destination register of this SDWA instruction to implicit register and creates copy from implicit register to original destination of SDWA instruction. E.g. instruction generated by SDWA peephole:
v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3
Would be converted into:
v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3 v_mov_b32 v4, v3
Putting it all together original sequence of instructions:
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src1_sel:WORD_1 src2_sel:WORD_1 v_add_f16_e32 v3, v1, v2 v_or_b32_e32 v4, v0, v3
Would be converted into:
v_add_f16_e32 v3, v1, v2 v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3 v_mov_b32 v4, v3
I do not think we need it under fast RA.