First iteration of the SDWA peephole pass.
This pass tries to combine several instructions into one SDWA instruction. E.g. it converts:
V_LSHRREV_B32_e32 %vreg0, 16, %vreg1
V_ADD_I32_e32 %vreg2, %vreg0, %vreg3
V_LSHLREV_B32_e32 %vreg4, 16, %vreg2
Into:
V_ADD_I32_sdwa %vreg4, %vreg1, %vreg3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
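To see why the two forms above compute the same value, here is a minimal arithmetic sketch in Python. It is only a model, not the pass itself: the function names are hypothetical, registers are treated as unsigned 32-bit integers, the carry-out of the add is ignored, and dst_unused:UNUSED_PAD is assumed to zero-fill the bits of the destination that are not selected.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values


def three_instruction_sequence(vreg1, vreg3):
    """Model of the original shift / add / shift sequence."""
    vreg0 = (vreg1 >> 16) & MASK32    # V_LSHRREV_B32_e32 %vreg0, 16, %vreg1
    vreg2 = (vreg0 + vreg3) & MASK32  # V_ADD_I32_e32 %vreg2, %vreg0, %vreg3 (carry ignored)
    vreg4 = (vreg2 << 16) & MASK32    # V_LSHLREV_B32_e32 %vreg4, 16, %vreg2
    return vreg4


def sdwa_add(vreg1, vreg3):
    """Model of the single SDWA add (assumed semantics, see lead-in)."""
    src0 = (vreg1 >> 16) & 0xFFFF     # src0_sel:WORD_1 reads the upper 16 bits
    src1 = vreg3 & MASK32             # src1_sel:DWORD reads the full register
    result = (src0 + src1) & MASK32
    # dst_sel:WORD_1 writes the low 16 bits of the result into the upper word;
    # dst_unused:UNUSED_PAD is assumed to zero the remaining bits.
    return (result & 0xFFFF) << 16


# Both models agree on a few sample inputs.
for a, b in [(0x12345678, 0x0000FFFF), (0xFFFF0000, 1), (0, 0), (0xABCD1234, 0xDEADBEEF)]:
    assert three_instruction_sequence(a, b) == sdwa_add(a, b)
```

The final left shift discards everything above bit 15 of the sum, which is why selecting only the low word of the result in the SDWA form loses nothing.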
Added the --amdgpu-sdwa-peephole command-line option, which enables this pass. The pass is disabled by default.
Pass structure:
- Iterate over the machine instructions in a basic block and try to apply "SDWA patterns" to each of them. An SDWA pattern matches a machine instruction to either a source or a destination SDWA operand. E.g. V_LSHRREV_B32_e32 %vreg0, 16, %vreg1 is matched to the source SDWA operand %vreg1 src_sel:WORD_1.
- Iterate over the SDWA operands found and find the instructions that could potentially be converted into SDWA. E.g. for a source SDWA operand, the potential instructions are all instructions in this basic block that use %vreg0.
- Iterate over all potential instructions and check if they can be converted into SDWA.
- Convert instructions to SDWA.
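The four steps above can be sketched with a toy model in Python. This is an illustration only, not the implementation in this review: the Instr class, the SDWA_CAPABLE set and all function names are hypothetical, only the shift-right-by-16 source pattern is modelled, and the now-dead shift instruction is not removed.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    opcode: str
    dst: str
    srcs: list                                # register names or integer immediates
    sdwa: dict = field(default_factory=dict)  # SDWA modifiers, filled in on conversion


# Opcodes this toy model treats as convertible (an assumption for illustration).
SDWA_CAPABLE = {"V_ADD_I32_e32"}


def match_src_operands(block):
    """Step 1: match 'shift right by 16' into a source SDWA operand."""
    found = {}
    for mi in block:
        if mi.opcode == "V_LSHRREV_B32_e32" and mi.srcs[0] == 16:
            # Uses of mi.dst can read WORD_1 of the shifted register instead.
            found[mi.dst] = (mi.srcs[1], "WORD_1")
    return found


def find_potential(block, operands):
    """Step 2: instructions in the block that use a matched register."""
    return [mi for mi in block
            if any(isinstance(s, str) and s in operands for s in mi.srcs)]


def convert(block):
    operands = match_src_operands(block)
    for mi in find_potential(block, operands):
        if mi.opcode not in SDWA_CAPABLE:     # Step 3: can it be converted?
            continue
        # Step 4: rewrite the opcode and fold the matched operands in.
        mi.opcode = mi.opcode[:-len("_e32")] + "_sdwa"
        for i, s in enumerate(mi.srcs):
            if isinstance(s, str) and s in operands:
                mi.srcs[i], mi.sdwa[f"src{i}_sel"] = operands[s]
            else:
                mi.sdwa[f"src{i}_sel"] = "DWORD"
    return block


block = convert([
    Instr("V_LSHRREV_B32_e32", "%vreg0", [16, "%vreg1"]),
    Instr("V_ADD_I32_e32", "%vreg2", ["%vreg0", "%vreg3"]),
])
print(block[1])
```

Keeping matching (step 1) separate from conversion (steps 3 and 4) means a new pattern only has to describe which SDWA operand it produces, not which instructions may consume it.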
This review contains a basic implementation of the SDWA peephole pass. The pass requires additional testing for both correctness and performance (no performance testing has been done yet).
There are several ways this pass can be improved:
- Make this pass work on a whole function, not only a basic block. As far as I can see, this can be done right now without changes to the pass.
- Introduce more SDWA patterns
- Introduce mnemonics to limit when SDWA patterns should apply
Are there plans to enable it?