This patch does the following optimization.
Rx + Ry * 6 => (SH1ADD (SH2ADD Rx, Ry), Ry)
Rx + Ry * 10 => (SH1ADD (SH3ADD Rx, Ry), Ry)
Rx + Ry * 12 => (SH2ADD (SH3ADD Rx, Ry), Ry)
Differential D104436
[RISCV] Optimize mul-add in the zba extension with SH*ADD benshi001 on Jun 16 2021, 8:14 PM. Authored by
Details
This patch does the following optimization. Rx + Ry * 6 => (SH1ADD (SH2ADD Rx, Ry), Ry)
Diff Detail Event TimelineComment Actions The SH1ADDUW/SH2ADDUW/SH3ADDUW are different, and my previous test cases do not patch. I will try SH1ADDUW/SH2ADDUW/SH3ADDUW next week, and you are appreciated to review current optimization with SH1ADD/SH2ADD/SH3ADD. Comment Actions It seems SH1/2/3ADDUW are quite different and are not simply 32-bit SH1/2/3ADD on RV64, they can not be used to optimize mul-add. Comment Actions There's a more generic optimization hiding here. Could we teach decomposeMulByConstant to emit (shl (sh1add X, X), C) to handle any constant of the form (3 << C). Similar for (shl (sh2add X, X)) to handle (5 << C), and (shl (sh3add X, X)) to handle (9 << C). If the multiply happens to be used by an add the existing patterns would combine the ADD and the SHL when possible. If you want to try that as a followup. I'd suggest using the sequence you'd get from that instead. So your isel patterns would be def : Pat<(add (mul GPR:$rs1, (XLenVT 6)), GPR:$rs2), (SH1ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; def : Pat<(add (mul GPR:$rs1, (XLenVT 10)), GPR:$rs2), (SH1ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; def : Pat<(add (mul GPR:$rs1, (XLenVT 12)), GPR:$rs2), (SH2ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; And you can add these additional cases. def : Pat<(add (mul GPR:$rs1, (XLenVT 24)), GPR:$rs2), (SH3ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; def : Pat<(add (mul GPR:$rs1, (XLenVT 20)), GPR:$rs2), (SH2ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; def : Pat<(add (mul GPR:$rs1, (XLenVT 40)), GPR:$rs2), (SH3ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; def : Pat<(add (mul GPR:$rs1, (XLenVT 18)), GPR:$rs2), (SH1ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; def : Pat<(add (mul GPR:$rs1, (XLenVT 36)), GPR:$rs2), (SH2ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; def : Pat<(add (mul GPR:$rs1, (XLenVT 72)), GPR:$rs2), (SH3ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>; X86 does basically the same optimization using LEA which is like our SHNADD. https://godbolt.org/z/e8PTT3oTo
Comment Actions Thanks for teaching me so much skills. It seems I should submit another patch first, which contains tests for all above ways. |
What if the multiply has an additional user that isn't the add?