This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Optimize mul-add in the zba extension with SH*ADD
AbandonedPublic

Authored by benshi001 on Jun 16 2021, 8:14 PM.

Details

Summary

This patch does the following optimization.

Rx + Ry * 6 => (SH1ADD (SH2ADD Rx, Ry), Ry)
Rx + Ry * 10 => (SH1ADD (SH3ADD Rx, Ry), Ry)
Rx + Ry * 12 => (SH2ADD (SH3ADD Rx, Ry), Ry)

Diff Detail

Event Timeline

benshi001 created this revision.Jun 16 2021, 8:14 PM
benshi001 requested review of this revision.Jun 16 2021, 8:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2021, 8:14 PM
benshi001 updated this revision to Diff 352619.Jun 16 2021, 8:25 PM

The SH1ADDUW/SH2ADDUW/SH3ADDUW are different, and my previous test cases do not patch.

I will try SH1ADDUW/SH2ADDUW/SH3ADDUW next week, and you are appreciated to review current optimization with SH1ADD/SH2ADD/SH3ADD.

The SH1ADDUW/SH2ADDUW/SH3ADDUW are different, and my previous test cases do not patch.

I will try SH1ADDUW/SH2ADDUW/SH3ADDUW next week, and you are appreciated to review current optimization with SH1ADD/SH2ADD/SH3ADD.

It seems SH1/2/3ADDUW are quite different and are not simply 32-bit SH1/2/3ADD on RV64, they can not be used to optimize mul-add.

craig.topper retitled this revision from [RISCV] Optimize mul-add in the zbs extension with SH*ADD to [RISCV] Optimize mul-add in the zba extension with SH*ADD.Jun 17 2021, 4:15 PM

There's a more generic optimization hiding here. Could we teach decomposeMulByConstant to emit (shl (sh1add X, X), C) to handle any constant of the form (3 << C). Similar for (shl (sh2add X, X)) to handle (5 << C), and (shl (sh3add X, X)) to handle (9 << C). If the multiply happens to be used by an add the existing patterns would combine the ADD and the SHL when possible.

If you want to try that as a followup. I'd suggest using the sequence you'd get from that instead. So your isel patterns would be

def : Pat<(add (mul GPR:$rs1, (XLenVT 6)), GPR:$rs2),
          (SH1ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 10)), GPR:$rs2),
          (SH1ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 12)), GPR:$rs2),
          (SH2ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;

And you can add these additional cases.

def : Pat<(add (mul GPR:$rs1, (XLenVT 24)), GPR:$rs2),
          (SH3ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 20)), GPR:$rs2),
          (SH2ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 40)), GPR:$rs2),
          (SH3ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 18)), GPR:$rs2),
          (SH1ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 36)), GPR:$rs2),
          (SH2ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 72)), GPR:$rs2),
          (SH3ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;

X86 does basically the same optimization using LEA which is like our SHNADD. https://godbolt.org/z/e8PTT3oTo

llvm/lib/Target/RISCV/RISCVInstrInfoB.td
974

What if the multiply has an additional user that isn't the add?

There's a more generic optimization hiding here. Could we teach decomposeMulByConstant to emit (shl (sh1add X, X), C) to handle any constant of the form (3 << C). Similar for (shl (sh2add X, X)) to handle (5 << C), and (shl (sh3add X, X)) to handle (9 << C). If the multiply happens to be used by an add the existing patterns would combine the ADD and the SHL when possible.

If you want to try that as a followup. I'd suggest using the sequence you'd get from that instead. So your isel patterns would be

def : Pat<(add (mul GPR:$rs1, (XLenVT 6)), GPR:$rs2),
          (SH1ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 10)), GPR:$rs2),
          (SH1ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 12)), GPR:$rs2),
          (SH2ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;

And you can add these additional cases.

def : Pat<(add (mul GPR:$rs1, (XLenVT 24)), GPR:$rs2),
          (SH3ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 20)), GPR:$rs2),
          (SH2ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 40)), GPR:$rs2),
          (SH3ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 18)), GPR:$rs2),
          (SH1ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 36)), GPR:$rs2),
          (SH2ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 72)), GPR:$rs2),
          (SH3ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;

X86 does basically the same optimization using LEA which is like our SHNADD. https://godbolt.org/z/e8PTT3oTo

Thanks for teaching me so much skills. It seems I should submit another patch first, which contains tests for all above ways.

There's a more generic optimization hiding here. Could we teach decomposeMulByConstant to emit (shl (sh1add X, X), C) to handle any constant of the form (3 << C). Similar for (shl (sh2add X, X)) to handle (5 << C), and (shl (sh3add X, X)) to handle (9 << C). If the multiply happens to be used by an add the existing patterns would combine the ADD and the SHL when possible.

If you want to try that as a followup. I'd suggest using the sequence you'd get from that instead. So your isel patterns would be

def : Pat<(add (mul GPR:$rs1, (XLenVT 6)), GPR:$rs2),
          (SH1ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 10)), GPR:$rs2),
          (SH1ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 12)), GPR:$rs2),
          (SH2ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;

And you can add these additional cases.

def : Pat<(add (mul GPR:$rs1, (XLenVT 24)), GPR:$rs2),
          (SH3ADD (SH1ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 20)), GPR:$rs2),
          (SH2ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 40)), GPR:$rs2),
          (SH3ADD (SH2ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 18)), GPR:$rs2),
          (SH1ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 36)), GPR:$rs2),
          (SH2ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;
def : Pat<(add (mul GPR:$rs1, (XLenVT 72)), GPR:$rs2),
          (SH3ADD (SH3ADD GPR:$rs1, GPR:$rs1), GPR:$rs2)>;

X86 does basically the same optimization using LEA which is like our SHNADD. https://godbolt.org/z/e8PTT3oTo

Thanks for teaching me so much skills. It seems I should submit another patch first, which contains tests for all above ways.

I have added new tests in https://reviews.llvm.org/D104507

benshi001 abandoned this revision.Jun 18 2021, 8:06 PM