Optimize bitfield extractions retaining bit positions
from lu12i + addi + and to bstrpick + slli.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
287–288 | This is not optimized, since 0xff000 can be composed by a single lu12i.w. | |
325–326 | This is not optimized, since 0xff000 can be composed by a single lu12i.w. | |
328–329 | This is not optimized, since the constant is used twice. | |
384 | This is not optimized, since the value -31 can be composed by a single ADDI.W. |
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
384 | This is not optimized, since the value -2048 can be composed by a single ADDI.W. |
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
319–321 | Seems wrong? The intended operation is to retain only a[15:4] so we should bstrpick.d $a0, $a0, 15, 4 to retain bits, then slli.d $a0, $a0, 4 to restore the LSB position. (LoongArch bstrpick invariably stores to rd's LSB side, and will not retain the original relative position.) |
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
319–321 | Thanks. I should not make such as mistake :( |
Good catch, thanks!
IMO you could include as comments your explanations to existing cases where this optimization has not taken place (e.g. "This is not optimized into bstrpick + slli because the constant has multiple uses."), so future readers wouldn't have to do archaeology to see them. The code changes LGTM.
I've just found there's no commit message. In general it could be helpful to include one or two short sentences describing the actual change, e.g. "Optimize bitfield extractions retaining bit positions with bstrpick + slli", because "optimize foo" otherwise doesn't carry useful information about the actual improvements.
llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp | ||
---|---|---|
1386 | nit: "constant" |
llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp | ||
---|---|---|
1386 | I have fixed this typo in my local repo, and will be correct when committing. Thanks! |
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
339–344 | It's also a win if optimized to bstrpick + slli since we can save a register $a2. Right? ; LA64-NEXT: bstrpick.d $a0, $a0, 15, 4 ; LA64-NEXT: slli.d $a0, $a0, 4 ; LA64-NEXT: bstrpick.d $a1, $a1, 15, 4 ; LA64-NEXT: slli.d $a1, $a1, 4 ; LA64-NEXT: mul.d $a0, $a0, $a1 But if the immediate 0xfff0 is used 3 times, we have 2 choices:
I'm not sure how to balance this. |
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
339–344 | I am also not sure about the case of more then 2 uses. Maybe we make a convervative choice, just make it unchanged? And we only handle 1 and 2 uses ? |
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
339–344 | I agree with you. |
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
339–344 | I have updated my code
|
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
392 | This is the case the immediate has 3 uses, which is not optimized. |
LGTM from a cursory look, thanks! Although putting your informative replies into the code as comments would be even better, as others would see the rationale along with the code and not have to do archaeology themselves.
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll | ||
---|---|---|
339–344 | Fine with me. If we were to aim for the maximum performance possible then that'll have to be backed with actual micro-architecture details, so the optimizer can do the right thing for the right models. Less instructions executed usually can't hurt after all. (In my Go benchmarking experiences, loop alignment could have far more profound influence on the numbers than micro-optimization like this, so it's probably fine to not care as much here.) |
I have supplemented llvm/test/CodeGen/LoongArch/alloca.ll and other affected tests.
I am working in multiple threads mode, sorry for so many typos and spots. :(
Thanks for your suggestion. I have added some informative comments to both c++ code and the test code.
Ah. I actually meant putting the justification for the "> 2 uses" (being conservative when faced with register pressure vs. instruction count blah blah), and it's actually fine to not comment what you did when the code is self-documenting. In general just document why and not what you do for a piece of code that may warrant such explanation.
I have commented in the c++ code as follow
// Omit if the constant has more than 2 uses. This a conservative // decision. Whether it is a win depends on the HW microarchitecture. // However it should always be better for 1 and 2 uses. if (CN->use_size() > 2) return SDValue();
nit: "constant"