Prevent folding (mul (add x, c1), c2) to (add (mul x, c2), c1*c2)
when c1 can be directly encoded in an instruction, while c1*c2
would have to be materialized into a register.
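For a rough picture of the heuristic, here is a minimal standalone sketch. It is not the patch's DAGCombiner code: isLegalAddImm below is a stand-in for the target's isLegalAddImmediate() hook, assuming RISC-V-style 12-bit signed add immediates.

    #include <cstdint>
    #include <iostream>

    // Stand-in for the target's isLegalAddImmediate() hook; here assuming a
    // RISC-V-style addi with a 12-bit signed immediate.
    static bool isLegalAddImm(int64_t Imm) {
      return Imm >= -2048 && Imm <= 2047;
    }

    // Decide whether folding (mul (add x, c1), c2) into (add (mul x, c2), c1*c2)
    // is worthwhile: skip it when c1 is directly encodable but c1*c2 would have
    // to be materialized into a register.
    static bool shouldFoldAddMul(int64_t C1, int64_t C2) {
      return !(isLegalAddImm(C1) && !isLegalAddImm(C1 * C2));
    }

    int main() {
      // (x + 1971) * 19: 1971 fits in 12 bits but 1971*19 = 37449 does not,
      // so the fold is skipped and the cheap addi survives.
      std::cout << shouldFoldAddMul(1971, 19) << '\n'; // 0
      // (x + 3) * 5: both 3 and 15 are encodable, so folding is fine.
      std::cout << shouldFoldAddMul(3, 5) << '\n';     // 1
    }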
Event Timeline
My patch at least generates better code for x86, AArch64 and RISC-V.
For x86's test urem-seteq-nonzero.ll, fewer instructions are emitted.
For AArch64's test urem-seteq-nonzero.ll,
- 2 cases have one more instruction emitted,
- 2 other cases have one fewer instruction emitted,
- the other 9 cases have no change in instruction count, but have madd replaced by mul.
Since madd has higher latency than mul, I think my change is also a net improvement for AArch64.
There was no existing test case for RISC-V, but my change also improves RISC-V.
For example, for

    %tmp0 = add i32 %x, 1971
    %tmp1 = mul i32 %tmp0, 19

the original llvm generates

    addi a1, zero, 19
    mul  a0, a0, a1
    lui  a1, 9
    addi a1, a1, 585
    add  a0, a0, a1

and my patch optimizes it to

    addi a0, a0, 1971
    addi a1, zero, 19
    mul  a0, a0, a1
The build failures are not related to my changes; today's llvm master branch also has them after running "make check-all".
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15822–15826: You also need a second piece of the puzzle: an inverse transform needs to be added into DAGCombiner.cpp.
For these kinds of patches, where you add new tests which show a difference in code quality, it's helpful if you can split the tests into a separate patch. You can then add that other patch as a dependency for this one, creating a Stack in Phabricator. That way it's easier to see the differences in the generated code.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817–15820: How do we know these constants are not wider than 64 bits?
15822–15826: Right - I asked if the motivation for this was GEP math. If yes, then we should have GEP tests. If not, then this patch isn't enough.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817–15820: In the real world, no machine will encode a large immediate directly into an instruction. Even x86 only supports a legal add immediate up to 32 bits, and ARM & RISC-V only up to 12 bits. So I think using int64_t here is OK. If the real value exceeds that, the DAG falls to the normal path and gets transformed. What's your opinion?
15822–15826: No. I am targeting the normal IR transform, not GEP. So I need to add an inverse transform?
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15822–15826: Where is the other place that performs the same transform before we reach here?
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817–15820: I agree that isLegalAddImmediate() takes int64_t, but that is orthogonal to the question. The question being: when doing int64_t C1 = C1Node->getSExtValue();,
Thanks for all of your suggestions. I have uploaded a new revision of the patch.
- Separate the test cases that show the improvement into another patch.
Done: https://reviews.llvm.org/D83159
- Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.
Done. One more if-statement is added to check that.
- Add an inverse transform for when "opt -instcombine" has been performed.
Shall we separate this inverse transform into another patch? At least this patch works
for "clang a.c -O2 -Wall --target=x86/riscv/aarch64" if a.c contains a code pattern like
"(a + 999) * 888", and it prevents the regression in that circumstance.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817: This should be >, not >=, otherwise riscv64/aarch64 will still regress for i64 add-mul.
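To illustrate the point about > versus >=, here is a minimal standalone sketch (the helper name is made up and this is not the patch's code):

    #include <iostream>

    // Illustrative only: why the width guard should be "Bits > 64", not
    // "Bits >= 64". An i64 constant sign-extends losslessly into int64_t, so
    // rejecting Bits >= 64 would also disable the optimization for plain i64
    // add+mul on riscv64/aarch64. Only constants wider than 64 bits can hold
    // values that an int64_t cannot represent.
    static bool fitsInInt64(unsigned Bits) { return Bits <= 64; }

    int main() {
      for (unsigned Bits : {32u, 64u, 65u, 128u})
        std::cout << "i" << Bits << ": "
                  << (fitsInInt64(Bits) ? "handled" : "fall back") << '\n';
    }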
Change list according to all your comments:
- Separate the test cases that show the improvement into another patch.
Done: https://reviews.llvm.org/D83159
- Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.
Done. One more if-statement is added to check that.
(the condition should be >, not >=, otherwise riscv64 cannot be optimized)
- Check whether c1*c2 overflows.
Done. One more if-statement is added for that.
- Add an inverse transform for when "opt -instcombine" has been performed.
Shall we separate this inverse transform into another patch? At least this patch improves
the test case urem-seteq-nonzero.ll and the case in https://reviews.llvm.org/D83159.
If it's not too much trouble, I've added D83159 as a parent patch, which contains the tests for this change in the RISC-V backend.
I realise this shouldn't block changes to the other targets, but given the transform is target-independent, it makes sense to see how it affects more targets if possible.
Generally looks good. Can you also add a test which would have triggered an overflow with your previous revision?
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817: Nit: const unsigned Bits
15824: // Prevent the transform if c1*c2 overflows.
15825: ConstNode.getScalarValueSizeInBits() -> Bits
15828: Store C1 * C2 in a variable. Please don't repeat multiplication.
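As a hedged sketch of the overflow guard these comments ask for, in plain C++ rather than the APInt-based code the patch would actually use (the names here are illustrative):

    #include <cstdint>
    #include <iostream>

    // Illustrative sketch: compute c1*c2 once, store it in a variable, and
    // report whether the signed multiplication overflowed.
    static bool mulIsSafe(int64_t C1, int64_t C2, int64_t &Product) {
      // __builtin_mul_overflow (GCC/Clang) returns true on overflow.
      return !__builtin_mul_overflow(C1, C2, &Product);
    }

    int main() {
      int64_t Product;
      if (mulIsSafe(424242, 15700, Product))
        std::cout << "fold is safe, c1*c2 = " << Product << '\n'; // 6660599400
      if (!mulIsSafe(INT64_MAX, 2, Product))
        std::cout << "c1*c2 overflows, skip the transform\n";
    }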
Change list according to all your comments:
- Separate the test cases that show the improvement into another patch.
Done: https://reviews.llvm.org/D83159
- Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.
Done. One more if-statement is added to check that.
- Check whether c1*c2 overflows.
Done. One more if-statement is added for that.
- Add a new test case that triggers the overflow check.
I will do that in https://reviews.llvm.org/D83159
- Add an inverse transform for when "opt -instcombine" has been performed.
Shall we separate this inverse transform into another patch? At least this patch improves
the test case urem-seteq-nonzero.ll and the case in https://reviews.llvm.org/D83159.
Change list according to all your comments:
- Separate the test cases that show the improvement into another patch.
Done: https://reviews.llvm.org/D83159, which has been landed.
- Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.
Done. One more if-statement is added to check that.
- Check whether c1*c2 overflows.
If we stop the transform when c1*c2 overflows, x86 will be impacted a lot; I am afraid of introducing more regressions.
- Add an inverse transform for when "opt -instcombine" has been performed.
Shall we separate this inverse transform into another patch? At least this patch improves
the test case urem-seteq-nonzero.ll and the case llvm/test/CodeGen/RISCV/addimm-mulimm.ll.
- Some other small fixes.
Conclusion:
- RISC-V is improved.
- x86 is slightly improved.
- For AArch64's test urem-seteq-nonzero.ll,
  - two cases have one more instruction emitted,
  - two other cases have one fewer instruction emitted,
  - nine other cases have no change in instruction count, but have madd replaced by mul.
Since madd has higher latency than mul, I think my change is also a net improvement for AArch64.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15822–15826: The general transform is happening in IR in instcombine, so for example the RISC-V tests are not canonical IR. That means we will not see the pattern that you are expecting in the usual case for "mul (add X, C1), C2" in source code. You can test that by compiling from source code to IR (and on to asm) for something like this:

    $ cat mad.c
    #include <stdint.h>
    uint64_t f(uint64_t x) {
      return (x + 424242) * 15700;
    }
    $ clang -O1 mad.c -S -o - -emit-llvm
    define i64 @f(i64 %x) {
      %t0 = mul i64 %x, 15700
      %mul = add i64 %t0, 6660599400
      ret i64 %mul
    }

llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–10
llvm/test/CodeGen/RISCV/addimm-mulimm.ll
74: Why is this test deleted?
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–10: The new code is worse; madd is essentially the same cost as mul, so the new code has one more arithmetic instruction. Better to spend an extra instruction materializing a constant, particularly if the code is in a loop.
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–10: But there are also other cases where the instruction count is reduced, see lines 156-162 and lines 176-180. I think it is an improvement in total, though worse in specific cases.
llvm/test/CodeGen/RISCV/addimm-mulimm.ll
74: This code was also added by me and was expected to be affected by this patch. But it actually is not, so I removed it.
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–10: There are many AArch64 cases affected; some became better while some became worse, but they are better in total. What's your opinion?
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–10: 156-162 is fewer instructions, but it's higher latency, and the current heuristic doesn't really have any way to usefully predict the result here. I'd prefer to keep this transform off for AArch64.
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–10: I see. Thanks. I also have other patches for ARM; I would appreciate your help reviewing them.
This should be >=