This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Prevent regression in isMulAddWithConstProfitable
AbandonedPublic

Authored by benshi001 on Jul 3 2020, 9:24 PM.

Details

Summary

Prevent folding (mul (add x, c1), c2) to (add (mul x, c2), c1*c2),
if c1 can be directly encoded in an instruction code, while c1*c2
has to be materialized into a register.
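
For illustration, a minimal sketch of the intended check is below. The helper name, its parameters, and its exact placement are assumptions made for this example; they are not the actual patch.

#include "llvm/ADT/APInt.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetLowering.h"

using namespace llvm;

// Illustrative only: return false (do not fold) when c1 fits an add
// immediate but c1*c2 would have to be materialized into a register.
static bool isAddMulFoldProfitable(const TargetLowering &TLI,
                                   const ConstantSDNode *C1Node,
                                   const ConstantSDNode *C2Node,
                                   unsigned Bits) {
  if (Bits > 64)
    return true; // Too wide for this heuristic; keep the default fold.
  const APInt &C1 = C1Node->getAPIntValue();
  APInt C1C2 = C1 * C2Node->getAPIntValue(); // Wraps at the value width.
  if (TLI.isLegalAddImmediate(C1.getSExtValue()) &&
      !TLI.isLegalAddImmediate(C1C2.getSExtValue()))
    return false; // c1 is encodable, c1*c2 is not: folding regresses.
  return true;
}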

Diff Detail

Event Timeline

benshi001 created this revision.Jul 3 2020, 9:24 PM
benshi001 added a comment.EditedJul 3 2020, 9:40 PM

My patch generates better code for at least x86, aarch64 and riscv.

For x86's test urem-seteq-nonzero.ll, fewer instructions are emitted.

For aarch64's test urem-seteq-nonzero.ll,

  1. two cases have one more instruction emitted,
  2. two other cases have one fewer instruction emitted,
  3. the other nine cases have the same instruction count, but with madd replaced by mul.

Since madd has higher latency than mul, I think my change is a net win for aarch64 as well.

There was no previous test case for RISCV, but my change also improves RISCV.
For example,

%tmp0 = add i32 %x, 1971
%tmp1 = mul i32 %tmp0, 19

The original llvm generates,

addi    a1, zero, 19
mul     a0, a0, a1
lui     a1, 9
addi    a1, a1, 585
add     a0, a0, a1

And my patch optimizes it to

addi    a0, a0, 1971
addi    a1, zero, 19
mul     a0, a0, a1

The build failures are not related to my changes; today's llvm master branch also has them after running "make check-all".

benshi001 edited the summary of this revision. (Show Details)Jul 4 2020, 1:25 AM
lebedev.ri added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15822–15826

You also need a second piece of the puzzle, because we would have
already performed this transform before ever getting here:
https://godbolt.org/z/TE_bzk

You need to add an inverse transform into DAGCombiner.cpp.
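
As a rough illustration of what such an inverse transform could look like, here is a sketch; the helper name, its placement, and the profitability test are assumptions for this example, not the change being requested.

#include "llvm/ADT/APInt.h"
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetLowering.h"

using namespace llvm;

// Illustrative only: rewrite (add (mul X, C2), CSum) back into
// (mul (add X, CSum/C2), C2) when CSum is an exact multiple of C2,
// CSum/C2 is a legal add immediate, and CSum itself is not.
static SDValue foldAddOfMulToMulOfAdd(SDNode *N, SelectionDAG &DAG,
                                      const TargetLowering &TLI) {
  EVT VT = N->getValueType(0);
  if (VT.getScalarSizeInBits() > 64)
    return SDValue(); // Keep the sketch simple for wide types.
  SDValue Mul = N->getOperand(0);
  auto *CSum = dyn_cast<ConstantSDNode>(N->getOperand(1));
  if (!CSum || Mul.getOpcode() != ISD::MUL)
    return SDValue();
  auto *C2 = dyn_cast<ConstantSDNode>(Mul.getOperand(1));
  if (!C2 || C2->getAPIntValue() == 0)
    return SDValue();
  APInt Quot, Rem;
  APInt::sdivrem(CSum->getAPIntValue(), C2->getAPIntValue(), Quot, Rem);
  if (Rem != 0)
    return SDValue(); // CSum is not an exact multiple of C2.
  if (!TLI.isLegalAddImmediate(Quot.getSExtValue()) ||
      TLI.isLegalAddImmediate(CSum->getSExtValue()))
    return SDValue(); // Only act when the small constant is encodable.
  SDLoc DL(N);
  SDValue NewAdd = DAG.getNode(ISD::ADD, DL, VT, Mul.getOperand(0),
                               DAG.getConstant(Quot, DL, VT));
  return DAG.getNode(ISD::MUL, DL, VT, NewAdd, Mul.getOperand(1));
}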

For these kinds of patches where you add new tests which show a difference in code quality it's helpful if you can split the tests into a separate patch. You can then add that other patch as a dependency for this one, creating a Stack in Phabricator. That way it's easier to see the differences in the generated code.

spatel added inline comments.Jul 4 2020, 8:07 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817–15820

How do we know these constants are not wider than 64-bits?

15822–15826

Right - I asked if the motivation for this was GEP math:
http://lists.llvm.org/pipermail/llvm-dev/2020-July/142979.html
...but I did not see an answer.

If yes, then we should have GEP tests. If not, then this patch isn't enough.

benshi001 marked 3 inline comments as done.Jul 4 2020, 9:12 AM
benshi001 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817–15820

In the real world, no machine will encode a large immediate directly into an instruction. Even x86 only supports a legal add immediate up to 32 bits, and arm & riscv only up to 12 bits. So I think using int64_t here is OK. If the real value exceeds that, the DAG falls back to the normal path and gets transformed.

What's your opinion?

15822–15826

No. I am targeting the normal IR transform, not GEP.

So I need to do an inverse transform?

benshi001 marked an inline comment as done.Jul 4 2020, 9:37 AM
benshi001 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15822–15826

Where is the other place that performs the same transform before we reach here?

lebedev.ri added inline comments.Jul 4 2020, 9:46 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817–15820

In the real world, no machine will encode a large immediate directly into an instruction. Even x86 only supports a legal add immediate up to 32 bits, and arm & riscv only up to 12 bits.

I agree that isLegalAddImmediate() takes int64_t, but that is orthogonal to the question.

The question being: when doing int64_t C1 = C1Node->getSExtValue();,
is it guaranteed that the APInt value in C1Node is at most 64 bits wide?
If not, this code will crash with an assertion failure.
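
For illustration, a guard of the kind being discussed could look like the hypothetical helper below (not code from the patch).

#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// getSExtValue() asserts when the underlying APInt does not fit in a
// signed 64-bit integer, so check before extracting the value.
static bool fitsInInt64(const ConstantSDNode *C) {
  return C->getAPIntValue().isSignedIntN(64);
}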

benshi001 updated this revision to Diff 275545.Jul 5 2020, 4:54 AM

Thanks for all of your suggestions. I have uploaded a new revision of the patch.

  1. Separate the test cases that show the improvement into another patch.

Done. https://reviews.llvm.org/D83159

  2. Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.

Done. One more if-statement is added to check that.

  3. Add an inverse transform for when "opt -instcombine" has been performed.

Shall we separate this inverse transform into another patch? At least this patch works
for "clang a.c -O2 -Wall --target=x86/riscv/aarch64" when a.c contains a code pattern like
"(a + 999) * 888", and it prevents the regression in that circumstance.

benshi001 marked 2 inline comments as done.Jul 5 2020, 4:57 AM
MaskRay added inline comments.Jul 5 2020, 6:12 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817

This should be >=

15825

C1 * C2 may trigger signed overflow. (C1 * C2) may also be negative, and left-shifting a negative signed integer is UB before C++20.

Please use llvm::SignExtend64(uint64_t, unsigned)
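
A minimal sketch of that suggestion, with variable names assumed for illustration:

#include "llvm/Support/MathExtras.h"
#include <cstdint>

// Multiply in the unsigned domain (wrap-around instead of UB), then
// sign-extend the low Bits bits back to a signed value. Bits is assumed
// to be in the range 1..64, as SignExtend64 requires.
static int64_t mulAndSignExtend(int64_t C1, int64_t C2, unsigned Bits) {
  return llvm::SignExtend64((uint64_t)C1 * (uint64_t)C2, Bits);
}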

benshi001 marked an inline comment as done.Jul 6 2020, 1:15 AM
benshi001 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817

This should be >, not >=; otherwise riscv64/aarch64 would still regress for i64 add-mul.

benshi001 updated this revision to Diff 275616.Jul 6 2020, 2:28 AM
benshi001 edited the summary of this revision. (Show Details)

One more change:

Add a check for whether c1*c2 overflows.

benshi001 marked an inline comment as done.Jul 6 2020, 2:30 AM

Change list according to all your comments.

  1. Separate the test cases that show the improvement into another patch.

Done. https://reviews.llvm.org/D83159

  2. Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.

Done. One more if-statement is added to check that.
(the condition should be >, not >=, otherwise riscv64 cannot be optimized)

  3. Check whether c1*c2 overflows.

Done. One more if-statement is added for that.

  4. Add an inverse transform for when "opt -instcombine" has been performed.

Shall we separate this inverse transform into another patch? At least this patch improves
the test case urem-seteq-nonzero.ll and the case in https://reviews.llvm.org/D83159.

If it's not too much trouble, I've added D83159 as a parent patch, which contains the tests for this change in the RISC-V backend.

I realise this shouldn't block changes to the other targets, but given the transform is target-independent, it makes sense to see how it affects more targets if possible.

Generally looks good; can you also add a test which would trigger an overflow with your previous revision?

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15817

Nit: const unsigned Bits

15824

// Prevent the transform if c1*c2 overflows.

15825

ConstNode.getScalarValueSizeInBits() -> Bits

15828

Store C1 * C2 in a variable. Please don't repeat the multiplication.

benshi001 updated this revision to Diff 276023.Jul 7 2020, 5:48 AM
benshi001 edited the summary of this revision. (Show Details)
benshi001 marked 4 inline comments as done.Jul 7 2020, 5:53 AM

Change list according to all your comments.

  1. Separate the test cases that show the improvement into another patch.

Done. https://reviews.llvm.org/D83159

  2. Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.

Done. One more if-statement is added to check that.

  3. Check whether c1*c2 overflows.

Done. One more if-statement is added for that.

  4. Add a new test case that triggers the overflow check.

I will do that in https://reviews.llvm.org/D83159

  5. Add an inverse transform for when "opt -instcombine" has been performed.

Shall we separate this inverse transform into another patch? At least this patch improves
the test case urem-seteq-nonzero.ll and the case in https://reviews.llvm.org/D83159.

benshi001 updated this revision to Diff 277244.Jul 11 2020, 8:49 AM
benshi001 edited the summary of this revision. (Show Details)

Change list according to all your comments.

  1. Separate the test cases that show the improvement into another patch.

Done. https://reviews.llvm.org/D83159, which has been landed.

  2. Make sure c1 and c2 do not exceed int64, to avoid an assertion failure.

Done. One more if-statement is added to check that.

  3. Check whether c1*c2 overflows.

If we stop the transform when c1*c2 overflows, x86 will be impacted a lot,
and I am afraid of introducing more regressions.

  4. Add an inverse transform for when "opt -instcombine" has been performed.

Shall we separate this inverse transform into another patch? At least this patch improves
the test case urem-seteq-nonzero.ll and the case in llvm/test/CodeGen/RISCV/addimm-mulimm.ll.

  5. Some other small fixes.
benshi001 added a comment.EditedJul 11 2020, 8:59 AM

Conclusion:

  1. RISCV is improved.
  2. X86 is slightly improved.
  3. For aarch64's test urem-seteq-nonzero.ll:

3.1. two cases have one more instruction emitted,
3.2. two other cases have one fewer instruction emitted,
3.3. nine other cases have the same instruction count, but with madd replaced by mul.
Since madd has higher latency than mul, I think my change is a net win for aarch64 as well.

benshi001 updated this revision to Diff 279074.Jul 19 2020, 5:40 AM
spatel added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15822–15826

The general transform is happening in IR in instcombine, so for example the RISC-V tests are not canonical IR. That means we will not see the pattern that you are expecting for "mul (add X, C1), C2" in the usual case when compiling from source code. You can test that by compiling from source code to IR (and on to asm) for something like this:

$ cat mad.c
#include <stdint.h>
uint64_t f(uint64_t x) {
  return (x + 424242) * 15700;
}
$ clang -O1 mad.c -S -o - -emit-llvm
define i64 @f(i64 %x) {
  %t0 = mul i64 %x, 15700
  %mul = add i64 %t0, 6660599400
  ret i64 %mul
}
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–11

I think we need to hear from someone with more AArch64 knowledge about whether this is an improvement or at least acceptable. cc @dmgreen @efriedma @fhahn

llvm/test/CodeGen/RISCV/addimm-mulimm.ll
74

Why is this test deleted?

efriedma added inline comments.Jul 22 2020, 11:04 AM
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–11

The new code is worse; madd is essentially the same cost as mul, so the new code has one more arithmetic instruction. Better to spend an extra instruction materializing a constant, particularly if the code is in a loop.

benshi001 marked 3 inline comments as done.Jul 22 2020, 7:18 PM
benshi001 added inline comments.
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–11

But there are also other cases where the instruction count is reduced (see lines 156-162 and lines 176-180). I think it is an improvement in total, though worse in specific cases.

llvm/test/CodeGen/RISCV/addimm-mulimm.ll
74

This code was also added by me and was expected to be affected by this patch. It actually is not, so I removed it.

benshi001 marked 2 inline comments as done.Jul 22 2020, 7:30 PM
benshi001 added inline comments.
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–11

Many aarch64 cases are affected; some became better while some became worse, but in total they are better. What's your opinion?

efriedma added inline comments.Jul 23 2020, 4:54 PM
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–11

156-162 is fewer instructions, but it's higher latency, and the current heuristic doesn't really have any way to usefully predict the result here.

I'd prefer to keep this transform off for AArch64.

benshi001 abandoned this revision.Jul 23 2020, 6:16 PM
benshi001 marked an inline comment as done.
benshi001 added inline comments.
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
7–11

I see. Thanks.

I also have other patches for ARM, and I would appreciate your help reviewing them.