This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Add one use limitation for (X * C2) << C1 --> X * (C2 << C1)
ClosedPublic

Authored by bcl5980 on Apr 21 2022, 9:24 AM.

Details

Summary

Follow up D123453, add one-use limitation for

(X * C2) << C1 --> X * (C2 << C1)

to make consistent with

lshr (mul nuw x, MulC), ShAmtC -> mul nuw x, (MulC >> ShAmtC)

Diff Detail

Event Timeline

bcl5980 created this revision.Apr 21 2022, 9:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 9:24 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
bcl5980 requested review of this revision.Apr 21 2022, 9:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 9:24 AM
spatel accepted this revision.Apr 21 2022, 9:28 AM

LGTM

This revision is now accepted and ready to land.Apr 21 2022, 9:28 AM
This revision was landed with ongoing or failed builds.Apr 21 2022, 9:34 AM
This revision was automatically updated to reflect the committed changes.

This is not consistent with our general folding rules.
What does "may" mean? We'd trade sequential mul-shl for two independent mul.
Former would be sequential anyway, while two mul may or may not be sequential depending on how bad the microarchitecture is.
Is there an actual performance bugreport?

Does this regress e.g. the following case: https://godbolt.org/z/aWsvzW751 ?
What about the case where the backend isn't even meant to recover, like https://godbolt.org/z/4Tbb3YG3s ?

bcl5980 added a comment.EditedApr 21 2022, 10:25 AM

This is not consistent with our general folding rules.

Can you help to explain the general folding rules? I'm a beginner so it will be grateful you can tell me the detail rules.
At least tell me what case can add one-use and what case can't add. There are many one-use in instcombine so it is important for me.

What does "may" mean? We'd trade sequential mul-shl for two independent mul.
Former would be sequential anyway, while two mul may or may not be sequential depending on how bad the microarchitecture is.

I think sequential mul-shr is also very easy to fold to two independent mul in backend but not the other way. If the microarchitecture has enough mul ports and don't care too much about power they can do this in backend.

Is there an actual performance bugreport?

No performance report. But the general IR change is also hard to get a large set perf data I think.

Does this regress e.g. the following case: https://godbolt.org/z/aWsvzW751 ?
What about the case where the backend isn't even meant to recover, like https://godbolt.org/z/4Tbb3YG3s ?

Thanks for the case you mentioned. Maybe I can add some more conditions to avoid the regression but now I think a rule for one-use is important for me.

This is not consistent with our general folding rules.

Can you help to explain the general folding rules? I'm a beginner so it will be grateful you can tell me the detail rules.
At least tell me what case can add one-use and what case can't add. There are many one-use in instcombine so it is important for me.

What does "may" mean? We'd trade sequential mul-shl for two independent mul.
Former would be sequential anyway, while two mul may or may not be sequential depending on how bad the microarchitecture is.

I think sequential mul-shr is also very easy to fold to two independent mul in backend but not the other way. If the microarch have enough mul port and don't care too much power they can do this in backend.

Is there an actual performance bugreport?

No performance report. But the general IR change is also hard to get a large set perf data I think.

Does this regress e.g. the following case: https://godbolt.org/z/aWsvzW751 ?
What about the case where the backend isn't even meant to recover, like https://godbolt.org/z/4Tbb3YG3s ?

Thanks for the case you mentioned. Maybe I can add somemore condition to avoid the regression but now I think a rule for one-use is important for me.

For instcombine, basically, the rule is: do not increase IR instruction count.
If there is no performance bugreport, i'm not sure why we should deviate here. (same for D123453)

bcl5980 added a comment.EditedApr 21 2022, 10:54 AM

This is not consistent with our general folding rules.

Can you help to explain the general folding rules? I'm a beginner so it will be grateful you can tell me the detail rules.
At least tell me what case can add one-use and what case can't add. There are many one-use in instcombine so it is important for me.

What does "may" mean? We'd trade sequential mul-shl for two independent mul.
Former would be sequential anyway, while two mul may or may not be sequential depending on how bad the microarchitecture is.

I think sequential mul-shr is also very easy to fold to two independent mul in backend but not the other way. If the microarch have enough mul port and don't care too much power they can do this in backend.

Is there an actual performance bugreport?

No performance report. But the general IR change is also hard to get a large set perf data I think.

Does this regress e.g. the following case: https://godbolt.org/z/aWsvzW751 ?
What about the case where the backend isn't even meant to recover, like https://godbolt.org/z/4Tbb3YG3s ?

Thanks for the case you mentioned. Maybe I can add somemore condition to avoid the regression but now I think a rule for one-use is important for me.

For instcombine, basically, the rule is: do not increase IR instruction count.
If there is no performance bugreport, i'm not sure why we should deviate here. (same for D123453)

I'm sorry but I mean the rule for one-use usage. Actually we have a lot of one-use in instcombine. And most of them also follow the rule do not increase IR instruction count.
I have already reverted this commit. and I can revert D123453 also. But it will be helpful if we can reach an agreement.

We may be going in circles here.
Both of these transforms do not require one-use check, neither from correctness perspective,
nor from the profitability perspective (at least, as instcombine views it).
I'm not sure it will be productive if i simply repeat everything i have already stated,
so i'm not sure how to approach the disagreement differently.

One other reason I insist one-use is we can give programmer choice. If we remove one-use, there is no choice for programmer.
https://godbolt.org/z/Wrzh5aczf
https://godbolt.org/z/1nGf88E87

@spatel Do you agree with @lebedev.ri to revert both one-use? If yes I will revert both of them.

One other reason I insist one-use is we can give programmer choice. If we remove one-use, there is no choice for programmer.
https://godbolt.org/z/Wrzh5aczf
https://godbolt.org/z/1nGf88E87

Apologies, i do not really understand what you mean by a choice here.

@spatel Do you agree with @lebedev.ri to revert both one-use? If yes I will revert both of them.

Note that i was not asking for the revert, and there is really no need to revert the D123453, if needed it can be adjusted in a new commit.

One other reason I insist one-use is we can give programmer choice. If we remove one-use, there is no choice for programmer.
https://godbolt.org/z/Wrzh5aczf
https://godbolt.org/z/1nGf88E87

Ideally, the compiler would not give the programmer a "choice" of multiply asm output on these examples. We want to be able to reduce logically equivalent code to the same IR (canonicalization) and then optimize it to the ideal target assembly.

Leaving holes in the canonicalization makes it harder to optimize maximally and consistently. That can also be frustrating for users when they think they have written the same program, but it compiles to different output.
...but we will never be able to canonicalize all code patterns - there are just too many potential variations in source and IR.

As noted, the general rule for instcombine is "don't increase the instruction count". In addition, if we can reduce the number of uses of variables or reduce the dependency chain, then that is usually good. That is the reason we would leave out the one-use check in the test example for this patch - "C" can be calculated directly from "A" rather than using the intermediate "B" value.

The rules can be confusing because we have several exceptions:

  1. As noted here, some operations may be more expensive (or not legal) for some targets, and it can be difficult to invert/decompose the transform in the backend, so we avoid some transforms.
  2. Sometimes by removing an intermediate use it becomes harder for the backend to match a longer sequence as a target-specific operation, so we avoid a transform. There are recent examples of icmp one-use transforms where this happens.
  3. There are operations like div that we know are difficult to analyze in IR and create expensive codegen, so we convert those to multiple simpler instructions in IR.
  4. We may choose a form like "icmp ne (and X, 1), 0" rather than "trunc X to i1" because analysis was historically better for icmp (and now it might just be too much work to change).

I don't think we need to revert D123453. As discussed there, we could have made that patch consistent with this transform (no one-use check). I assumed from the codegen examples that there could be a performance regression without one-use, but there could also be improvements. We can remove the one-use as a follow-up to that patch and see if anyone complains.

One other reason I insist one-use is we can give programmer choice. If we remove one-use, there is no choice for programmer.
https://godbolt.org/z/Wrzh5aczf
https://godbolt.org/z/1nGf88E87

Apologies, i do not really understand what you mean by a choice here.

Programmer can determine the final result is mul+shift or two independent mul.
if they write with a chain we compile to mul+shift chain, and if they write 2 independent mul we compile to 2 mul

unsigned a = i * 100; -->    mul 
unsigned b = a * 4;   -->    shift
unsigned a = i * 100;   -->    mul1
unsigned b = i * 400;   -->    mul2

Let programmer to decide which pattern is better. I'm pretty sure on GPU(CUDA/HIP) or power sensitive CPU plaform we prefer mul+shift.
Otherwise they need backend to do this for all these platform.

@spatel Do you agree with @lebedev.ri to revert both one-use? If yes I will revert both of them.

Note that i was not asking for the revert, and there is really no need to revert the D123453, if needed it can be adjusted in a new commit.

Yeah, I mean do we need a new commit to remove one-use on D123453.

bcl5980 added a comment.EditedApr 21 2022, 12:15 PM

One other reason I insist one-use is we can give programmer choice. If we remove one-use, there is no choice for programmer.
https://godbolt.org/z/Wrzh5aczf
https://godbolt.org/z/1nGf88E87

Ideally, the compiler would not give the programmer a "choice" of multiply asm output on these examples. We want to be able to reduce logically equivalent code to the same IR (canonicalization) and then optimize it to the ideal target assembly.

Leaving holes in the canonicalization makes it harder to optimize maximally and consistently. That can also be frustrating for users when they think they have written the same program, but it compiles to different output.
...but we will never be able to canonicalize all code patterns - there are just too many potential variations in source and IR.

As noted, the general rule for instcombine is "don't increase the instruction count". In addition, if we can reduce the number of uses of variables or reduce the dependency chain, then that is usually good. That is the reason we would leave out the one-use check in the test example for this patch - "C" can be calculated directly from "A" rather than using the intermediate "B" value.

The rules can be confusing because we have several exceptions:

  1. As noted here, some operations may be more expensive (or not legal) for some targets, and it can be difficult to invert/decompose the transform in the backend, so we avoid some transforms.
  2. Sometimes by removing an intermediate use it becomes harder for the backend to match a longer sequence as a target-specific operation, so we avoid a transform. There are recent examples of icmp one-use transforms where this happens.
  3. There are operations like div that we know are difficult to analyze in IR and create expensive codegen, so we convert those to multiple simpler instructions in IR.
  4. We may choose a form like "icmp ne (and X, 1), 0" rather than "trunc X to i1" because analysis was historically better for icmp (and now it might just be too much work to change).

I don't think we need to revert D123453. As discussed there, we could have made that patch consistent with this transform (no one-use check). I assumed from the codegen examples that there could be a performance regression without one-use, but there could also be improvements. We can remove the one-use as a follow-up to that patch and see if anyone complains.

Thanks for the detail explaination. Is that means our general rule is don't add one-use first, but if there are some perf regressions happened then we add it ?

Thanks for the detail explaination. Is that means our general rule is don't add one-use first, but if there are some perf regressions happened then we add it ?

The general rule is to make the canonicalization with as few restrictions as possible, so avoid one-use checks if possible. But if we know or suspect that there can be regressions, we try to fix those by creating other (inverse) transforms in the backend for example.
But it is fine to be more conservative (add one-use) in an initial patch, remove it as an enhancement, and see if there are problems. With smaller changes to the compiler, it is generally easier to analyze the resulting performance changes in applications.

bcl5980 added inline comments.Apr 21 2022, 7:49 PM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
971

@lebedev.ri @spatel
Based on the rules, can we remove this one-use also ?
(C2 << X) << C1 dependent on C2 << X
(C2 << C1) << X is independent with C2 << X

One similar case udiv+lshr, do we need add one-use if we add the transform?
https://alive2.llvm.org/ce/z/wk6n3q

bcl5980 added inline comments.Apr 21 2022, 8:12 PM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
971

add general proof for udiv+lshr:
https://alive2.llvm.org/ce/z/CfMqxN

Update general proof for udiv+lshr:
https://alive2.llvm.org/ce/z/6s7Hpc

spatel added inline comments.Apr 29 2022, 8:43 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
971

Sorry - I lost track of the comments here.

For the 1st question (shift-of-shift), the answer is: yes - we can remove that one-use limitation.
It looks like that transform was added long ago with 16f18ed7b555bce5163f , and there are no tests for multi-use, so we need to add those first to make sure we have coverage.

For the 2nd question (shift-of-udiv), the answer is: yes - we want a one-use limitation on that fold (at least as a first step).
That is because there is very little optimization potential for udiv in IR, and we know that there is a potential regression for codegen (on all targets that I know of) if we replace a shift with a div. If we can show that the backend can remove that div and avoid regressions, then we can consider removing the one-use check.

@spatel, thanks for the answer.

Based on the discussion I review all one-use in instcombine and I find some more similar cases without TODO comments:

Instruction *InstCombinerImpl::visitAdd

usub.sat(A, B) + B => umax(A, B)
    m_OneUse(m_Intrinsic<Intrinsic::usub_sat>(m_Value(A), m_Value(B)))

Instruction *InstCombinerImpl::visitSub

X - usub.sat(X, Y) => umin(X, Y)
    m_OneUse(m_Intrinsic<Intrinsic::usub_sat>(m_Specific(Op0), m_Value(Y)))

Instruction *InstCombinerImpl::visitFAdd

fadd (rdx 0.0, X), Y --> rdx Y, X
    m_OneUse(m_Intrinsic<Intrinsic::vector_reduce_fadd>(m_AnyZeroFP(), m_Value(X)))

Instruction *InstCombinerImpl::visitFNeg

-(X - Y) --> (Y - X)
    m_OneUse(m_FSub(m_Value(X), m_Value(Y)))

static Instruction *foldComplexAndOrPatterns

(~(A | B) & C) | ~(A | C) --> ~((B & C) | A)
(~(A & B) | C) & ~(A & C) --> ~((B | C) & A)
    m_OneUse(m_c_BinOp(Opcode, m_Specific(A), m_Specific(C)))

(~(A | B) & C) | ~(B | C) --> ~((A & C) | B)
(~(A & B) | C) & ~(B & C) --> ~((A | C) & B)
    m_OneUse(m_c_BinOp(Opcode, m_Specific(B), m_Specific(C)))

(~A & B & C) | ~(A | B) --> (C | ~B) & ~A
(~A | B | C) & ~(A & B) --> (C & ~B) | ~A
    m_OneUse(m_c_BinOp(Opcode, m_Specific(A), m_Specific(B)))

(~A & B & C) | ~(A | C) --> (B | ~C) & ~A
(~A | B | C) & ~(A & C) --> (B & ~C) | ~A
    m_OneUse(m_c_BinOp(Opcode, m_Specific(A), m_Specific(C))

static Instruction *foldBitCastBitwiseLogic

bitcast(logic(bitcast(X), Y)) --> logic'(X, bitcast(Y))
    m_OneUse(m_BitCast(m_Value(X)))

bitcast(logic(Y, bitcast(X))) --> logic'(bitcast(Y), X)
    m_OneUse(m_BitCast(m_Value(X)))

bitcast(select(Cond, bitcast(X), Y)) --> select'(Cond, X, bitcast(Y))
    m_OneUse(m_BitCast(m_Value(X))))

bitcast(select(Cond, Y, bitcast(X))) --> select'(Cond, bitcast(Y), X)
    m_OneUse(m_BitCast(m_Value(X))))

Instruction *InstCombinerImpl::foldICmpEquality

Transform "icmp eq (trunc (lshr(X, cst1)), cst" to "icmp (and X, mask), cst"
    m_OneUse(m_LShr(m_Value(A), m_ConstantInt(ShAmt)))

static Value *foldMulSelectToNegate

mul (select Cond, 1, -1), OtherOp --> select Cond, OtherOp, -OtherOp
mul OtherOp, (select Cond, 1, -1) --> select Cond, OtherOp, -OtherOp
    m_OneUse(m_Select(m_Value(Cond), m_One(), m_AllOnes()))

mul (select Cond, -1, 1), OtherOp --> select Cond, -OtherOp, OtherOp
mul OtherOp, (select Cond, -1, 1) --> select Cond, -OtherOp, OtherOp
    m_OneUse(m_Select(m_Value(Cond), m_AllOnes(), m_One()))

fmul (select Cond, 1.0, -1.0), OtherOp --> select Cond, OtherOp, -OtherOp
fmul OtherOp, (select Cond, 1.0, -1.0) --> select Cond, OtherOp, -OtherOp
    m_OneUse(m_Select(m_Value(Cond), m_SpecificFP(1.0), m_SpecificFP(-1.0)))

fmul (select Cond, -1.0, 1.0), OtherOp --> select Cond, -OtherOp, OtherOp
fmul OtherOp, (select Cond, -1.0, 1.0) --> select Cond, -OtherOp, OtherOp
    m_OneUse(m_Select(m_Value(Cond), m_SpecificFP(-1.0), m_SpecificFP(1.0)))

Instruction *InstCombinerImpl::visitFMul

(C1 / X) * C --> (C * C1) / X; list here, but based on the discussion with div, I don't think we should remove one-use
    m_OneUse(m_FDiv(m_Constant(C1), m_Value(X)))

Instruction *InstCombinerImpl::visitShl

(C2 << X) << C1 --> (C2 << C1) << X
    m_OneUse(m_Shl(m_Constant(C2), m_Value(X))))

Instruction *InstCombinerImpl::foldICmpBinOpEqualityWithConstant

Replace ((add A, B) != C) with (A != C-B) if B & C are constants.
    BO->hasOneUse()

For the xor case, we can xor two constants together, eliminating the explicit xor.
Replace ((xor A, B) != 0) with (A != B)
    BO->hasOneUse()

Instruction *InstCombinerImpl::foldICmpEquality

A^c1 == C^c2 --> A == C^(c1^c2)
    Op1->hasOneUse()

Transform (zext A) == (B & (1<<X)-1) --> A == (trunc B)
    Op0->hasOneUse()
and           (B & (1<<X)-1) == (zext A) --> A == (trunc B)
    Op1->hasOneUse()

@spatel @lebedev.ri
Do you think we need to remove these one-use?
If yes I will start on work it to help improve code consistent.

spatel added a comment.May 2 2022, 7:37 AM

Do you think we need to remove these one-use?
If yes I will start on work it to help improve code consistent.

Some of these may be intentionally limited to one-use while others were just coded conservatively to avoid controversy.

As I said earlier, I don't think it is possible to be 100% consistent (without a lot of work in the backend). So I think it would be better to spend your time on problems that will give us a known improvement instead of a theoretical improvement. There are many open issues for optimization, or you can look for new opportunities by analyzing real applications.