Page MenuHomePhabricator

[AggressiveInstCombine] Fold redundant masking operations of shifted value
AbandonedPublic

Authored by dnsampaio on Jul 12 2018, 5:44 AM.

Details

Summary

Allow to reduce redundant shift masks

Let
OR = x1 | x2, where
x1 = x & 0xAB00
x2 = (x >> 8) & 0xAB

The x2 operation can be seen as
x2 = (x >> 8) & (0xAB00 >> 8)

>

x2 = (x & 0xAB00) >> 8

And finally reduced to
x2 = x1 >> 8

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lebedev.ri added inline comments.Jul 12 2018, 5:53 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2838 ↗(On Diff #155155)

Is this even for instcombine?
I wonder if this should be aggressiveinstcombine, or something else?

spatel added inline comments.Jul 12 2018, 6:51 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2838 ↗(On Diff #155155)

Right - you'll notice that in my potential suggestions in D48278, I did *not* include instcombine. I think this has the same problem as that patch, but when you do it in instcombine, it gets multiplied by another 8x factor because we run instcombine so many times in the normal opt pass pipelines.

In general, we don't walk user lists in instcombine because that's not efficient (same concern that @efriedma raised in the other patch I think). On top of that, this patch is using recursive value tracking which is also expensive.

I suggest looking at (new)-gvn or early-cse to see if we can find the redundant op efficiently.

dnsampaio updated this revision to Diff 155329.Jul 13 2018, 1:11 AM
dnsampaio marked 3 inline comments as done.

Constrain to allow the transformation to happen only when the masked value has only 2 users (an AND and a SHIFT).
Removed value tracking operations.

Fixed execution costs. See below.

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2838 ↗(On Diff #155155)

If it is due the running costs, that is fixed. If is due the complexity of it ... I believe it is quite straight forward no?

2838 ↗(On Diff #155155)

No problems. Can remove both value tracking and iteration as to work in cases where the masked value only has 2 users, an and and a shift operation.

dnsampaio updated this revision to Diff 155377.Jul 13 2018, 7:18 AM

Moved test to correct folder

dnsampaio updated this revision to Diff 155642.Jul 16 2018, 3:28 AM
dnsampaio marked 2 inline comments as done.

Using hasNUses() instead of numUses() ==

  1. Please always upload all patches with the full context (-U99999)
  2. I think this was said in the other review, but while i acknowledge the missing fold, i'm not convinced that this approach is the right one. I think this should be done in smaller, fine-grained steps. I think, first one would be to canonicalize the and-mask before shifts. https://rise4fun.com/Alive/rOb Incidentally, that would already solve these, at least at -O3: https://godbolt.org/g/CkJzQd
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1416 ↗(On Diff #155642)

Newline

2838 ↗(On Diff #155155)

No the point. What @spatel said:

In general, we don't walk user lists in instcombine because that's not efficient (same concern that @efriedma raised in the other patch I think).

We simply don't.
For this to be in instcombine, you'd need to match the %4 = or i32 %2, %3 and look at it's parents.

Or maybe this should be in some other pass.

test/Transforms/InstCombine/FoldRedundantShiftedMasks.ll
26–31 ↗(On Diff #155642)

Please cleanup the tests a little, prefix all these numeric variables with explicit tmp prefix.

  1. Please always upload all patches with the full context (-U99999)

Sorry, I thought that it being an entire function it wouldn't matter. Lesson learned.

  1. I think this was said in the other review, but while i acknowledge the missing fold, i'm not convinced that this approach is the right one. I think this should be done in smaller, fine-grained steps. I think, first one would be to canonicalize the and-mask before shifts.

The problem with this canonicalization process is that, it is not certainly reducing code-size. It might be profitable in certain cases, and problematic in others. I won't take place in the decision if the normalization should be changed, as I do not have a strong opinion about it.
Also about from the canonicalization thread, I understood that they wanted to canonicalize only shl. It won't help in my case where I want the redundant and of ashr and lshr results to be handled.

About using the value uses in InstCombine. Well, it wouldn't be the 1st transformation to do it. The idea to start matching all the way from the bottom, like from the "or" operation and match both sides of it could be a solution. But is much less generic. It would require to be called in all binary operation as starting point as to cover the same cases of this one.

So what do you think is the best?
Create a separate pass? If so, when should it be called, as to maximize the chances of the pattern being detected?

Use InstCombine and start all the way from the bottom? Should it be added to all binary operations or do I leave it just in the 'or' operation?

Or leave as it is, apart from the corrections just described?

All of your test cases are rooted at an or, so it makes sense to search up from there. Why not start with searching just from or (and xors?) and then add the search from more operators in later patches?

dnsampaio updated this revision to Diff 156473.Jul 20 2018, 6:08 AM
dnsampaio marked an inline comment as done.
dnsampaio edited the summary of this revision. (Show Details)

Detect desired pattern from the binary operation using the results.

dnsampaio updated this revision to Diff 156475.Jul 20 2018, 6:16 AM
labrinea added inline comments.Jul 23 2018, 7:10 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2024 ↗(On Diff #156475)

You could add a more descriptive comment like:

// Fold redundant masking.
// Let C3 = C1 lsh C2
// (X & C1) | ((X lsh C2) & C3) -> (X & C1) | ((X & C1) lsh C2)
// Also handles the commutative cases.
2027 ↗(On Diff #156475)

You could change that into m_c_Or to cover the commutative case too. Then you can get rid of the code duplication.

2031 ↗(On Diff #156475)

You could replace the lines 2031-2035 with

BinaryOperator *Op0 = cast<BinaryOperator>(I.getOperand(0));
BinaryOperator *Op1 = cast<BinaryOperator>(I.getOperand(1));
       
BinaryOperator *And, *Shift;
if ((Shift = dyn_cast<BinaryOperator>(Op0->getOperand(0))))
  And = Op1;
else if ((Shift = dyn_cast<BinaryOperator>(Op0->getOperand(1))))
  And = Op1;
else if ((Shift = dyn_cast<BinaryOperator>(Op1->getOperand(0))))
  And = Op0;
else
  And = Op0;

and then remove the commutative pattern match from below (lines 2049-2070).

test/Transforms/InstCombine/FoldRedundantShiftedMasks.ll
4 ↗(On Diff #156475)

The reference to Phabricator is confusing. Please remove it. It'll become irrelevant with time anyway.

samparker added inline comments.Jul 23 2018, 7:13 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2024 ↗(On Diff #156475)

Looks like the convention here is to perform more specific folding after the calls to SimplifyAssociativeOrCommutative, etc.. And there is already a case for '(A & C)|(B & D)'.

2027 ↗(On Diff #156475)

We already know I is an Or.

labrinea added inline comments.Jul 23 2018, 7:38 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2024 ↗(On Diff #156475)

Yeap, Sam is right. Line 2079 of the original source.

2031 ↗(On Diff #156475)

I made a mistake at the else clause. Should be

else {
  Shift = cast<BinaryOperator>(Op1->getOperand(1));
  And = Op0;
}
lebedev.ri added inline comments.Jul 23 2018, 7:50 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2024 ↗(On Diff #156475)

Just put this into a new function and call it from here.

2026–2027 ↗(On Diff #156475)

s/X1/X/

2027 ↗(On Diff #156475)

if (match(&I, m_Or(m_c_And(m_Value(X1), m_APInt(AndC)),

@labrinea is correct.
This is backwards. m_c_And won't ever matter. Constant will be rhs, So it should be m_And.
But this isn't true about m_Or, that one will be missing some commutative cases. It should be m_c_Or.
(i thought i commented that already?)

2028 ↗(On Diff #156475)

Same here, constant will always be RHS, no need for commutativity.

2028 ↗(On Diff #156475)

s/m_Value(X2)/m_Deferred(X)/

2030 ↗(On Diff #156475)

And this is no longer needed.

2049–2052 ↗(On Diff #156475)

This should be

if (match(&I, m_c_Or(m_And(m_OneUse(m_Shift(m_Value(X), m_APInt(ShftAmt))),
                           m_APInt(ShAndC)),
                     m_And(m_Deferred(X), m_APInt(AndC)))) {
2053–2058 ↗(On Diff #156475)

Capture them in match().

dnsampaio marked 13 inline comments as done.Jul 24 2018, 4:50 AM

Did most fixes. Just don't how to capture non-leaf nodes of the pattern being matched. Using other match operations would actually be more complicated than just passing the operands as arguments to the new function, now that I already know they are AND operations due the function call placement.

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2053–2058 ↗(On Diff #156475)

I don't see how to capture nodes that are not the leafs of the pattern. Should I use another match? It would be easier to just check if the first operand of the OR is the "and(X, AndC)" no?

if (match(&I, m_c_Or(m_And(m_OneUse(m_Shift(m_Value(X),
            m_APInt(ShftAmt))), m_APInt(ShAndC)),m_And(
            m_Deferred(X),  m_APInt(AndC))))){
 BinaryOperator *And = I.getOperand(0), *ShtAnd = I.getOperand(1), *Shift;
if(And->getOperand(0) != X)
 std::swap(And, ShtAnd);
Shift = ShtAnd->getOperand(0);
lebedev.ri added inline comments.Jul 24 2018, 4:51 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2053–2058 ↗(On Diff #156475)

See m_CombineAnd()

dnsampaio updated this revision to Diff 157006.Jul 24 2018, 4:55 AM

Moved to a separated function. Placed function call after knowing more about the operands.
Added ashr case, that was being wrongly treated as lshr.
Added comments, including one that argues that this function would be useless if and instructions are move before any type of shift operations.
Using m_c_Or, and passing operands as arguments.

lebedev.ri added inline comments.Jul 24 2018, 5:01 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2026–2036 ↗(On Diff #157006)

See m_CombineAnd().

2042–2046 ↗(On Diff #157006)

This will miscompile with types larger than i64.
Please add tests.

lebedev.ri added inline comments.Jul 24 2018, 5:07 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2031 ↗(On Diff #157006)

This is super subtle, defining more variables after the first one which is being initialized,

2042–2046 ↗(On Diff #157006)

Hmm, no, actually. disregard that.
If the shift is too large, then that old shift would have been folded to undef already.

dnsampaio marked 2 inline comments as done.Jul 24 2018, 8:36 AM
dnsampaio added inline comments.
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2026–2036 ↗(On Diff #157006)

Nice one, it took me some time to understand that I could match and then capture. Thanks. :)

dnsampaio updated this revision to Diff 157047.Jul 24 2018, 8:37 AM
dnsampaio marked an inline comment as done.

All required values are obtained during the pattern matching.

lebedev.ri added inline comments.Jul 24 2018, 8:55 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2022 ↗(On Diff #157047)

I don't think A and B are used anymore?

2028 ↗(On Diff #157047)

We create two new instructions, but we only make sure that one instruction (Shift, i think?) goes away.
The outermost m_And() should too be m_OneUse(), i think.

2033 ↗(On Diff #157047)

Early return please.

dnsampaio marked 2 inline comments as done.Jul 24 2018, 9:14 AM
dnsampaio added inline comments.
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2028 ↗(On Diff #157047)

Actually no. We are eliminating one of the and operations.

x1 = v & 0xFF00
x2 = v >> 8
x3 = x2 & 0xFF
or  = x1 | x3
x4 = x1 + 5;

We eliminate x3 and it becomes

x1 = v & 0xFF00
x3' = x1 >> 8
or'  = x3 | x1
x4 = x1 + 5;

x1 can have more than 1 user. x3 will be remove by dead-code elimination. We replace or.

dnsampaio updated this revision to Diff 157057.Jul 24 2018, 9:15 AM

Removed unused arguments.
Early exit.

dnsampaio added inline comments.Jul 24 2018, 9:19 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2028 ↗(On Diff #157047)

Small correction:

x1 = v & 0xFF00
x3' = x1 >> 8
or'  = x3' | x1
x4 = x1 + 5
dnsampaio updated this revision to Diff 157228.Jul 25 2018, 4:53 AM

Relaxed conditions for which the transformation is applied.
Added more tests for ashr.

dnsampaio updated this revision to Diff 157279.Jul 25 2018, 8:15 AM
dnsampaio marked 2 inline comments as done.Jul 29 2018, 7:11 AM
dnsampaio updated this revision to Diff 157930.Jul 30 2018, 2:23 AM
dnsampaio marked 3 inline comments as done.

Replaces all uses of the innermost and with the new shift.

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2028 ↗(On Diff #157047)

But indeed, we must replace all uses of the innermost AND with the new shift.

lebedev.ri added inline comments.Jul 30 2018, 4:10 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2058 ↗(On Diff #157930)

But how do you know it's dead? I don't think you checked that it is one-use?
And if it is one-use, then it will be dead after the parent instruction will be replaced, so why do we care?
I guess you want to add a bit more tests.

dnsampaio marked an inline comment as done.Jul 30 2018, 4:58 AM

As we are replacing all the uses of the DeadAnd, it is not required to create and replace the visiting Or operation. Replace all uses does the job already.

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2058 ↗(On Diff #157930)

The new shift, being created, holds the same value of the DeadAnd. So we can just replace all uses of the DeadAnd with the new shifted value. Test added, to be uploaded.

dnsampaio marked an inline comment as done.Jul 30 2018, 5:03 AM
dnsampaio added inline comments.
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2058 ↗(On Diff #157930)

Perhaps the naming is misleading. It is "dead" in the sense that the new shift holds the same value, and all it's uses must be replaced. But no, it is not restricted to be a single use. Adding example test.

dnsampaio updated this revision to Diff 157946.Jul 30 2018, 5:07 AM

Added test that the inner and must be replaced by the new shift operation.
Converted the function to bool, as it does not require to create the Or operation after the replaceAll.

spatel added inline comments.Jul 30 2018, 8:11 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2058 ↗(On Diff #157930)

I'll say it again: this is getting weird because we're trying to make instcombine do something it should not be doing.

You crippled the transform to try to make it fit, and now you're trying to expand it to handle the motivating cases.

The most basic case where this transform should fire looks like this:

define void @shift_r1(i32 %x) {
  %r1 = and i32 %x, 172
  %sh = shl i32 %x, 8
  %r2 = and i32 %sh, 44032
  tail call void @use(i32 %r1)
  tail call void @use(i32 %r2)
  ret void
}

-->
define void @shift_r1(i32 %x) {
  %r1 = and i32 %x, 172
  %r2 = shl i32 %r1, 8
  tail call void @use(i32 %r1)
  tail call void @use(i32 %r2)
  ret void
}

There is no 'or' in the pattern. The optimization is about recognizing a common subexpression and using it to remove an instruction. That could be CSE, GVN, or a standalone pass. I don't see how this fits in instcombine.

dnsampaio added inline comments.Aug 2 2018, 3:38 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2058 ↗(On Diff #157930)

Hi Sanjay,
So all expensive operations have been eliminated, I do not see why it wouldn't fit in InstCombne. We detect a pattern and we reduce it.

The transformation won't fit EarlyCSE pass, as it might be required to move the producer and up, such as in:

define void @shift_r1(i32 %x) {
%sh = shl i32 %x, 8
%r2 = and i32 %sh, 44032
tail call void @use(i32 %r2)
%r1 = and i32 %x, 172
tail call void @use(i32 %r1)
ret void
}

We must move %r1 before %sh.

I could very well do a stand-alone BasicBlockPass, but don't think it would have a very appealing reason to be.

spatel added a comment.Aug 2 2018, 6:02 AM

So all expensive operations have been eliminated, I do not see why it wouldn't fit in InstCombne. We detect a pattern and we reduce it.

Because the pattern that we are matching is larger than it needs to be (as the comment in the test file clearly shows - there is no 'or' in the minimal pattern). This problem of trying to make everything fit in instcombine has been discussed several times on llvm-dev in the last ~year. Eg:
http://lists.llvm.org/pipermail/llvm-dev/2017-September/117151.html

Did you look at (new)-GVN to see if it fits in there?

If this is in instcombine (in addition to missing the pattern when there is no 'or'), I think you have to limit the transform based on uses as Roman mentioned in an earlier comment.

So this started life in the DAGCombiner and issues around the implementation were raised and that it would be useful to have earlier in the pipeline. But it seems that it hasn't really be thought, or discussion, about how this would fit well in the existing passes... I think DAG combine has always been the right place for this because we're trying to reuse values - something that DAGs are good for. In DAGCombiner::visitANDLike, we already handle ANDs with SRL operands and the motivating example can be addressed with very little effort:

+        uint32_t ShiftedMask = CAnd->getZExtValue() << ShiftBits;
+        SDNode *Res =
+          DAG.UpdateNodeOperands(N, N0->getOperand(0), DAG.getConstant(ShiftedMask, DL, VT));
+        if (Res != N)
+          return DAG.getNode(ISD::SRL, DL, VT, SDValue(Res, 0), N0->getOperand(1));
+        else
+          return SDValue(DAG.UpdateNodeOperands(N, N0, SDValue(CAnd, 0)), 0);

I don't know the cost of calling UpdateNodeOperands, potentially twice, but I feel the simplicity suggests that the transform is most suited to the DAG.

So all expensive operations have been eliminated, I do not see why it wouldn't fit in InstCombne. We detect a pattern and we reduce it.

Because the pattern that we are matching is larger than it needs to be (as the comment in the test file clearly shows - there is no 'or' in the minimal pattern). This problem of trying to make everything fit in instcombine has been discussed several times on llvm-dev in the last ~year. Eg:
http://lists.llvm.org/pipermail/llvm-dev/2017-September/117151.html

I agree that it won't handle all cases. But one will need to come with a more generic thinking as to create a new pass that handles this. Something like an abstract known bits, that tells that two values hold the same bits coming from a given instruction, or some simplification by demanded bits from the same values. It is feasible, but it is not my intention to do it so now.

Did you look at (new)-GVN to see if it fits in there?

I must confess that I did not quite understand all the work-flow of newGVN, but from what I did see, it mostly wouldn't fit. It seems to behave like InstCombine, expecting to replace the current instruction being visited. And it would require to create one value as to detect if there is a leader of that value and then reuse it. It is not that complicated, but quite awkward IMO.

If this is in instcombine (in addition to missing the pattern when there is no 'or'), I think you have to limit the transform based on uses as Roman mentioned in an earlier comment.

Why? The @mulUses in the test shows the case where it does not require %x2 to have a single user. The shift operation dominates the and being replaced.

And I agree with @samparker, our motivating example is quite simple, is the @foo in our tests. I just intended to made it more generic so it would act in more shift operations, not to over complicate it. I acknowledge that doing the transformation in the IR could be good, but the DAG fits much simpler. So I really see no problems in having it both places (as in D48278). Either way, it would be nice to come to a definitive solution.

dnsampaio updated this revision to Diff 159529.Aug 7 2018, 9:50 AM
dnsampaio retitled this revision from [InstCombine] Fold redundant masking operations of shifted value to [AggressiveInstCombine] Fold redundant masking operations of shifted value.

Moving this pattern matching to AggressiveInstCombine following a suggestion of @lebedev.ri . Now it searches for minimal required patterns as desired by @spatel.

dnsampaio updated this revision to Diff 159530.Aug 7 2018, 9:51 AM

Added missing test-file

dnsampaio edited the summary of this revision. (Show Details)Aug 13 2018, 7:30 AM
lebedev.ri requested changes to this revision.Fri, Jun 21, 10:51 AM
This revision now requires changes to proceed.Fri, Jun 21, 10:51 AM

Hi @lebedev.ri,
Nice you looked this one as I am not quite sure what to do about it. Any suggestions?

Hi @lebedev.ri,
Nice you looked this one as I am not quite sure what to do about it. Any suggestions?

As it can be seen from the disscussion here, while the problem this is trying to solve is real,
the actual solution that should be done is very much non-obvious. It kind-of doesn't anywhere.
Or maybe there's some astounding perf numbers (SPEC?) that justify this solution?
I don't have any further ideas presently, sorry. I only marked it to remove from review queue.

dnsampaio abandoned this revision.Fri, Jun 28, 8:10 AM