This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Extracting common and-mask for shift operands of Or instruction
Needs ReviewPublic

Authored by opaparo on Oct 30 2017, 8:24 AM.

Details

Summary

Adding an InstCombine transformation:
((V<<C3)&C1) | ((V<<C4)&C2) --> ((V&C5)<<C3) | ((V&C5)<<C4), if C5 = C1>>C3 == C2>>C4, for both logical shifts.
When executed, this transforms five instructions into four, saving one instruction.

These patterns will also be transformed:
((V&C5)<<C3) | ((V<<C4)&C2) --> ((V&C5)<<C3) | ((V&C5)<<C4)
((V<<C3)&C1) | ((V&C5)<<C4) --> ((V&C5)<<C3) | ((V&C5)<<C4)

Diff Detail

Repository
rL LLVM

Event Timeline

opaparo created this revision.Oct 30 2017, 8:24 AM
opaparo updated this revision to Diff 122612.Nov 12 2017, 11:50 PM
opaparo edited the summary of this revision. (Show Details)

Adding a pattern match for a shift of and ((V&C1)<<C2).
Although and of shift is the canonical form, this new form is also required in some cases. The new test multiuse3 demonstrate such a case.

spatel edited edge metadata.Nov 13 2017, 8:19 AM

Why do we canonicalize shift-left before 'and'?

Ie, shouldn't we prefer this:

define i8 @andshl(i8 %x) {
  %and = and i8 %x, 1
  %shl = shl i8 %and, 3
  ret i8 %shl
}

instead of this:

define i8 @andshl(i8 %x) {
  %and = shl i8 %x, 3
  %shl = and i8 %and, 8
  ret i8 %shl
}

...because doing the 'and' before the shift always uses a smaller constant?

zvi added inline comments.Nov 20 2017, 11:00 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
18

May be a nitpick of mine, but i would find it easier to follow these helpers if the argument and variable names would match the comments. E.g. Source -> V, ShifyBy -> C2, PreShiftMask ->C1 ...

lsaba added inline comments.Nov 21 2017, 3:26 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
11

both logical shifts (shl / lshr)

81

Check that it's the same logical shift

lsaba accepted this revision.Nov 21 2017, 3:26 AM

LGTM after fixing the minor comments

This revision is now accepted and ready to land.Nov 21 2017, 3:26 AM
spatel requested changes to this revision.Nov 24 2017, 7:12 AM

Please answer my question. It seems like that canonicalization might simplify or obviate the need for this patch.

This revision now requires changes to proceed.Nov 24 2017, 7:12 AM
davide removed a subscriber: davide.Nov 25 2017, 12:32 AM

Why do we canonicalize shift-left before 'and'?

Ie, shouldn't we prefer this:

define i8 @andshl(i8 %x) {
  %and = and i8 %x, 1
  %shl = shl i8 %and, 3
  ret i8 %shl
}

instead of this:

define i8 @andshl(i8 %x) {
  %and = shl i8 %x, 3
  %shl = and i8 %and, 8
  ret i8 %shl
}

...because doing the 'and' before the shift always uses a smaller constant?

Hi,
Sorry for the delayed response.

I'm not sure I understand why the suggested canonization might simplify or obviate the need for this patch.
Consider my use case "multiuse3". Although InstCombine normally recognizes and canonize something of the form

%1 = and i32 %x, 96
%2 = shl nuw nsw i32 %1, 6

It will not in this case as '%1' has more than one use, and having one use is a condition for this transformation. If my transformation wouldn't consider the non-canonical in addition to the canonical form it could not handle this case.
One may argue that changing the canonical form will yield this code:

%1 = and i32 %x, 96
%2 = shl nuw nsw i32 %1, 6
%3 = lshr exact i32 %1, 1
%4 = and i32 %x, 30
%5 = shl nuw nsw i32 %4, 6
%6 = or i32 %2, %5
%7 = lshr exact i32 %4, 1
%8 = or i32 %3, %7
%9 = or i32 %8, %6
ret i32 %9

Which will then simplify the patch as I will not be required to consider the non-canonical form. However:

  1. This will be true only if the two shifts are shl. In this example one of them is a lshr, so the transformation will not actually happen and the non-canonical form still needs to be considered.
  2. This canonization will only occur if the intermediate results, i.e. "%4 = shl i32 %x, 6" and "%7 = lshr i32 %x, 1" have only one use. Suppose the scenario was a bit different and those values were used somewhere along the road. In this case the canonization would not happen and again I'll have to consider the non-canonical form.

As far as I understand this is the way that the suggested canonization might simplify or obviate the need for this patch. If you meant something else, could you please elaborate?

I'm not sure I understand why the suggested canonization might simplify or obviate the need for this patch.
Consider my use case "multiuse3". Although InstCombine normally recognizes and canonize something of the form

I think inverting the canonicalization of shl+and would make your first test case optimize without this patch, so that's actually where I paused in reviewing the patch. Have you investigated that possibility? Currently, we end up inverting the canonicalization in the x86 backend (because a smaller constant mask can be created in less instruction bytes), so it would be better to "get it right" here in IR in the first place.

I understand the multi-use case better now with your explanation, so I agree that we want this patch to handle those cases too. But I don't think we should ignore the underlying canonicalization choices just because we know we want to catch the larger patterns.

I think inverting the canonicalization of shl+and would make your first test case optimize without this patch

Could you please explain why? I'm not sure I'm seeing it.

Currently, we end up inverting the canonicalization in the x86 backend (because a smaller constant mask can be created in less instruction bytes), so it would be better to "get it right" here in IR in the first place.
I understand the multi-use case better now with your explanation, so I agree that we want this patch to handle those cases too. But I don't think we should ignore the underlying canonicalization choices just because we know we want to catch the larger patterns.

I agree that this alternative canonization could prove to be beneficial and more correct. However, I feel that this discussion is orthogonal to this patch, and if it would indeed be decided to switch to the new form then some of the code of this patch, along with several other pieces of code, may need to change accordingly.

I think inverting the canonicalization of shl+and would make your first test case optimize without this patch

Could you please explain why? I'm not sure I'm seeing it.

If we invert the shl+and transform, we don't need this patch to reach optimal code for 3 out of the 6 tests:

define i32 @or_and_shifts1(i32 %x) {
  %1 = and i32 %x, 1
  %2 = shl nuw nsw i32 %1, 3
  %3 = and i32 %x, 1   <-- CSE will eliminate this
  %4 = shl nuw nsw i32 %3, 5
  %5 = or i32 %2, %4
  ret i32 %5
}

Similarly (what does this test check that is different from the above?):

define i32 @or_and_shift_shift_and(i32 %x) {
  %1 = and i32 %x, 7
  %2 = shl nuw nsw i32 %1, 3
  %3 = and i32 %x, 7  <-- CSE will eliminate this
  %4 = shl nuw nsw i32 %3, 2
  %5 = or i32 %2, %4
  ret i32 %5
}

And again:

define i32 @multiuse2(i32 %x) {
  %1 = and i32 %x, 126
  %2 = shl nuw nsw i32 %1, 8
  %3 = and i32 %x, 126 <-- CSE will eliminate this
  %4 = shl nuw nsw i32 %3, 1
  %5 = or i32 %2, %4
  ret i32 %5
}

Currently, we end up inverting the canonicalization in the x86 backend (because a smaller constant mask can be created in less instruction bytes), so it would be better to "get it right" here in IR in the first place.
I understand the multi-use case better now with your explanation, so I agree that we want this patch to handle those cases too. But I don't think we should ignore the underlying canonicalization choices just because we know we want to catch the larger patterns.

I agree that this alternative canonization could prove to be beneficial and more correct. However, I feel that this discussion is orthogonal to this patch, and if it would indeed be decided to switch to the new form then some of the code of this patch, along with several other pieces of code, may need to change accordingly.

Since we can eliminate the need for this patch in half of the tests (note: I checked in the tests at rL319182 , so we can see what they look like currently), I don't think the underlying transform is orthogonal. If this patch would change with the inverted canonicalization, then that's more reason to view the inversion as a preliminary step for this patch. Otherwise, we're adding code unnecessarily. It's possible that inverting shl+and inhibits other folds, and if that's the case, then why not fix that too?

Here's the draft patch I used to check the tests above:

Index: lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
===================================================================
--- lib/Transforms/InstCombine/InstCombineAndOrXor.cpp	(revision 319170)
+++ lib/Transforms/InstCombine/InstCombineAndOrXor.cpp	(working copy)
@@ -1212,6 +1212,13 @@
       return BinaryOperator::CreateOr(And, ConstantInt::get(I.getType(),
                                                             Together));
     }
+    const APInt *ShlC;
+    if (match(Op0, m_OneUse(m_Shl(m_Value(X), m_APInt(ShlC))))) {
+      Constant *NewMask = ConstantInt::get(I.getType(), C->lshr(*ShlC));
+      Value *NewAnd = Builder.CreateAnd(X, NewMask);
+      return BinaryOperator::CreateShl(NewAnd, ConstantInt::get(I.getType(),
+                                                                *ShlC));
+    }
 
     // If the mask is only needed on one incoming arm, push the 'and' op up.
     if (match(Op0, m_OneUse(m_Xor(m_Value(X), m_Value(Y)))) ||
Index: lib/Transforms/InstCombine/InstCombineShifts.cpp
===================================================================
--- lib/Transforms/InstCombine/InstCombineShifts.cpp	(revision 319170)
+++ lib/Transforms/InstCombine/InstCombineShifts.cpp	(working copy)
@@ -505,7 +505,7 @@
       // If the operand is a bitwise operator with a constant RHS, and the
       // shift is the only use, we can pull it out of the shift.
       const APInt *Op0C;
-      if (match(Op0BO->getOperand(1), m_APInt(Op0C))) {
+      if (match(Op0BO->getOperand(1), m_APInt(Op0C)) && !isLeftShift) {
         if (canShiftBinOpWithConstantRHS(I, Op0BO, *Op0C)) {
           Constant *NewRHS = ConstantExpr::get(I.getOpcode(),
                                      cast<Constant>(Op0BO->getOperand(1)), Op1);
spatel added inline comments.Nov 29 2017, 6:08 AM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
69

Why are we looking for a pattern that early-cse can simplify? I think this is beyond the scope of instcombine.

spatel added a comment.Dec 5 2017, 7:19 AM

Both this and D38037 are trying to start a pattern match with an 'or', but I'm curious if there's a 'trunc' in the larger source that creates these patterns? Either way, we're missing something bigger than patterns that start with 'or'.

For example, I was looking at PR31667:
https://bugs.llvm.org/show_bug.cgi?id=31667

Name: sub_mask_shift

%and1 = lshr i32 %x, 3
%shr1 = and i32 %and1, 8191
%and2 = lshr i32 %x, 1
%shr2 = and i32 %and2, 32767
%r = sub i32 %shr1, %shr2

...which was filed as a backend bug, but we wouldn't handle that in IR either:
https://rise4fun.com/Alive/id4

So I think there's some more general sequence that we want to capture and optimize, but it may be difficult to justify as part of instcombine?

Note that there is a proposal for a new pass where all of these might find a home:
D38313

opaparo set the repository for this revision to rL LLVM.Dec 14 2017, 5:49 AM
opaparo updated this revision to Diff 126938.Dec 14 2017, 5:53 AM

Rebasing on parent revision and adding more test cases.

Both this and D38037 are trying to start a pattern match with an 'or', but I'm curious if there's a 'trunc' in the larger source that creates these patterns? Either way, we're missing something bigger than patterns that start with 'or'.

For example, I was looking at PR31667:
https://bugs.llvm.org/show_bug.cgi?id=31667

Name: sub_mask_shift

%and1 = lshr i32 %x, 3
%shr1 = and i32 %and1, 8191
%and2 = lshr i32 %x, 1
%shr2 = and i32 %and2, 32767
%r = sub i32 %shr1, %shr2

...which was filed as a backend bug, but we wouldn't handle that in IR either:
https://rise4fun.com/Alive/id4

So I think there's some more general sequence that we want to capture and optimize, but it may be difficult to justify as part of instcombine?

Note that there is a proposal for a new pass where all of these might find a home:
D38313

(Now abandoned) D38037 was a the first draft of this review.
There is no 'trunc' in the larger source that creates these patterns. My patch may address these issues you mentioned, but it is not directly related to them.

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
69

Please check out the newly added tests multiuse4 and multiuse5.
I believe they illustrate that this is indeed an instcombine transformation.

opaparo updated this revision to Diff 127365.Dec 18 2017, 8:01 AM

Rebasing on top of the new parent review D41354