Page MenuHomePhabricator

[InstCombine] Combine instructions of type or/and where AND masks can be combined.
ClosedPublic

Authored by bipmis on Apr 20 2022, 12:54 PM.

Details

Summary

The patch simplifies some of the patterns as below

(A | (B & C0)) | (B & C1) -> A | (B & C0|C1)
((B & C0) | A) | (B & C1) -> (B & C0|C1) | A

In some scenarios like byte reverse on half word, we can see this pattern multiple times and this conversion can optimize these patterns.

Diff Detail

Unit TestsFailed

TimeTest
60,090 msx64 debian > LLVM.CodeGen/NVPTX::wmma.py
Script: -- : 'RUN: at line 5'; "/usr/bin/python3.9" /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/NVPTX/wmma.py --ptx=60 --gpu-arch=70 > /var/lib/buildkite-agent/builds/llvm-project/build/test/CodeGen/NVPTX/Output/wmma.py.tmp-ptx60-sm_70.ll

Event Timeline

bipmis requested review of this revision.Apr 20 2022, 12:54 PM
bipmis created this revision.
bipmis edited the summary of this revision. (Show Details)
RKSimon retitled this revision from [AArch64] Combine instructions of type or/and where AND masks can be combined. to [InstCombine] Combine instructions of type or/and where AND masks can be combined..Apr 20 2022, 2:15 PM
RKSimon added a reviewer: nikic.

Sounds great, doing this is InstCombine as opposed to DAG. That should make all the costmodelling and whatnot come out better.

Do you have alive proofs for the changes? And it might be a good idea to add a few more tests, for one use cases and negative checks that shouldn't fire. Precommitting the tests is also a good idea.

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2730

Looks like this was unneeded.

llvm/test/Transforms/InstCombine/and-or.ll
359

You can remove local_unnamed_addr #0

bipmis updated this revision to Diff 424477.Apr 22 2022, 7:43 AM
bipmis edited the summary of this revision. (Show Details)

Thanks. Adding additional test for showing the other commute scenario. Also re-patching with the tests committed and comments incorporated.
Alive link for showing the transform is https://alive2.llvm.org/ce/z/XeMdDp

This seems like a problem with the -reassociate pass - it seems to invert the form we want here.

However, I'm not opposed to another relatively simple reassociation transform here in instcombine (we have many already), but the patch as written is not general enough, missing commuted patterns/tests, and missing some kind of one-use check as noted earlier.

We could generalize to a canonicalization that pushes the 'and' values together, and then the optimization of combining mask constants will fall out from that:
https://alive2.llvm.org/ce/z/ZHgrvL

As code, that would be something like this:

// (A & B) | (C | (A & D)) --> ((A & B) | (A & D)) | C
if (match(Op0, m_And(m_Value(A), m_Value(B))) &&
    match(Op1, m_c_Or(m_Value(C), m_c_And(m_Specific(A), m_Value(D)))))
...

There are already 4 commuted possibilities there, but we'd also need to test for the pattern where "B" is the repeated operand on the right side:

// (A & B) | (C | (B & D)) --> ((A & B) | (B & D)) | C

...and swap the operands of the final 'or'. So 16 total patterns to test for? If we make the matches require constants, we can reduce the number of possibilities (since we know that constants are always op1/right-side), but it's a less general transform.

On 2nd thought, we should model the transform like this:
https://alive2.llvm.org/ce/z/NAgoJM

So it's a missing optimization even without constant operands, and we really do need 16 commuted matches/tests. :)

This seems like a problem with the -reassociate pass - it seems to invert the form we want here.

However, I'm not opposed to another relatively simple reassociation transform here in instcombine (we have many already), but the patch as written is not general enough, missing commuted patterns/tests, and missing some kind of one-use check as noted earlier.

We could generalize to a canonicalization that pushes the 'and' values together, and then the optimization of combining mask constants will fall out from that:
https://alive2.llvm.org/ce/z/ZHgrvL

As code, that would be something like this:

// (A & B) | (C | (A & D)) --> ((A & B) | (A & D)) | C
if (match(Op0, m_And(m_Value(A), m_Value(B))) &&
    match(Op1, m_c_Or(m_Value(C), m_c_And(m_Specific(A), m_Value(D)))))
...

There are already 4 commuted possibilities there, but we'd also need to test for the pattern where "B" is the repeated operand on the right side:

// (A & B) | (C | (B & D)) --> ((A & B) | (B & D)) | C

...and swap the operands of the final 'or'. So 16 total patterns to test for? If we make the matches require constants, we can reduce the number of possibilities (since we know that constants are always op1/right-side), but it's a less general transform.

I agree to this point. We can generalize to a canonicalization that pushes the 'and' values together, as the reassociate pass will then combine the 2 AND's for both register and masks. This will also align to what you have suggested in https://alive2.llvm.org/ce/z/NAgoJM.

Yes for generic scenario, we are looking at 16 combinations. Let me push the test for them first and can then rebase my patch with above implementation.

bipmis updated this revision to Diff 425520.Apr 27 2022, 7:59 AM

Rebasing patch with the changes as suggested with different commuted tests.

spatel added inline comments.Apr 27 2022, 12:52 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2967–2968

This is still missing one-use checks and tests. We should have tests where one or more of the intermediate values has some other use. Look in the test file that you are modifying for tests with call void @use(i8).

If we are creating 3 instructions, then both Op0 and Op1 must have only one use to ensure that we do not end up with more instructions than we started with.

2970

The code comments are confusing - don't reuse 'D' if we've already specified that as the outer-level and instruction.

It seems better to match the final or with a commutative matcher rather than duplicate all of this code twice. It would be something like:

if (match(&I, m_c_Or(m_And(m_Value(A), m_Value(B)), m_Or(m_Value(C), m_Value(D))))
2974

Why use m_c_BinOp rather than m_c_And?

If we are not using 'X', then there is no reason to capture it - just use plain m_Value().

2976

I think it would be better (more efficient) to create the optimal pattern directly rather than relying on subsequent folds to do it for us. We can create the or(and(or...)) sequence right here.

llvm/test/Transforms/InstCombine/and-or.ll
386–387

Notice that pat1-4 are canonicalized to the same form as pat5-8, so these are not testing the patterns that you intended.

You'll need to add an extra instruction to these tests (and also pat1-4 in the next set of tests) to maintain %c as operand 0 of the or.

Search for "thwart complexity-based canonicalization" in the InstCombine test dir for examples of how to do this.

bipmis added inline comments.Apr 28 2022, 2:36 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2970

Ya the method and the comment is based on the current implementation where we do not actually reduce number of instructions, but reorder such that the AND's are now ORed and the existing implementation in InstCombine tryFactorization() folds them and reduces the instructions.

2974

m_c_BinOp was used as it can look for 2 operands in either order which is needed and could reduce extra code.

2976

Right. My point of view is if there is an existing implementation in the instruction combine to do this, should we redo the same here. I checked with some examples and multiple commuted sequences and the existing implementation could combine both constants and registers and reduce the number of instructions which was the intention. In doing so however it is not retaining the order of the operands correctly.

llvm/test/Transforms/InstCombine/and-or.ll
386–387

Right this is possibly due to the existing implementation tryFactorization() which folds the or(and, and) to and(or()). It does it for all scenarios, however likely maintains a fixed position of the operands.

bipmis marked an inline comment as not done.Apr 28 2022, 3:23 AM
bipmis added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2974

But yes we could use m_c_And as well. Thanks for pointing this out.

bipmis marked an inline comment as not done.Apr 28 2022, 11:37 AM

I have modified and committed the tests with modifications requested. Please have a look if this looks ok. Subsequently will rebase my patch.

In the current implementation I have folded (A & B) | (C | (A & D)) -> C | ((A & D) | (A & B)). Number of instructions have not changed as there is existing code instCombine which can do C | ((A & D) | (A & B)) -> C | (A & (D | B)). Advantage to this is less code required for implementing this and as suggested we can reduce the duplicated code and it would now look as below

if (match(&I, m_c_Or(m_OneUse(m_And(m_Value(A), m_Value(B))), m_OneUse(m_Or(m_Value(C), m_Value(D))))))
  {
    if (match(D, m_c_And(m_Specific(A), m_Value())) ||
        match(D, m_c_And(m_Specific(B), m_Value())))
      return BinaryOperator::CreateOr(C, Builder.CreateOr(D, Builder.CreateAnd(A, B)));
    if (match(C, m_c_And(m_Specific(A), m_Value())) ||
        match(C, m_c_And(m_Specific(B), m_Value())))
      return BinaryOperator::CreateOr(Builder.CreateOr(C, Builder.CreateAnd(A, B)), D);
  }

Other option is to directly generate the reduced number of instructions (A & B) | (C | (A & D)) -> C | (A & (D | B)). In this case we will need to introduce more code. Should this be the preferred approach and would be the reason for that. Thanks.

I have modified and committed the tests with modifications requested. Please have a look if this looks ok. Subsequently will rebase my patch.

The changes to prevent commutes look right. I suspect we'll need even more one-use test variations to be sure we have that right, but that should not hold up rebasing/changing the patch.

bipmis updated this revision to Diff 426719.May 3 2022, 8:18 AM

Rebasing after incorporating the comments and test additions. Current implementation folds to a representation which can be reduced by the existing implementation in InstCombine.

@spatel Could you review the rebased patch and let me know if you have any other comments. Thanks.

spatel accepted this revision.May 12 2022, 6:20 AM

Sorry, I lost track of this patch. LGTM - see inline comments for minor improvements.

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2969

I still don't like re-using 'D' in these comments. That's the uncaptured operand in the next match, so I'd just call it '?' to show that it's a "don't care" value.

2981

'C' is still operand 0 of the final 'or' (the same as in the above set of transforms), but these comments show it as operand 1.

This revision is now accepted and ready to land.May 12 2022, 6:20 AM
This revision was landed with ongoing or failed builds.May 16 2022, 4:44 AM
This revision was automatically updated to reflect the committed changes.
alexfh added a subscriber: alexfh.May 30 2022, 5:44 AM

Hi Biplob,
This commit has increased compilation time of certain translation units by a large factor. It raised from ~7s to at least 20 minutes. -ftime-report shows that the code previously put already quite a heavy load on InstCombinePass:

Total Execution Time: 6.0698 seconds (6.0699 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 2.0465 ( 34.4%)   0.0269 ( 23.8%)   2.0734 ( 34.2%)   2.0734 ( 34.2%)  InstCombinePass
 1.4016 ( 23.5%)   0.0360 ( 31.8%)   1.4376 ( 23.7%)   1.4376 ( 23.7%)  ModuleInlinerWrapperPass
 1.3901 ( 23.3%)   0.0360 ( 31.8%)   1.4261 ( 23.5%)   1.4262 ( 23.5%)  DevirtSCCRepeatedPass
 0.3090 (  5.2%)   0.0001 (  0.1%)   0.3091 (  5.1%)   0.3091 (  5.1%)  SROAPass

However after this change some parts of the algorithm may have become exponential. At least I haven't seen the compilation finish yet. We're trying to come up with an isolated test case, but please consider reverting the commit while working on a fix. Thanks!

@alexfh it sounds more likely that there are 2 transforms fighting each other? Can you find a bugpoint test case by just setting a short ish timeout?

@alexfh it sounds more likely that there are 2 transforms fighting each other? Can you find a bugpoint test case by just setting a short ish timeout?

I'm working on a reduced test case. So far I have collected profile from the first few seconds of clang invocation. It may help understanding what's happening:

- 99.58% clang::BackendConsumer::HandleTranslationUnit                                                                                                                                                                            ◆
   - 99.43% clang::EmitBackendOutput                                                                                                                                                                                              ▒
      - 99.43% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline                                                                                                                                                 ▒
         - llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run                                                                                                                                              ▒
            - 99.24% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run                                                                  ▒
               - llvm::ModuleToFunctionPassAdaptor::run                                                                                                                                                                           ▒
                  - 99.24% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run                 ▒
                     - llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run                                                                                                                              ▒
                        - 98.82% llvm::detail::PassModel<llvm::Function, llvm::InstCombinePass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run                                                              ▒
                           - llvm::InstCombinePass::run                                                                                                                                                                           ▒
                              - 98.81% combineInstructionsOverFunction                                                                                                                                                            ▒
                                 - 98.26% llvm::InstCombinerImpl::run                                                                                                                                                             ▒
                                    - 72.80% llvm::InstCombinerImpl::visitOr                                                                                                                                                      ▒
                                       - 21.67% llvm::InstCombinerImpl::SimplifyUsingDistributiveLaws                                                                                                                             ▒
                                          - 18.71% SimplifyOrInst                                                                                                                                                                 ▒
                                             - 9.82% SimplifyAssociativeBinOp                                                                                                                                                     ▒
                                                - 9.44% SimplifyOrInst                                                                                                                                                            ▒
                                                   - 3.44% expandCommutativeBinOp                                                                                                                                                 ▒
                                                      - 3.36% expandBinOp                                                                                                                                                         ▒
                                                         + 3.07% SimplifyOrInst                                                                                                                                                   ▒
                                                   + 2.36% expandBinOp                                                                                                                                                            ▒
                                                     2.19% simplifyOrLogic                                                                                                                                                        ▒
                                             + 3.53% expandBinOp                                                                                                                                                                  ▒
                                               2.36% simplifyOrLogic                                                                                                                                                              ▒
                                             + 1.69% expandCommutativeBinOp                                                                                                                                                       ▒
                                          + 1.58% llvm::InstCombinerImpl::tryFactorization                                                                                                                                        ▒
                                          + 0.68% llvm::Constant::getAllOnesValue                                                                                                                                                 ▒
                                       - 20.35% SimplifyOrInst                                                                                                                                                                    ▒
                                          - 9.20% SimplifyAssociativeBinOp                                                                                                                                                        ▒
                                             - 9.03% SimplifyOrInst                                                                                                                                                               ▒
                                                + 3.94% expandCommutativeBinOp                                                                                                                                                    ▒
                                                + 3.12% expandBinOp                                                                                                                                                               ▒
                                                  1.21% simplifyOrLogic                                                                                                                                                           ▒
                                          + 7.08% expandBinOp                                                                                                                                                                     ▒
                                          - 2.00% expandCommutativeBinOp                                                                                                                                                          ▒
                                             + 1.95% expandBinOp                                                                                                                                                                  ▒
                                            1.29% simplifyOrLogic                                                                                                                                                                 ▒
                                       - 10.74% llvm::InstCombinerImpl::SimplifyDemandedInstructionBits                                                                                                                           ▒
                                          - 10.61% llvm::InstCombinerImpl::SimplifyDemandedUseBits                                                                                                                                ▒
                                             - 9.99% llvm::InstCombinerImpl::SimplifyDemandedBits                                                                                                                                 ▒
                                                - 9.75% llvm::InstCombinerImpl::SimplifyDemandedUseBits                                                                                                                           ▒
                                                   - 6.72% llvm::InstCombinerImpl::SimplifyDemandedBits                                                                                                                           ▒
                                                      + 6.55% llvm::InstCombinerImpl::SimplifyDemandedUseBits                                                                                                                     ▒
                                                   + 1.86% llvm::InstCombinerImpl::SimplifyMultipleUseDemandedBits                                                                                                                ▒
                                       - 9.86% llvm::InstCombinerImpl::SimplifyAssociativeOrCommutative                                                                                                                           ▒
                                          - 9.42% SimplifyOrInst                                                                                                                                                                  ▒
                                             + 4.36% expandCommutativeBinOp                                                                                                                                                       ▒
                                             + 3.20% expandBinOp                                                                                                                                                                  ▒
                                               1.14% simplifyOrLogic                                                                                                                                                              ▒
                                       + 5.44% llvm::InstCombinerImpl::matchBSwapOrBitReverse                                                                                                                                     ▒
                                       + 0.95% llvm::InstCombinerImpl::matchSelectFromAndOr                                                                                                                                       ▒
                                         0.59% llvm::IRBuilderBase::CreateAnd                                                                                                                                                     ▒
                                    - 16.98% llvm::InstCombinerImpl::visitAnd                                                                                                                                                     ▒
                                       + 8.63% llvm::InstCombinerImpl::SimplifyDemandedInstructionBits                                                                                                                            ▒
                                       + 3.31% SimplifyAndInst                                                                                                                                                                    ▒
                                       + 1.59% llvm::InstCombinerImpl::SimplifyAssociativeOrCommutative                                                                                                                           ▒
                                    + 3.26% llvm::InstCombinerImpl::visitLShr                                                                                                                                                     ▒

@alexfh The transformation here is from one form of OR to the other which can actually be handled by the existing implementation in InstCombine. I dont think we saw an issue with compile time in llvm-test-suite. Would be good to analyze the specific test scenario which triggers this issue.

alexfh added a comment.Jun 1 2022, 3:08 AM

I was wrong with the initial analysis. The patch doesn't just make the performance worse. It makes clang loop indefinitely on a certain input.

$ cat q.cc
int e;
int u;
int ag;
void f() {
  int ak = e;
  int al((unsigned char)(ak >> 23) & 925);
  if (ak)
    al = (ak >> 23 & u) | ((unsigned char)(ak >> 23) & 925) | (u >> 23 & 157);
  ag = al;
}

$ time ./clang-15-10515 --target=x86_64--linux-gnu -O1  -c q.cc
^C

real    0m45.072s
user    0m0.025s
sys     0m0.099s

This is definitely a problem that has to be fixed. If you don't have an obvious fix in mind, please revert while investigating.

alexfh added a comment.Jun 1 2022, 3:21 AM

I was wrong with the initial analysis. The patch doesn't just make the performance worse. It makes clang loop indefinitely on a certain input.

$ cat q.cc
int e;
int u;
int ag;
void f() {
  int ak = e;
  int al((unsigned char)(ak >> 23) & 925);
  if (ak)
    al = (ak >> 23 & u) | ((unsigned char)(ak >> 23) & 925) | (u >> 23 & 157);
  ag = al;
}

$ time ./clang-15-10515 --target=x86_64--linux-gnu -O1  -c q.cc
^C

real    0m45.072s
user    0m0.025s
sys     0m0.099s

This is definitely a problem that has to be fixed. If you don't have an obvious fix in mind, please revert while investigating.

A bit cleaner test case:

int f(int a, int b) {
  int c = ((unsigned char)(a >> 23) & 925);
  if (a)
    c = (a >> 23 & b) | ((unsigned char)(a >> 23) & 925) | (b >> 23 & 157);
  return c;
}
alexfh added a comment.Jun 1 2022, 5:20 AM

I've reverted the commit in aa98e7e1eb960712533d39bbc98a05a6e70a9683 to unblock our internal release.

bipmis added a comment.Jun 1 2022, 5:32 AM

@alexfh Verified this is an issue and thanks for reverting. I have the fix for the same which I can commit. Let me know if I can do it now or at a later date. Thanks.

alexfh added a comment.Jun 1 2022, 5:42 AM

@alexfh Verified this is an issue and thanks for reverting. I have the fix for the same which I can commit. Let me know if I can do it now or at a later date. Thanks.

Feel free to commit it. I can then verify it on the original test case.

bipmis updated this revision to Diff 433387.Jun 1 2022, 7:11 AM

Update the patch to fix the test case

int f(int a, int b) {
  int c = ((unsigned char)(a >> 23) & 925);
  if (a)
    c = (a >> 23 & b) | ((unsigned char)(a >> 23) & 925) | (b >> 23 & 157);
  return c;
}
bipmis added a comment.Jun 1 2022, 7:18 AM

@alexfh I think it would be good if you can try the patch and approve for commit if it looks fine. Thanks.

spatel added a comment.Jun 1 2022, 8:18 AM

We should have at least one minimized version of the test that caused the infinite loop in the updated version of the patch or pre-committed, so we guard against that same bug in the future.

I didn't step through to see if more can be removed, but I got it down to this, and it would infinite loop with "opt -instcombine":

declare void @use(i32)

define i32 @f(i32 %a, i32 %b) {
  %shr = ashr i32 %a, 23
  %conv = trunc i32 %shr to i8
  %conv1 = zext i8 %conv to i32
  %and = and i32 %conv1, 925
  call void @use(i32 %and)
  %and3 = and i32 %shr, %b
  %or = or i32 %and3, %and
  %shr8 = ashr i32 %b, 23
  %and9 = and i32 %shr8, 157
  %r = or i32 %or, %and9
  ret i32 %r
}

@alexfh I think it would be good if you can try the patch and approve for commit if it looks fine. Thanks.

I have verified clang with your latest patch on the original file, and it doesn't hang now. However, I'm not an expert in InstCombine and can't properly review the change.

@alexfh I think it would be good if you can try the patch and approve for commit if it looks fine. Thanks.

I have verified clang with your latest patch on the original file, and it doesn't hang now. However, I'm not an expert in InstCombine and can't properly review the change.

And yes, please add a regression test.

bipmis updated this revision to Diff 434433.Jun 6 2022, 4:50 AM

Added regression tests to the patch.

spatel added a comment.Jun 6 2022, 7:20 AM

IIUC, the problem/fix (and it would be good to put something like this in the updated commit message for easier reference):
The previous revision/commit did not check one-use of an intermediate value that this transform re-uses. When that value has another use, an existing transform will try to invert the transform here. By adding one-use checks, we avoid the infinite loops seen with the earlier commit.