Page MenuHomePhabricator

[InstCombine] Fold `(c & ~(a | b)) | (b & ~(a | c))` to `~a & (b ^ c)`
ClosedPublic

Authored by rampitec on Oct 21 2021, 3:20 PM.

Details

Summary
----------------------------------------
define i4 @src(i4 %a, i4 %b, i4 %c) {
%0:
  %or1 = or i4 %a, %b
  %not1 = xor i4 %or1, 15
  %and1 = and i4 %not1, %c
  %or2 = or i4 %a, %c
  %not2 = xor i4 %or2, 15
  %and2 = and i4 %not2, %b
  %or3 = or i4 %and1, %and2
  ret i4 %or3
}
=>
define i4 @tgt(i4 %a, i4 %b, i4 %c) {
%0:
  %xor = xor i4 %b, %c
  %not = xor i4 %a, 15
  %or3 = and i4 %xor, %not
  ret i4 %or3
}
Transformation seems to be correct!

Diff Detail

Event Timeline

rampitec created this revision.Oct 21 2021, 3:20 PM
rampitec requested review of this revision.Oct 21 2021, 3:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 21 2021, 3:20 PM
spatel added a subscriber: anton-afanasyev.

For reference, we did discuss more general boolean logic solving in D112108, and I'm not sure how many more patterns like this or D112338 are needed, but at some point, it will become unreasonable to continue adding these to instcombine (some have said that we went over the cliff long ago!).

We do have -aggressive-instcombine for housing rare/large pattern-matching, so that is another option for minimizing compile-time impact. That pass is currently only enabled at -O3, but there is some work/motivation to enable that at -O2 as well ( cc @anton-afanasyev ).

That said, the logic seems right, but we need many more tests - at least 8 combinations for commutes/swaps? There are also several patterns where intermediate ops can have extra uses, and we allow the fold, so we should have positive tests for all of those cases.

We do have -aggressive-instcombine for housing rare/large pattern-matching, so that is another option for minimizing compile-time impact. That pass is currently only enabled at -O3, but there is some work/motivation to enable that at -O2 as well ( cc @anton-afanasyev ).

Yes, I'm going to start discussion on enabling AIC at -O2 after it stabilizes (there are several issues unveiled by AIC changes). But the main motivation is that AIC isn't heavier than IC for now. InstCombine iterates linear pattern-matching while AIC does DFS on small DAGs. Current AIC was splitted out from IC itself and contains only TruncInstCombine for now, processing trunc-dominated DAGs.

If we are talking about including boolean logic solver to AIC, the main question is its compile-time impact therefore. At present, IC and AIC are separated as "linear" and "non-linear", but "-O2" vs "-O3" criteria may be more appropriate.

If we are talking about including boolean logic solver to AIC, the main question is its compile-time impact therefore. At present, IC and AIC are separated as "linear" and "non-linear", but "-O2" vs "-O3" criteria may be more appropriate.

I was just considering the option of putting this pattern matching in AIC since it is bigger than usual (and seems likely to be extremely rare in code).
We do have semi-standard pattern matching in AIC outside of the TruncInstCombiner - it lives under "foldUnusualPatterns()".

A real logic solver should be its own pass if that would have practically no overlap with the pattern-matching of AIC/IC.

We do have semi-standard pattern matching in AIC outside of the TruncInstCombiner - it lives under "foldUnusualPatterns()".

Oh, sure, forget about it, it has only few folding patterns. As for this, I vote for moving such large boolean pattern to AIC, at least, if we do not need it at "O2" right now.

We do have semi-standard pattern matching in AIC outside of the TruncInstCombiner - it lives under "foldUnusualPatterns()".

Oh, sure, forget about it, it has only few folding patterns. As for this, I vote for moving such large boolean pattern to AIC, at least, if we do not need it at "O2" right now.

For proper boolean solver (which i also suggested having some time ago), sure, AIC sounds like the place,
but the pattern at hand doesn't look all that big actually, to warrant not being here, I think.

Let me give a little background on this. This is essentially ternary boolean logic similar to AVX512 instruction VPTERNLOG. I came across this function doing similar thing at a source level: https://forums.developer.nvidia.com/t/reverse-lut-for-lop3-lut/110651 as a simulation:

uint32_t lop3_fast (uint32_t a, uint32_t b, uint32_t c, uint8_t ttbl)
{
    uint32_t r = 0;
    if (ttbl & 0x01) r |= ~a & ~b & ~c;
    if (ttbl & 0x02) r |= ~a & ~b &  c;
    if (ttbl & 0x04) r |= ~a &  b & ~c;
    if (ttbl & 0x08) r |= ~a &  b &  c;
    if (ttbl & 0x10) r |=  a & ~b & ~c;
    if (ttbl & 0x20) r |=  a & ~b &  c;
    if (ttbl & 0x40) r |=  a &  b & ~c;
    if (ttbl & 0x80) r |=  a &  b &  c;
    return r;
}

I cannot say there is a real driving factor to optimize it other than the fact I was shocked by the code produced by llvm optimizer. Here are couple of most patalogical cases:

  1. case 15:
%9 = or i32 %7, %6
%10 = xor i32 %9, -1
%11 = xor i32 %6, -1
%12 = and i32 %7, %11
%13 = xor i32 %8, -1
%14 = and i32 %12, %13
%15 = or i32 %14, %10
%16 = and i32 %12, %8
%17 = or i32 %15, %16

After simplification it is just ~a.

  1. case 255:
%9 = or i32 %7, %6
%10 = xor i32 %9, -1
%11 = xor i32 %6, -1
%12 = and i32 %7, %11
%13 = xor i32 %8, -1
%14 = and i32 %12, %13
%15 = or i32 %14, %10
%16 = and i32 %12, %8
%17 = or i32 %15, %16
%18 = xor i32 %7, -1
%19 = and i32 %6, %18
%20 = and i32 %19, %13
%21 = or i32 %17, %20
%22 = and i32 %19, %8
%23 = or i32 %21, %22
%24 = and i32 %7, %6
%25 = and i32 %24, %13
%26 = or i32 %23, %25
%27 = and i32 %24, %8
%28 = or i32 %26, %27

After simplification this is just -1.

Out of 256 possible functions calculated by this code most can use simplifications and can be done in ~3 instructions in average. Yet llvm produces a whole bloat of code.

My assumption was that shall be possible to optimize a few subexpressions here to simplify all of that. The assumption is based on the fact that these are just combinations of very similar patterns. In part that assumption is correct. For example D112108 simplifies case 4 from the above function, but also helped ~10 of other patterns. The next two patterns helped 16 and 19 other expressions respectively. Although they did not help all of these expressions completely.

I guess until all of that done I will not know how many more patterns are needed to handle it. Likely not 100, but certainly more than 10. I have to admit it converges slower than I hoped. That said I am not so sure the direction of pattern matching it here is right too so any thoughts are more than welcome.

I see a benefit of a logical solver as an alternative to pattern matching. Although the problem I see here is how to select a dag to solve? A longest dag consisting of logical operations does not seem to be correct answer because of a multiple use subexpressions, but limiting all subexpressions to a single use does not sound right either. This exercise showed that some tolerance to not having as single use for a subexpression is needed to tackle this. For example this case (c & ~(a | b)) | (b & ~(a | c)) has 7 operations, simplified case ~a & (b ^ c) has only 3. The naive approach is to think we may tolerate 4 multiply used values here, which also creates a problem of choice. I am just afraid a general logical solver will turn into an n-p complete solution because of this.

That said, patterns like this are linear even though big. I have no particular preference to put it into IC or AIC, but given the current definition anything O(N) goes into IC, and something more expensive into AIC as far as I understand.

To summarize I am somewhat unsure if that makes sense to continue pattern matching like this at all, or where shall it go if that makes sense.

We do have semi-standard pattern matching in AIC outside of the TruncInstCombiner - it lives under "foldUnusualPatterns()".

Oh, sure, forget about it, it has only few folding patterns. As for this, I vote for moving such large boolean pattern to AIC, at least, if we do not need it at "O2" right now.

On the other hand being in IC and dealing with AND and OR operations as a root these patterns can benefit from running matchDeMorgansLaws() which is not available in the AIC.

Just tried gcc. It does well for case 15: https://compiler-explorer.com/z/s9edonK7c. Clang does not: https://compiler-explorer.com/z/E7f6c455d. But for the case 255 gcc is not really better: https://compiler-explorer.com/z/zTWPrb79j and clang: https://compiler-explorer.com/z/nd4s4Y6xP.

Tried to move it into AIC. It works in principle, although I am a little worried AIC is created before the ReassociatePass.

Thinking more about it moving these patterns into AIC is not really a good option. I moved it there and tests I wrote pass. But it will not work on the original code produced by the FE. These tests are already simplified by the IC, and AIC works before the first IC run, so it will miss these.

Thinking more about it moving these patterns into AIC is not really a good option. I moved it there and tests I wrote pass. But it will not work on the original code produced by the FE. These tests are already simplified by the IC, and AIC works before the first IC run, so it will miss these.

Yes, currently AIC is awkwardly placed in the pipeline. Thanks for checking it anyway and providing the motivating code.
Regular instcombine seems fine for this patch for now, but we should stamp out more regression tests to make sure the commutes and uses are working as expected.

rampitec updated this revision to Diff 382114.Oct 25 2021, 1:49 PM

Added more tests:

  • or_not_and_commute4 - missing commute test. Note, I believe the rest of the commutes are covered because LHS and RHS are simmetrical. It is only a matter of what variables to call A, B and C, the logic is the same.
  • or_not_and_extra_or_use1/2 - tests for mutiple uses of either of two OR opcodes. Combine is still allowed since it is still less code: 4 instructions vs 7. Tests for mutiple uses of either AND and NOT are already precommited.

Please pre-commit the new tests so we just see the diffs.

Added more tests:

  • or_not_and_commute4 - missing commute test. Note, I believe the rest of the commutes are covered because LHS and RHS are simmetrical. It is only a matter of what variables to call A, B and C, the logic is the same.

If LHS and RHS are symmetric, then do we really need the second "match(Op1" clause in the code?

  • or_not_and_extra_or_use1/2 - tests for mutiple uses of either of two OR opcodes. Combine is still allowed since it is still less code: 4 instructions vs 7. Tests for mutiple uses of either AND and NOT are already precommited.

I agree that we don't need one-use checks on everything. In fact, the match seems overspecified for that, so it might be limiting the optimization power in the motivating source.
The way I view this is that we are creating 3 instructions, so as long as any 2 of the original intermediate values has one use, then the transform is desirable (because we're eliminating those 2 intermediate values + replacing the final or and reducing use counts of other values along the way).

@lebedev.ri - do you have any suggestions for how to specify that constraint in the code?

llvm/test/Transforms/InstCombine/and-xor-or.ll
798

By making %c a binop, it stuck as operand 0 for %and1, but that also commuted this or. Make %a a binop as well, so that doesn't change under us?

I realize this is tedious, but I think there really are at least 8 potential commuted variants.

Added more tests:

  • or_not_and_commute4 - missing commute test. Note, I believe the rest of the commutes are covered because LHS and RHS are simmetrical. It is only a matter of what variables to call A, B and C, the logic is the same.

If LHS and RHS are symmetric, then do we really need the second "match(Op1" clause in the code?

  • or_not_and_extra_or_use1/2 - tests for mutiple uses of either of two OR opcodes. Combine is still allowed since it is still less code: 4 instructions vs 7. Tests for mutiple uses of either AND and NOT are already precommited.

I agree that we don't need one-use checks on everything. In fact, the match seems overspecified for that, so it might be limiting the optimization power in the motivating source.
The way I view this is that we are creating 3 instructions, so as long as any 2 of the original intermediate values has one use, then the transform is desirable (because we're eliminating those 2 intermediate values + replacing the final or and reducing use counts of other values along the way).

@lebedev.ri - do you have any suggestions for how to specify that constraint in the code?

There is no nice way to do that, we are missing some common infra.
What you want is to check that at least one of the hands of the outer or is single-use,
and either the other hand is one-use either, or the not operand of the single-use hand is also single-use.

Added more tests:

  • or_not_and_commute4 - missing commute test. Note, I believe the rest of the commutes are covered because LHS and RHS are simmetrical. It is only a matter of what variables to call A, B and C, the logic is the same.

If LHS and RHS are symmetric, then do we really need the second "match(Op1" clause in the code?

Yes, because in the match for Op0 we can have any order for OR operands. This commute is actually tested.

I agree that we don't need one-use checks on everything. In fact, the match seems overspecified for that, so it might be limiting the optimization power in the motivating source.
The way I view this is that we are creating 3 instructions, so as long as any 2 of the original intermediate values has one use, then the transform is desirable (because we're eliminating those 2 intermediate values + replacing the final or and reducing use counts of other values along the way).

@lebedev.ri - do you have any suggestions for how to specify that constraint in the code?

There is no nice way to do that, we are missing some common infra.
What you want is to check that at least one of the hands of the outer or is single-use,
and either the other hand is one-use either, or the not operand of the single-use hand is also single-use.

I can see an awkward way to do this, count a sum of (int)hasOneUse() expressions and compare to 7. Definitely do not like it.

rampitec updated this revision to Diff 382436.Oct 26 2021, 12:59 PM
rampitec marked an inline comment as done.

Added more commuted tests.

@lebedev.ri - do you have any suggestions for how to specify that constraint in the code?

There is no nice way to do that, we are missing some common infra.
What you want is to check that at least one of the hands of the outer or is single-use,
and either the other hand is one-use either, or the not operand of the single-use hand is also single-use.

I can see an awkward way to do this, count a sum of (int)hasOneUse() expressions and compare to 7. Definitely do not like it.

Potentially I can see a specialized version of match() taking 2 additional arguments: pointer to unsigned to return a number of multiply used values in the pattern matched and another unsigned argument to limit this and stop the match as soon as the limit is reached.
For this case for example first call will have all m_OneUse() removed. The first limit will be 4 and for the second call it will be 4 - NumMutliUsedValues returned by the first call.
Maybe not a most elegant interface though, and certainly a separate change if we even want it.

spatel accepted this revision.Oct 28 2021, 11:01 AM

The one-use checks are conservative, but not wrong, so LGTM. But please add a TODO comment above the match, so if someone does happen to return to this some day, they will know that the code didn't try to go all out to enable the transform.

This revision is now accepted and ready to land.Oct 28 2021, 11:01 AM
rampitec updated this revision to Diff 383109.Oct 28 2021, 11:53 AM

Added todo comment.

This revision was landed with ongoing or failed builds.Oct 28 2021, 11:54 AM
This revision was automatically updated to reflect the committed changes.
haowei added a subscriber: haowei.Oct 28 2021, 2:24 PM

This change breaks windows build https://buildkite.com/llvm-project/premerge-checks/builds/62772#b136d0fe-c1a2-4853-9993-34bd56934ab8 in Transforms/InstCombine/and-xor-or.ll test. Could you revert this change if it takes a bit time to fix it?

This change breaks windows build https://buildkite.com/llvm-project/premerge-checks/builds/62772#b136d0fe-c1a2-4853-9993-34bd56934ab8 in Transforms/InstCombine/and-xor-or.ll test. Could you revert this change if it takes a bit time to fix it?

That was fixed by f7f430c9136330545637aaddebf013da05c0f370

This change breaks windows build https://buildkite.com/llvm-project/premerge-checks/builds/62772#b136d0fe-c1a2-4853-9993-34bd56934ab8 in Transforms/InstCombine/and-xor-or.ll test. Could you revert this change if it takes a bit time to fix it?

That was fixed by f7f430c9136330545637aaddebf013da05c0f370

Confirmed it was fixed. Thank you!