This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Canonicalize masked merge; optimize (a & b) | (~a & c)
AbandonedPublic

Authored by MatzeB on Sep 2 2021, 2:03 PM.

Details

Summary

Canonicalize the "masked merge" operation to the shorter xor-variant:

(a & b) | (~a & c) --> (((b ^ c) & a) ^ c)

Diff Detail

Event Timeline

MatzeB created this revision.Sep 2 2021, 2:03 PM
MatzeB requested review of this revision.Sep 2 2021, 2:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 2 2021, 2:03 PM
MatzeB added a comment.EditedSep 2 2021, 2:08 PM

And for the record: It's easy enough to prove this correct with value tables and alive. Though I can't currently figure out a way to deduce this formula myself so there may be generalizations or underlying principles that I am missing which would produce more instcombine patterns...

MatzeB added a comment.EditedSep 2 2021, 6:08 PM

Seems this is called masked merge and there is prior discussion/code:

Was this combine here missed or is it somehow not good? It does make things slightly faster in my code...

MatzeB added a subscriber: echristo.Sep 2 2021, 6:32 PM

So @lebedev.ri if I did my history research correctly, then the pattern in my patch here has also been part of https://reviews.llvm.org/D46814. That diff however was reverted in the meantime (see https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180528/556578.html @echristo) because of undef behavior? My version here adds a freeze instruction (thanks to alive2.llvm.org for catching this or I would have missed that subtlety too!).

So maybe it this is ok now?

This is an alternative bit select pattern - do we know if many backends match this pattern as well as the more common (a & b) | (~a & c) to their bit select instructions (bsl/pcmov/etc.)?

The xor pattern is better on SSE for cases where b and c were both constants, plus its entirely commutable.

MatzeB added a comment.EditedSep 10 2021, 4:15 PM

do we know if many backends match this pattern as well as the more common (a & b) | (~a & c) to their bit select instructions (bsl/pcmov/etc.)?

I tried llc with the following test:

define i32 @variant0(i32, i32, i32) {
  %4 = and i32 %1, %0
  %5 = xor i32 %0, -1
  %6 = and i32 %5, %2
  %7 = or i32 %6, %4
  ret i32 %7
}

define i32 @variant1(i32, i32, i32) {
  %4 = xor i32 %2, %1
  %5 = and i32 %4, %0
  %6 = xor i32 %5, %2
  ret i32 %6
}

(generic) X86: variant0 produces "mov; and; not; and; or"; variant1 produces "mov; xor; and; xor"
X86 -mattr=+bmi: variant0 and variant1 produce "and, andn, orl"
AArch64: variant0 and variant1 produce "bic, and, orr"
ARM: variant0 produces "and; bic; orr"; variant1 produces "eor; and; eor"
ppc32: variant0/variant1 produce the same code: "and; andc; or"
ppc64: variant0 produces "and; xori; xoris; and; or"?!? variant1 produces "andc; and; or" which is clearly better
riscv32/riscv64: variant0 produces "and; not; and; or"; variant1 produces "xor; and; xor"
s390x: variant0 produces "nr; xilf; nr; or"; variant1 produces "xr; nr; xr"

-> For targets without an "and-not" instruction variant1 seems better and this patch will help produce better code there. Most targets with an "and-not" use it either way, except for ARM which fails to recognize the xor variant and PPC64 which fails to recognize the and-not variant...

Also a goal for LLVM-IR is normalizing the program and this patch will help with that.

It looks like have a reverse fold to this in DAG: DAGCombiner::unfoldMaskedMerge

Also, InstCombine has a fold for: (X&~Z)|(Y&Z) -> select(X,Y) but it doesn't handle the XOR variant: https://simd.godbolt.org/z/sWM3KxbrE

MatzeB added a comment.EditedSep 13 2021, 5:50 PM

It looks like have a reverse fold to this in DAG: DAGCombiner::unfoldMaskedMerge

Noticed that too. It depends on TargetLower::hasAndNot() to return true.

I tried implementing hasAndNot() in the ARM and SystemZ backend which lacked this function. It does eliminate problems with this patch, unfortunately there are ripple effects in other areas and I am not sure how deep I want to dig (patterns like select(x < 0, 0, x) are DAGCombined into (x & ~(x >>s 31)) making the targets fail to select specialized saturation instructions...). Though I could probably refactor things to make it possible to independently unfold masked merge separately from the and-not zero saturation.

MatzeB updated this revision to Diff 372789.Sep 15 2021, 1:40 PM
MatzeB edited the summary of this revision. (Show Details)
  • Dropped masked-merge commutative tests which no longer apply.
  • Put on top of patch improving SystemZ codegen to avoid codegen regression there.
MatzeB retitled this revision from [InstCombine] Optimize (a & b) | (~a & c) to [InstCombine] Optimize (a & b) | (~a & c) (canonicalize masked merge).Sep 15 2021, 1:41 PM
MatzeB retitled this revision from [InstCombine] Optimize (a & b) | (~a & c) (canonicalize masked merge) to [InstCombine] Canonicalize masked merge; optimize (a & b) | (~a & c).

I filed https://bugs.llvm.org/show_bug.cgi?id=51876 about the different ARM codegen. I believe we can land this change anyway, since ARM codegen produces the same amount of instructions with any masked-merge pattern.

nikic added a comment.Sep 15 2021, 2:08 PM

I don't think this makes sense as an InstCombine (middle end) transform. The resulting expression is less analyzable both in that freeze is a complete analysis blocker, and xor is generally less analyzable than and/or. This looks like something we should be doing in the backend instead.

MatzeB added a comment.EditedSep 15 2021, 6:23 PM

I don't think this makes sense as an InstCombine (middle end) transform. The resulting expression is less analyzable both in that freeze is a complete analysis blocker, and xor is generally less analyzable than and/or. This looks like something we should be doing in the backend instead.

It's easy enough to move this pattern to SelectionDAG.

Though on a general level, my philosophy was always that if you have 2 equivalent expressions then it's best to canonicalize towards one of them in InstCombine. Not sure if freeze is a good enough reason to not canonicalize...

What about canonicalizing to (a & b) | (~a & c)?

What about canonicalizing to (a & b) | (~a & c)?

That would also work well. But it will require a freeze just the same as we go from one use of a to two.

And FWIW: My gut feeling is that a canonicalization is worth having an extra freeze around...

Abandoning in favor of a SelectionDAG solution in D112754

MatzeB abandoned this revision.Oct 28 2021, 1:50 PM