This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Unfold scalar masked merge if profitable
ClosedPublic

Authored by lebedev.ri on Apr 17 2018, 11:52 AM.

Details

Summary

This is PR37104.

PR6773 will introduce an IR canonicalization that is likely bad for the end assembly.
Previously, andl+andn / andps+andnps / bic/bsl would be generated (see @out).
Now, they would no longer be generated (see @in).
So we need to make sure that they are still generated.

If the mask is a constant, right now I always unfold it.
Otherwise, I use the hasAndNot() TLI hook.

For now, only handle scalars.

https://rise4fun.com/Alive/bO6
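
For reference, here is a minimal sketch of what the unfold builds at the DAG level. This is not the patch's DAGCombiner::unfoldMaskedMerge() itself; it is only an illustration, and it assumes X, Y, the mask M, and the usual DAG, DL, and VT variables are already in scope inside a DAG combine:

// Sketch only: (xor (and (xor X, Y), M), Y) --> (or (and X, M), (and Y, (not M)))
SDValue NotM = DAG.getNOT(DL, M, VT);                   // ~M
SDValue MX   = DAG.getNode(ISD::AND, DL, VT, X, M);     // X & M
SDValue MY   = DAG.getNode(ISD::AND, DL, VT, Y, NotM);  // Y & ~M
SDValue R    = DAG.getNode(ISD::OR,  DL, VT, MX, MY);   // (X & M) | (Y & ~M)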


I *really* don't like the code I wrote in DAGCombiner::unfoldMaskedMerge().
It is super fragile. Is there something like the IR pattern matchers for this?

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Apr 17 2018, 11:52 AM
lebedev.ri edited the summary of this revision. (Show Details)
lebedev.ri added inline comments.Apr 17 2018, 3:20 PM
test/CodeGen/X86/unfold-masked-merge-scalar.ll
370–390

Hm, this didn't lower into andn, unlike the non-constant-mask variant above.
I'm guessing movabsq is interfering?

If the mask is a constant, right now I always unfold it.

Let me make sure I understand. The fold in question is:

%n0 = xor i4 %x, %y
%n1 = and i4 %n0, C1
%r  = xor i4 %n1, %y
=>
%mx = and i4 %x, C1
%my = and i4 %y, ~C1
%r = or i4 %mx, %my

If that's correct, we need to take a step back here. If the fold is universally good, then it can go in InstCombine, and there's no need to add code bloat to the DAG to handle the pattern unless something in the backend can create this pattern (seems unlikely).

But we need to take another step back before we add code bloat to InstCombine. Is there evidence that this pattern exists in source (bug report, test-suite, etc) and affects analysis/performance? If not, is it worth the cost of adding a matcher for the pattern? It's a simple matcher, so the expense bar is low...but if it never happens, do we care?

If the mask is a constant, right now I always unfold it.

Let me make sure I understand. The fold in question is:

%n0 = xor i4 %x, %y
%n1 = and i4 %n0, C1
%r  = xor i4 %n1, %y
=>
%mx = and i4 %x, C1
%my = and i4 %y, ~C1
%r = or i4 %mx, %my

Yes.

If that's correct, we need to take a step back here. If the fold is universally good, then it can go in InstCombine

Yeah, that is the question I'm having. I did look at the llvm-mca output.
Here is what llvm-mca says about that for -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75


Or is this a scheduling info problem?

and there's no need to add code bloat to the DAG to handle the pattern unless something in the backend can create this pattern (seems unlikely).

But we need to take another step back before we add code bloat to InstCombine. Is there evidence that this pattern exists in source (bug report, test-suite, etc) and affects analysis/performance? If not, is it worth the cost of adding a matcher for the pattern? It's a simple matcher, so the expense bar is low...but if it never happens, do we care?

Yeah, that is the question I'm having. I did look at the llvm-mca output.

Here is what llvm-mca says about that for -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75


Or is this a scheduling info problem?

Cool - a chance to poke at llvm-mca! (cc @andreadb and @courbet)

First thing I see is that it's harder to get the sequence we're after on x86 using the basic source premise:

int andandor(int x, int y)  {
  __asm volatile("# LLVM-MCA-BEGIN ands");
  int r = (x & 42) | (y & ~42);
  __asm volatile("# LLVM-MCA-END ands");
  return r;
}

int xorandxor(int x, int y) {
  __asm volatile("# LLVM-MCA-BEGIN xors");
  int r = ((x ^ y) & 42) ^ y;
  __asm volatile("# LLVM-MCA-END xors");
  return r;
}

...because the input param register doesn't match the output result register. We'd have to hack that in asm...or put the code in a loop, but subtract the loop overhead somehow. Things work/look alright to me other than that.

I don't know AArch64 that well, but your example is a special case that may be going wrong. I.e., if we have a bit-string constant like 0xff000000, you could get:
bfxil w0, w1, #0, #24
...which should certainly be better than:
eor w8, w1, w0
and w8, w8, #0xff000000
eor w0, w8, w1

AArch64 chose to convert to shift + possibly more expensive bfi for the 0x00ffff00 constant though. That's not something that we can account for in generic DAGCombiner, so I'd categorize that as an AArch64-specific bug (either don't use bfi there or fix the scheduling model or fix this up in MI somehow).

Yeah, that is the question I'm having. I did look at the llvm-mca output.

Here is what llvm-mca says about that for -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75


Or is this a scheduling info problem?

Cool - a chance to poke at llvm-mca! (cc @andreadb and @courbet)

First thing I see is that it's harder to get the sequence we're after on x86 using the basic source premise:

int andandor(int x, int y)  {
  __asm volatile("# LLVM-MCA-BEGIN ands");
  int r = (x & 42) | (y & ~42);
  __asm volatile("# LLVM-MCA-END ands");
  return r;
}

int xorandxor(int x, int y) {
  __asm volatile("# LLVM-MCA-BEGIN xors");
  int r = ((x ^ y) & 42) ^ y;
  __asm volatile("# LLVM-MCA-END xors");
  return r;
}

...because the input param register doesn't match the output result register. We'd have to hack that in asm...or put the code in a loop, but subtract the loop overhead somehow. Things work/look alright to me other than that.

I simply stored the LHS and RHS of the // CHECK lines from AArch64's @in32_constmask in two local files,
ran llvm-mca on each of them, and diffed the output; no clang was involved.

I don't know AArch64 that well, but your example is a special case that may be going wrong. I.e., if we have a bit-string constant like 0xff000000, you could get:
bfxil w0, w1, #0, #24
...which should certainly be better than:
eor w8, w1, w0
and w8, w8, #0xff000000
eor w0, w8, w1

AArch64 chose to convert to shift + possibly more expensive bfi for the 0x00ffff00 constant though. That's not something that we can account for in generic DAGCombiner, so I'd categorize that as an AArch64-specific bug (either don't use bfi there or fix the scheduling model or fix this up in MI somehow).

OK, then let's assume until proven otherwise that if the mask is a constant, the unfolded variant is always better.
I'll unfold it in instcombine (it seems D45664 will already match the masked-merge pattern, so it would not add much code).
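
For illustration only, a hypothetical sketch of matching the constant-mask form in InstCombine with llvm/IR/PatternMatch.h; this is not the actual D45664 code, and the helper name is made up:

#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// Hypothetical helper: does `I` compute %r = xor (and (xor %x, %y), C1), %y ?
// (Handling commuted variants would additionally need m_c_Xor / m_c_And.)
static bool matchConstMaskedMerge(Instruction &I, Value *&X, Value *&Y,
                                  Constant *&C1) {
  return match(&I, m_Xor(m_And(m_Xor(m_Value(X), m_Value(Y)), m_Constant(C1)),
                         m_Deferred(Y)));
}
// On a successful match, the unfold would build (or (and X, C1), (and Y, ~C1)).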

there's no need to add code bloat to the DAG to handle the pattern unless something in the backend can create this pattern (seems unlikely).

I don't think it will be possible to check that until after the instcombine part has landed, so OK, at least for now I will stop unfolding [constant mask] in dagcombine.

While we're at it, any hints regarding pattern matchers for this code?

I don't think it will be possible to check that until after the instcombine part has landed, so OK, at least for now I will stop unfolding [constant mask] in dagcombine.

While we're at it, any hints regarding pattern matchers for this code?

Unfortunately, DAG nodes don't have any match() infrastructure equivalent to IR's that I know of. The commuted variants are what complicate this? Usually, I think we just std::swap() our way to the answer here in the DAG.
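
As a hypothetical illustration of that std::swap() idiom (not code from this patch; it assumes N is the commutative node being combined, inside a visit function returning SDValue):

SDValue N0 = N->getOperand(0), N1 = N->getOperand(1);
// Canonicalize: if only N1 is the AND we are looking for, swap it into N0,
// so the rest of the combine only has to reason about one operand order.
if (N0.getOpcode() != ISD::AND && N1.getOpcode() == ISD::AND)
  std::swap(N0, N1);
if (N0.getOpcode() != ISD::AND)
  return SDValue();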

I don't think it will be possible to check that until after the instcombine part has landed, so OK, at least for now I will stop unfolding [constant mask] in dagcombine.

While we're at it, any hints regarding pattern matchers for this code?

Unfortunately, DAG nodes don't have any match() infrastructure equivalent to IR's that I know of.

Boo :(

The commuted variants are what complicate this?

Yes.

Usually, I think we just std::swap() our way to the answer here in the DAG.

Hmm, while I can see that working in many simple cases, I'm not sure that will be enough here.

Currently llvm-mca doesn't know how to resolve variant scheduling classes.
This problem mostly affects the ARM target.
This has been reported here: https://bugs.llvm.org/show_bug.cgi?id=36672

The number of micro opcodes that you see in the llvm-mca output is the
default (invalid) number of micro opcodes for instructions associated with
a sched-variant class.

I plan to send a patch to address (most of) the issues related to the
presence of variant scheduling classes. However, keep in mind that ARM
sched-predicates heavily rely on TII hooks. Those are going to cause
problems for tools like llvm-mca (i.e., there is no easy way to "fix" them).

At the moment, llvm-mca doesn't know how to analyze these two instructions,
since both are associated with a variant scheduling class:

 eor     w8, w0, w1
 mov     w0, w1

- {F5972356}
- Rebased on top of revised tests
- Stop handling cases with a constant mask; instcombine should unfold them.

NFC, rebased on top of the rebased tests, with CFI noise dropped.

NFC, rebased.

spatel added inline comments.Apr 23 2018, 9:00 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5381–5421

After stepping through more of your tests, I see why this is ugly.

We don't have to capture the intermediate values if the hasOneUse() checks are in the lambda(s) though. What do you think of this version:

// There are 3 commutable operators in the pattern, so we have to deal with
// 8 possible variants of the basic pattern.
SDValue X, Y, M;
auto matchAndXor = [&X,&Y,&M](SDValue And, unsigned XorIdx, SDValue Other) {
  if (And.getOpcode() != ISD::AND || !And.hasOneUse())
    return false;
  if (And.getOperand(XorIdx).getOpcode() != ISD::XOR ||
      !And.getOperand(XorIdx).hasOneUse())
    return false;
  SDValue Xor0 = And.getOperand(XorIdx).getOperand(0);
  SDValue Xor1 = And.getOperand(XorIdx).getOperand(1);
  if (Other == Xor0) std::swap(Xor0, Xor1);
  if (Other != Xor1) return false;
  X = Xor0;
  Y = Xor1;
  M = And.getOperand(1);
  return true;
};
if (!matchAndXor(A, 0, B) && !matchAndXor(A, 1, B) &&
    !matchAndXor(B, 0, A) && !matchAndXor(B, 1, A))
  return SDValue();
spatel added inline comments.Apr 23 2018, 9:10 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5381–5421

Oops - forgot to swap the M capture:

M = And.getOperand(1);

should be:

M = And.getOperand(XorIdx ? 0 : 1);
lebedev.ri marked 2 inline comments as done.

Update with @spatel's suggested matcher.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5381–5421

Well, the test changes are the same, and the code looks a bit shorter,
though it is much harder to read.

spatel accepted this revision.Apr 23 2018, 12:41 PM

LGTM.

This revision is now accepted and ready to land.Apr 23 2018, 12:41 PM

LGTM.

Thank you for the review!

This revision was automatically updated to reflect the committed changes.

It seems this has uncovered something.
It does not look like a miscompilation to me (FIXME or is it?), but the produced code is certainly worse:

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+bmi | FileCheck %s
 
 define float @test_andnotps_scalar(float %a0, float %a1, float* %a2) {
 ; CHECK-LABEL: test_andnotps_scalar:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    movd %xmm0, %eax
-; CHECK-NEXT:    movd %xmm1, %ecx
-; CHECK-NEXT:    andnl %ecx, %eax, %eax
-; CHECK-NEXT:    movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
-; CHECK-NEXT:    notl %eax
-; CHECK-NEXT:    movd %eax, %xmm0
+; CHECK-NEXT:    movd %xmm1, %eax
+; CHECK-NEXT:    movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
 ; CHECK-NEXT:    pand %xmm1, %xmm0
+; CHECK-NEXT:    movd %xmm0, %ecx
+; CHECK-NEXT:    notl %eax
+; CHECK-NEXT:    orl %ecx, %eax
+; CHECK-NEXT:    movd %eax, %xmm0
+; CHECK-NEXT:    pand %xmm2, %xmm0
 ; CHECK-NEXT:    retq
   %tmp = bitcast float %a0 to i32
   %tmp1 = bitcast float %a1 to i32
   %tmp2 = xor i32 %tmp, -1
   %tmp3 = and i32 %tmp2, %tmp1
   %tmp4 = load float, float* %a2, align 16
   %tmp5 = bitcast float %tmp4 to i32
   %tmp6 = xor i32 %tmp3, -1
   %tmp7 = and i32 %tmp5, %tmp6
   %tmp8 = bitcast i32 %tmp7 to float
   ret float %tmp8
 }

We lost andnl.
Discovered accidentally because the same happened to @test_andnotps/@test_andnotpd in test/CodeGen/X86/*-schedule.ll (they are no longer lowered to andnps/andnpd).

And it happened because both xors have the same [constant] operand: -1.