This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Masked merge: if 'B' is constant, de-canonicalize the pattern (invert the mask).
AbandonedPublic

Authored by lebedev.ri on Apr 24 2018, 3:48 PM.

Details

Summary

Discovered accidentally while working on the vector part, because @test_andnotps/@test_andnotpd
in test/CodeGen/X86/*-schedule.ll broke; they were no longer lowered to andnps/andnpd.

Given the canonical pattern:

|     A     |  |B|
((x ^ y) & m) ^ y
 |  D  |

We don't want to handle xors whose second operand is a constant,
because then andn does not get selected.

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision. Apr 24 2018, 3:48 PM

Rebased on top of a better test set.

I have stared at the test change a bit more, and unless there are some other patterns I did not analyze, I do think this is the way we want to handle this.

test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
626–627

This improves.

654–655

This degrades, but instcombine will canonicalize it to @in_constant_mone_vary, and that one is ok.
So this is ok too.

655

Looked at this in llvm-mca; no clear winner.
The change decreases instruction count and IPC, but the cycle count does not change.
So I guess it's ok?

656–657

As per mca this is an unimportant change, but again, instcombine will canonicalize it to @in_constant_42_vary, which is ok.
So this one appears ok too.

It's not clear to me if this is about xor with a constant in general or xor with -1 specifically. Is the motivating pattern/problem not recognizing DeMorgan's Laws folds in the DAG?

; ~(~x & y) --> x | ~y
%notx = xor i8 %x, -1
%and = and i8 %notx, %y
%r = xor i8 %and, -1
=>
%notm = xor i8 %y, -1
%r = or i8 %x, %notm

That seems like a good fold to have in the DAG given that we're rearranging bitwise logic ops to better match target features. Should we just add that (and the 'or' --> 'and' twin)?

It's not clear to me if this is about xor with a constant in general or xor with -1 specifically.

I thought in general; note the tests with constant 42.

Is the motivating pattern/problem not recognizing DeMorgan's Laws folds in the DAG?

The motivating case is specified in the differential's description.

; ~(~x & y) --> x | ~y
%notx = xor i8 %x, -1
%and = and i8 %notx, %y
%r = xor i8 %and, -1
=>
%notm = xor i8 %y, -1
%r = or i8 %x, %notm

That seems like a good fold to have in the DAG given that we're rearranging bitwise logic ops to better match target features. Should we just add that (and the 'or' --> 'and' twin)?

Hmm, not sure, let's see..

lebedev.ri added inline comments. Apr 26 2018, 7:29 AM
test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
655

On top of D46073, this @in_constant_varx_42 pattern (i.e. %y being constant) is the only remaining issue.

# *** IR Dump After Machine InstCombiner ***:
# Machine code for function in_constant_varx_42: IsSSA, TracksLiveness
Function Live Ins: $edi in %0, $edx in %2

bb.0 (%ir-block.0):
  liveins: $edi, $edx
  %2:gr32 = COPY $edx
  %0:gr32 = COPY $edi
  %3:gr32 = AND32rr %0:gr32, %2:gr32, implicit-def dead $eflags
  %4:gr32 = NOT32r %2:gr32
  %5:gr32 = AND32ri8 %4:gr32, 42, implicit-def dead $eflags
  %6:gr32 = OR32rr %3:gr32, killed %5:gr32, implicit-def dead $eflags
  $eax = COPY %6:gr32
  RET 0, $eax

# End machine code for function in_constant_varx_42.

This *seems* ok (as per mca) on AArch64, but I'm not so sure about x86.

lebedev.ri added inline comments. Apr 26 2018, 8:32 AM
test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
655

Right, in this case I should not only skip the unfold, but also de-canonicalize the mask.

lebedev.ri marked 3 inline comments as done.
lebedev.ri retitled this revision from [DAGCombiner] DON'T unfold scalar masked merge if 'B' is constant to [DAGCombiner] Masked merge: if 'B' is constant, de-canonicalize the pattern (invert the mask)..
lebedev.ri edited the summary of this revision. (Show Details)

Assuming we'll manage to get D46073, this should do it.

After some thought, and staring at the MCA output, I believe this should come before the De Morgan folds (D46073, should that ever land),
so I rebased this change not to depend on that differential.

Some considerations for znver1 and this test IR:

define i32 @in_constant_varx_42(i32 %x, i32 %y, i32 %mask) {
  %n0 = xor i32 %x, 42 ; %x
  %n1 = and i32 %n0, %mask
  %r = xor i32 %n1, 42
  ret i32 %r
}

Difference between not unfolding that pattern vs. svn - instruction count and IPC increased

Difference between not unfolding that pattern vs. this differential - Total Cycles halved, IPC doubled

Difference between unfolding that pattern in svn vs. this differential - Instruction count decreased back to not unfolded count, cycle count halved, IPC increased.

spatel added inline comments. May 2 2018, 11:10 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

How does this happen? Isn't that a miscompile?

lebedev.ri added inline comments. May 2 2018, 12:01 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

Hm, at first I thought it was indeed (https://reviews.llvm.org/D45733#1077183), but now I do not think so.
https://godbolt.org/g/L4hDjW
^ so neither of our outputs is fully optimized.
But if I manually transform that assembly to C, the end result tells me that DAGCombine/ARM isel is simply missing some optimizations.
I could be wrong, of course.

spatel added inline comments. May 2 2018, 1:32 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

Ok, I was seeing an extra 'not' in there somewhere, so no miscompile.

And the conclusion is that we don't care about this diff because it's already sub-optimal and instcombine should have folded it anyway.

That raises the question of why we are testing this in the first place, though. Add a comment to explain that, or just delete?

400–402

This is a real regression, or am I seeing things that aren't there?

test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
626–627

But as with AArch, we don't care because instcombine would fold this?

lebedev.ri added inline comments. May 2 2018, 1:49 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

I'm somewhat sure that this is the scalar version of those tests that are failing in D46073
(and when we unfold vector masked merge), so I think it's best to keep these tests.

400–402

We replaced two instructions with two other instructions.
Unless I'm using a bad -mcpu (-mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75; is there a better choice?),
this does not seem to matter in practice.
Or I'm simply looking at llvm-mca wrong :)

spatel added inline comments. May 2 2018, 2:19 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

Correct - 2 instructions change.
But the whole point of the masked merge exercise was to maximize the throughput depending on the target, right? The code with and/andn+or has a shorter critical path than the dependent chain of xor+and+xor.

So I think llvm-mca is lying...at least for that CPU model.
If we plug these in with -mcpu=kryo, we get:
IPC: 1.32 for the 'eor' chain
IPC: 1.96 for the 'bic' chain

Is the problem that x86 can't form 'andn' with an immediate? Can we fix its override of hasAndNot to account for that? Or is the problem that we should be ignoring 'not' ops as candidates for transforming in this function? Or both?

lebedev.ri added inline comments.
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

But the whole point of the masked merge exercise was to maximize the throughput depending on the target, right?

Yes, absolutely.

So I think llvm-mca is lying...at least for that CPU model.
If we plug these in with -mcpu=kryo, we get:
IPC: 1.32 for the 'eor' chain
IPC: 1.96 for the 'bic' chain

Ok, thank you, that makes more sense.
It would be nice if llvm-mca's docs contained a list of 'good' CPU models for which it is known not to lie. (cc @andreadb)

Is the problem that x86 can't form 'andn' with an immediate?

Yes, that is the motivational case.

Can we fix its override of hasAndNot to account for that?

Hmm, actually, maybe we can...
Looking at the docs, it is already specified that it takes the value, not the mask-to-be-inverted.

Or is the problem that we should be ignoring 'not' ops as candidates for transforming in this function? Or both?

I don't think I'm able to answer that. Instcombine should certainly handle that, yes.

spatel added a comment. May 2 2018, 2:47 PM
Or is the problem that we should be ignoring 'not' ops as candidates for transforming in this function? Or both?

I don't think I'm able to answer that. Instcombine should certainly handle that, yes.

I may still not be seeing clearly, but I think this is the real problem - we should just bail out if the 'xor' is truly a 'not'.
Nothing good is going to come out of us trying to improve on that here.

The 'andn' with constant for x86 is a small concern. It might be a win or not, but it's probably not going to make a big difference either way?

lebedev.ri updated this revision to Diff 145020. May 3 2018, 7:46 AM
lebedev.ri marked 9 inline comments as done.
  • Don't touch not.
  • Update X86TargetLowering::hasAndNot() with the check for immediates.

The last change affects the transform @spatel added in D27489 / rL289738,
for which X86 test coverage was missing.
But after I added it and looked at the changes in MCA, I'm again confused.


I'd say this regression is an improvement, since IPC increased?

andreadb added inline comments. May 4 2018, 6:22 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

llvm-mca is not lying.
The cortex-a75 model describes the latency of eor/bic/orr using variant scheduling classes.
llvm-mca doesn't know how to analyze variant scheduling classes, so you should have seen one or more warnings generated by the tool.

If by a 'good' model you mean one that doesn't use variant scheduling classes, then you only need to worry about the cases where the above-mentioned warnings are generated.

I am currently working hard on a patch to add support for variant scheduling classes. It is quite tricky, but I am confident that I will have a patch ready (hopefully) next week.

Cheers,
Andrea

lebedev.ri added inline comments. May 4 2018, 6:28 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

I've come up with this script to do these comparisons


I'm guessing I haven't noticed any warnings because I only redirect stdout, which is good to know.

spatel added inline comments. May 4 2018, 6:39 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

Aha - thanks for clearing that up. I missed the warnings because they were at the top of the output, and I didn't scroll that far back up. :)

$ llvm-mca -mtriple=aarch64 -mcpu=cortex-a75  eor.s 
warning: don't know how to model variant opcodes.
note: assume 1 micro opcode.

A potential usability improvement would be to make warnings like that one louder in some way (repeat it at the bottom, put asterisks in the stats?). Just a thought...now that I know, I'll definitely look harder at the whole output.

lebedev.ri added inline comments. May 4 2018, 6:42 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

Or maybe duplicate them to stdout too, not just output them to stderr?

andreadb added inline comments. May 4 2018, 6:43 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

I'll see what I can do to improve it. I might make the "warning:" red :-).

Alternatively I could add some sort of -Werror equivalent mode where the warning is promoted to a fatal error. Something like that...

As a side note: I mentioned this issue once in reply to D45733. Maybe that comment was lost in the noise.

I want to look at that x86 timing difference in more detail, but let me ask first: can we split this patch up and look at the changes independently?
I think there are 3 parts:

  1. Ignore 'not'.
  2. Change x86 hasAndNot().
  3. Improve matching for AndNot with constant.

I want to look at that x86 timing difference in more detail, but let me ask first: can we split this patch up and look at the changes independently?
I think there are 3 parts:

  1. Ignore 'not'.
  2. Change x86 hasAndNot().
  3. Improve matching for AndNot with constant.

Yes, I think I can split it into three; will post tomorrow.

lebedev.ri abandoned this revision. May 5 2018, 4:29 AM

I want to look at that x86 timing difference in more detail, but let me ask first: can we split this patch up and look at the changes independently?
I think there are 3 parts:

  1. Ignore 'not'.
  2. Change x86 hasAndNot().
  3. Improve matching for AndNot with constant.

Yes, I think I can split it into three; will post tomorrow.

Done: D46492, D46493, D46494.