This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Masked merge: if 'B' is constant, de-canonicalize the pattern (invert the mask).
AbandonedPublic

Authored by lebedev.ri on Apr 24 2018, 3:48 PM.

Details

Summary

Discovered accidentally while working on the vector part, because @test_andnotps/@test_andnotpd
in test/CodeGen/X86/*-schedule.ll broke; they were no longer lowered to andnps/andnpd.

Given the canonical pattern:

|     A     |  |B|
((x ^ y) & m) ^ y
 |  D  |

We don't want to handle xors whose second operand is a constant,
because then andn does not get selected.

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision. Apr 24 2018, 3:48 PM

Rebased on top of a better test set.

I have stared at the test change a bit more, and unless there are some other patterns I did not analyze, I do think this is the way we want to handle this.

test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
626–627

This improves.

654–655

This degrades, but instcombine will canonicalize it to @in_constant_mone_vary, and that one is ok.
So this is ok too.

655

Looked at this in llvm-mca; no clear winner.
The change decreases instruction count and IPC, but the cycle count does not change.
So I guess it's ok?

656–657

As per mca this is an unimportant change, but again, instcombine will canonicalize it to @in_constant_42_vary, which is ok.
So this one appears ok too.

It's not clear to me if this is about xor with a constant in general or xor with -1 specifically. Is the motivating pattern/problem not recognizing DeMorgan's Laws folds in the DAG?

; ~(~x & y) --> x | ~y
%notx = xor i8 %x, -1
%and = and i8 %notx, %y
%r = xor i8 %and, -1
=>
%notm = xor i8 %y, -1
%r = or i8 %x, %notm

That seems like a good fold to have in the DAG given that we're rearranging bitwise logic ops to better match target features. Should we just add that (and the 'or' --> 'and' twin)?

It's not clear to me if this is about xor with a constant in general or xor with -1 specifically.

I thought in general; note the tests with constant 42.

Is the motivating pattern/problem not recognizing DeMorgan's Laws folds in the DAG?

The motivating case is specified in the differential's description.

; ~(~x & y) --> x | ~y
%notx = xor i8 %x, -1
%and = and i8 %notx, %y
%r = xor i8 %and, -1
=>
%notm = xor i8 %y, -1
%r = or i8 %x, %notm

That seems like a good fold to have in the DAG given that we're rearranging bitwise logic ops to better match target features. Should we just add that (and the 'or' --> 'and' twin)?

Hmm, not sure, let's see..

lebedev.ri added inline comments. Apr 26 2018, 7:29 AM
test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
655

On top of D46073, this @in_constant_varx_42 pattern (i.e. %y being constant) is the only remaining issue.

# *** IR Dump After Machine InstCombiner ***:
# Machine code for function in_constant_varx_42: IsSSA, TracksLiveness
Function Live Ins: $edi in %0, $edx in %2

bb.0 (%ir-block.0):
  liveins: $edi, $edx
  %2:gr32 = COPY $edx
  %0:gr32 = COPY $edi
  %3:gr32 = AND32rr %0:gr32, %2:gr32, implicit-def dead $eflags
  %4:gr32 = NOT32r %2:gr32
  %5:gr32 = AND32ri8 %4:gr32, 42, implicit-def dead $eflags
  %6:gr32 = OR32rr %3:gr32, killed %5:gr32, implicit-def dead $eflags
  $eax = COPY %6:gr32
  RET 0, $eax

# End machine code for function in_constant_varx_42.

This *seems* ok (as per mca) on AArch64, but I'm not so sure about x86.

lebedev.ri added inline comments. Apr 26 2018, 8:32 AM
test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
655

Right, in this case I should not only skip the unfold, but also de-canonicalize the mask.

lebedev.ri marked 3 inline comments as done.
lebedev.ri retitled this revision from [DAGCombiner] DON'T unfold scalar masked merge if 'B' is constant to [DAGCombiner] Masked merge: if 'B' is constant, de-canonicalize the pattern (invert the mask)..
lebedev.ri edited the summary of this revision. (Show Details)

Assuming we'll manage to get D46073, this should do it.

After some thought, and staring at the MCA output, I believe this should come before the De Morgan folds (D46073, should that ever land),
so I rebased this change not to depend on that differential.

Some considerations for znver1 and this test IR:

define i32 @in_constant_varx_42(i32 %x, i32 %y, i32 %mask) {
  %n0 = xor i32 %x, 42 ; %x
  %n1 = and i32 %n0, %mask
  %r = xor i32 %n1, 42
  ret i32 %r
}

Difference between not unfolding that pattern vs. svn - instruction count and IPC increased

Difference between not unfolding that pattern vs. this differential - Total Cycles halved, IPC doubled

Difference between unfolding that pattern in svn vs. this differential - Instruction count decreased back to not unfolded count, cycle count halved, IPC increased.

spatel added inline comments. May 2 2018, 11:10 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

How does this happen? Isn't that a miscompile?

lebedev.ri added inline comments. May 2 2018, 12:01 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

Hm, at first I thought it was indeed (https://reviews.llvm.org/D45733#1077183), but now I do not think so.
https://godbolt.org/g/L4hDjW
^ so neither of our outputs is fully optimized.
But if I manually transform that assembly to C, the end result tells me that DAGCombine/ARM isel is simply missing some optimizations.
I could be wrong, of course.

spatel added inline comments. May 2 2018, 1:32 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

Ok, I was seeing an extra 'not' in there somewhere, so no miscompile.

And the conclusion is that we don't care about this diff because it's already sub-optimal and instcombine should have folded it anyway.

That raises the question of why we are testing this in the first place, though. Add a comment to explain that, or just delete?

400–402

This is a real regression, or am I seeing things that aren't there?

test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
626–627

But as with AArch, we don't care because instcombine would fold this?

lebedev.ri added inline comments. May 2 2018, 1:49 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
337–338

I'm somewhat sure that this is the scalar version of those tests that are failing in D46073
(and when we unfold vector masked merge), so I think it's best to keep these tests.

400–402

We replaced two instructions with two other instructions.
Unless I'm using a bad -mcpu (-mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75; is there a better choice?),
this does not seem to matter in practice.
Or I'm simply looking at llvm-mca wrong :)

spatel added inline comments. May 2 2018, 2:19 PM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

Correct - 2 instructions change.
But the whole point of the masked merge exercise was to maximize the throughput depending on the target, right? The code with and/andn+or has a shorter critical path than the dependent chain of xor+and+xor.

So I think llvm-mca is lying...at least for that CPU model.
If we plug these in with -mcpu=kryo, we get:
IPC: 1.32 for the 'eor' chain
IPC: 1.96 for the 'bic' chain

Is the problem that x86 can't form 'andn' with an immediate? Can we fix its override of hasAndNot to account for that? Or is the problem that we should be ignoring 'not' ops as candidates for transforming in this function? Or both?

lebedev.ri added inline comments.
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

But the whole point of the masked merge exercise was to maximize the throughput depending on the target, right?

Yes, absolutely.

So I think llvm-mca is lying...at least for that CPU model.
If we plug these in with -mcpu=kryo, we get:
IPC: 1.32 for the 'eor' chain
IPC: 1.96 for the 'bic' chain

Ok, thank you, that makes more sense.
It would be nice if llvm-mca's docs contained a list of 'good' CPU models for which it is known not to lie. (cc @andreadb)

Is the problem that x86 can't form 'andn' with an immediate?

Yes, that is the motivational case.

Can we fix its override of hasAndNot to account for that?

Hmm, actually, maybe we can...
Looking at the docs, it is already specified that it takes the value, not the mask-to-be-inverted.

Or is the problem that we should be ignoring 'not' ops as candidates for transforming in this function? Or both?

I don't think I'm able to answer that. Instcombine should certainly handle that, yes.

spatel added a comment. May 2 2018, 2:47 PM
Or is the problem that we should be ignoring 'not' ops as candidates for transforming in this function? Or both?

I don't think I'm able to answer that. Instcombine should certainly handle that, yes.

I may still not be seeing clearly, but I think this is the real problem - we should just bail out if the 'xor' is truly a 'not'.
Nothing good is going to come out of us trying to improve on that here.

The 'andn' with constant for x86 is a small concern. It might be a win or not, but it's probably not going to make a big difference either way?

lebedev.ri updated this revision to Diff 145020. May 3 2018, 7:46 AM
lebedev.ri marked 9 inline comments as done.
  • Don't touch not.
  • Update X86TargetLowering::hasAndNot() with the check for immediates.

The last change affects the transform @spatel added in D27489 / rL289738,
for which X86 test coverage was missing.
But after I added it and looked at the changes in MCA, I'm again confused.


I'd say this regression is an improvement, since IPC increased?

andreadb added inline comments. May 4 2018, 6:22 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

llvm-mca is not lying.
The cortex-a75 model describes the latency of eor/bic/orr using variant scheduling classes.
llvm-mca doesn't know how to analyze variant scheduling classes, so you should have seen one or more warnings generated by the tool.

If by a 'good' model you mean one that doesn't use variant scheduling classes, then you only need to worry about the cases where the above-mentioned warnings are generated.

I am currently working hard on a patch to add support for variant scheduling classes. It is quite tricky, but I am confident that I will have a patch ready (hopefully) next week.

Cheers,
Andrea

lebedev.ri added inline comments. May 4 2018, 6:28 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

I've come up with this script to do these comparisons


I'm guessing I haven't noticed any warnings because I only redirect stdout, which is good to know.

spatel added inline comments. May 4 2018, 6:39 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

Aha - thanks for clearing that up. I missed the warnings because they were at the top of the output, and I didn't scroll that far back up. :)

$ llvm-mca -mtriple=aarch64 -mcpu=cortex-a75  eor.s 
warning: don't know how to model variant opcodes.
note: assume 1 micro opcode.

A potential usability improvement would be to make warnings like that one louder in some way (repeat it at the bottom, put asterisks in the stats?). Just a thought...now that I know, I'll definitely look harder at the whole output.

lebedev.ri added inline comments. May 4 2018, 6:42 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

Or maybe duplicate them to stdout too, not just output them to stderr?

andreadb added inline comments. May 4 2018, 6:43 AM
test/CodeGen/AArch64/unfold-masked-merge-scalar-variablemask.ll
400–402

I'll see what I can do to improve it. I might make the "warning:" red :-).

Alternatively I could add some sort of -Werror equivalent mode where the warning is promoted to a fatal error. Something like that...

As a side note: I mentioned this issue once in reply to D45733. Maybe that comment was lost in the noise.

I want to look at that x86 timing difference in more detail, but let me ask first: can we split this patch up and look at the changes independently?
I think there are 3 parts:

  1. Ignore 'not'.
  2. Change x86 hasAndNot().
  3. Improve matching for AndNot with constant.

I want to look at that x86 timing difference in more detail, but let me ask first: can we split this patch up and look at the changes independently?
I think there are 3 parts:

  1. Ignore 'not'.
  2. Change x86 hasAndNot().
  3. Improve matching for AndNot with constant.

Yes, I think I can split it into three; will post tomorrow.

lebedev.ri abandoned this revision. May 5 2018, 4:29 AM

I want to look at that x86 timing difference in more detail, but let me ask first: can we split this patch up and look at the changes independently?
I think there are 3 parts:

  1. Ignore 'not'.
  2. Change x86 hasAndNot().
  3. Improve matching for AndNot with constant.

Yes, I think I can split it into three; will post tomorrow.

Done: D46492, D46493, D46494.