This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Canonicalize OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))
ClosedPublic

Authored by RKSimon on Dec 20 2018, 8:23 AM.

Details

Summary

For constant bit-select patterns, replace one AND with an ANDNP, allowing us to reuse the constant mask. This prevents some load-folding but removes a constant pool vector entry, which I think is the right trade-off.
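For context, a minimal scalar C++ sketch of the shape of pattern this targets (the mask value and names here are illustrative, not taken from the patch):

#include <cstdint>

// Constant bit-select: take bits of x where the mask bit is 1, bits of y where it is 0.
// Before the combine the DAG needs both C and ~C as separate constant-pool vectors:
//   (x & C) | (y & ~C)
// After the combine the second AND becomes ANDNP(C, y), i.e. (~C) & y, so the single
// constant C feeds both operations.
uint32_t bit_select(uint32_t x, uint32_t y, uint32_t C) {
  return (x & C) | (~C & y);
}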

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Dec 20 2018, 8:23 AM
craig.topper added inline comments.Dec 20 2018, 1:07 PM
lib/Target/X86/X86ISelLowering.cpp
36273 ↗(On Diff #179084)

What about 512bit?

craig.topper added inline comments.Dec 20 2018, 1:18 PM
lib/Target/X86/X86ISelLowering.cpp
36273 ↗(On Diff #179084)

We're after type legalization here. Can this just check isVector?

RKSimon updated this revision to Diff 179256.Dec 21 2018, 2:37 AM

Accept all (legal) vector types to canonicalizeBitSelect

RKSimon updated this revision to Diff 179270.Dec 21 2018, 5:15 AM

Fixed combine-fcopysign.ll regressions by adding X86ISD::FOR computeKnownBits support

Would we be better off doing a generic DAGCombine into the optimal xor-and-xor (aka masked merge) pattern?
(x & C) | (y & ~C) --> ((x ^ y) & C) ^ y

That should allow load folding of the constant, so it never burns a register to hold the constant.

We weren't allowed to do that transform with a variable mask in D46814 because:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180528/556578.html

But I think that's ok with a constant mask. We can't accidentally dynamically access an undef value when the mask is a constant.
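As a sanity check, a self-contained snippet verifying that the masked-merge rewrite is a bitwise identity (the values here are arbitrary, purely for illustration):

#include <cassert>
#include <cstdint>

int main() {
  const uint64_t C = 0x00FF00FF00FF00FFULL;  // any constant mask
  uint64_t x = 0x1234567890ABCDEFULL;
  uint64_t y = 0xFEDCBA0987654321ULL;
  uint64_t sel = (x & C) | (y & ~C);         // original bit-select form, needs C and ~C
  uint64_t mrg = ((x ^ y) & C) ^ y;          // masked-merge form, constant used once
  assert(sel == mrg);                        // holds for all x, y when C is a constant
  return 0;
}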

Would we be better off doing a generic DAGCombine into the optimal xor-and-xor (aka masked merge) pattern?
(x & C) | (y & ~C) --> ((x ^ y) & C) ^ y

We weren't allowed to do that transform with a variable mask in D46814 because:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180528/556578.html

But I think that's ok with a constant mask. We can't accidentally dynamically access an undef value when the mask is a constant.

To add to that, I *think* we intentionally avoided transforming the pattern with constants there, IIRC because that was better for ValueTracking?

That should allow load folding of the constant, so it never burns a register to hold the constant.

I think that would work.
But there will likely be some cost to that: https://godbolt.org/z/-MZNgY

RKSimon edited the summary of this revision. (Show Details)Jan 7 2019, 8:05 AM

Would we be better off doing a generic DAGCombine into the optimal xor-and-xor (aka masked merge) pattern?
(x & C) | (y & ~C) --> ((x ^ y) & C) ^ y --> ((y ^ x) & ~C) ^ x

So do we agree that adding support for this additional pattern only makes sense if one of X or Y is being loaded (once)? It doesn't seem to matter if C is being reused or not.

spatel added a comment.Jan 7 2019, 9:05 AM

Would we be better off doing a generic DAGCombine into the optimal xor-and-xor (aka masked merge) pattern?
(x & C) | (y & ~C) --> ((x ^ y) & C) ^ y --> ((y ^ x) & ~C) ^ x

So do we agree that adding support for this additional pattern only makes sense if one of X or Y is being loaded (once)? It doesn't seem to matter if C is being reused or not.

I'm not seeing where loading X/Y is the deciding factor. We have something like this currently:

andps {{.*}}(%rip), %xmm1  // load constant
andps {{.*}}(%rip), %xmm2  // load inverse constant
orps  %xmm2, %xmm1

This would become with this patch:

movaps  {{.*}}(%rip), %xmm0 // load constant for 2 uses
andps   %xmm0, %xmm1
andnps  %xmm0, %xmm2
orps    %xmm2, %xmm1

With the masked-merge transform it would be:

xorps  %xmm1, %xmm2
andps  {{.*}}(%rip), %xmm2 // load fold constant for 1 use
xorps  %xmm2, %xmm1

It's 4 uops for either of the last 2 alternatives. The trade-off (which I think is impossible to model statically) is whether we're better off with the potentially better throughput of the sequence with 'andnps' or the potentially shorter (load-folded) sequence with the dependent logic ops. Either way is a uop improvement over the current codegen, so I'm not opposed to this patch, but the generic masked-merge transform would probably be a smaller patch?

Exactly - which is why I said it only applies to the cases where X and/or Y are loaded: https://godbolt.org/z/2qMqiF has examples where this would reduce loads and improve folding opportunities.

But getting the XOR variant emitted will be painful, as DAGCombiner::unfoldMaskedMerge is fighting against us all the way.

RKSimon planned changes to this revision.Jan 7 2019, 10:28 AM

I'm going to refactor this so it only kicks in on XOP or when any of the masks have multiple uses.

RKSimon updated this revision to Diff 181775.Jan 15 2019, 6:28 AM

Only canonicalize the pattern if the target is XOP (which will use the PCMOV instruction) or the mask has multiple uses.
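A rough sketch of the guard being described; the function and parameter names here are hypothetical, not the actual patch code:

// Hypothetical sketch, not the patch itself: canonicalize
// OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y)) only when the target has XOP
// (where the canonical form can be matched to a single VPCMOV) or when the mask
// constant already has more than one use, so no extra constant-pool load is created.
bool shouldCanonicalizeBitSelect(bool targetHasXOP, unsigned maskUses) {
  return targetHasXOP || maskUses > 1;
}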

Looks good to me, no further comments.

lib/Target/X86/X86ISelLowering.cpp
30146–30166 ↗(On Diff #181775)

I guess this can go separately.

spatel accepted this revision.Jan 21 2019, 3:17 PM

LGTM

This revision is now accepted and ready to land.Jan 21 2019, 3:17 PM
This revision was automatically updated to reflect the committed changes.