For constant bit select patterns, replace one of the ANDs with an ANDNP, allowing us to reuse the constant mask. This prevents some load folding but removes a constant pool vector entry, which I think is the right trade-off.
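A minimal scalar sketch of the before/after shapes, purely for illustration (the actual change is a vector DAG combine in X86ISelLowering.cpp; the function names here are made up):

  #include <cassert>
  #include <cstdint>

  // Before: the select needs two materialized constants, C and ~C.
  uint64_t bitSelectTwoConstants(uint64_t x, uint64_t y, uint64_t C) {
    return (x & C) | (y & ~C); // ~C is a separate constant pool entry for vectors
  }

  // After: reuse the single constant C; the second AND becomes an AND-NOT
  // (ANDNPS/ANDNPD/PANDN on x86), so only one constant is materialized.
  uint64_t bitSelectOneConstant(uint64_t x, uint64_t y, uint64_t C) {
    return (x & C) | (~C & y); // andnot(C, y)
  }

  int main() {
    assert(bitSelectTwoConstants(0x1234, 0xabcd, 0xff00) ==
           bitSelectOneConstant(0x1234, 0xabcd, 0xff00));
  }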
Diff Detail
- Repository: rL LLVM
Event Timeline
lib/Target/X86/X86ISelLowering.cpp:36273 (On Diff #179084): What about 512-bit?
lib/Target/X86/X86ISelLowering.cpp:36273 (On Diff #179084): We're after type legalization here. Can this just check isVector?
Fixed combine-fcopysign.ll regressions by adding X86ISD::FOR computeKnownBits support
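As an aside on that fix, here's a rough standalone sketch of how known bits propagate through a bitwise OR (X86ISD::FOR is just an OR on FP-typed vectors). Plain C++, not LLVM's actual KnownBits API:

  #include <cassert>
  #include <cstdint>

  // Simplified stand-in for LLVM's KnownBits: a set bit in Zero/One means that
  // bit is known to be 0/1 in every possible value.
  struct Known { uint64_t Zero, One; };

  // For OR, a result bit is known one if it is known one in either operand,
  // and known zero only if it is known zero in both operands.
  Known knownBitsForOr(Known A, Known B) {
    Known K;
    K.One = A.One | B.One;
    K.Zero = A.Zero & B.Zero;
    return K;
  }

  int main() {
    Known A{0x0f, 0xf0}, B{0xff, 0x00};
    Known K = knownBitsForOr(A, B);
    assert(K.One == 0xf0 && K.Zero == 0x0f);
  }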
Would we be better off doing a generic DAGCombine into the optimal xor-and-xor (aka masked merge) pattern?
(x & C) | (y & ~C) --> ((x ^ y) & C) ^ y
That should allow load folding of the constant, so it never burns a register to hold the constant.
We weren't allowed to do that transform with a variable mask in D46814 because:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180528/556578.html
But I think that's ok with a constant mask. We can't accidentally dynamically access an undef value when the mask is a constant.
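FWIW, the identity itself is easy to sanity-check with a quick standalone C++ snippet (just verifying the bit math, nothing DAG-related):

  #include <cassert>
  #include <cstdint>

  int main() {
    // Spot-check (x & C) | (y & ~C) == ((x ^ y) & C) ^ y over a few patterns.
    const uint64_t Vals[] = {0x0ULL, ~0x0ULL, 0x0123456789abcdefULL,
                             0xf0f0f0f0f0f0f0f0ULL};
    for (uint64_t x : Vals)
      for (uint64_t y : Vals)
        for (uint64_t C : Vals)
          assert(((x & C) | (y & ~C)) == (((x ^ y) & C) ^ y));
  }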
To add to that, I *think* we intentionally avoided transforming the pattern with constants there, IIRC because that was better for ValueTracking?
> That should allow load folding of the constant, so it never burns a register to hold the constant.
I think that would work.
But there will likely be some cost to that: https://godbolt.org/z/-MZNgY
So do we agree that adding support for this additional pattern only makes sense if one of X or Y is being loaded (once)? It doesn't seem to matter if C is being reused or not.
I'm not seeing where loading X/Y is the deciding factor. We have something like this currently:
  andps {{.*}}(%rip), %xmm1   // load constant
  andps {{.*}}(%rip), %xmm2   // load inverse constant
  orps %xmm2, %xmm1
With this patch, this would become:
  movaps {{.*}}(%rip), %xmm0  // load constant for 2 uses
  andps %xmm0, %xmm1
  andnps %xmm0, %xmm2
  orps %xmm2, %xmm1
With the masked-merge transform it would be:
  xorps %xmm1, %xmm2
  andps {{.*}}(%rip), %xmm2   // load fold constant for 1 use
  xorps %xmm2, %xmm1
It's 4 uops for either of the last 2 alternatives. The trade-off (which I think is impossible to model statically) is whether we're better off with the potentially better throughput of the sequence with 'andnps' or the potentially shorter (load-folded) sequence with the dependent logic ops. Either way is a uop improvement over the current codegen, so I'm not opposed to this patch, but the generic masked-merge transform would probably be a smaller patch?
Exactly - which is why I said only the cases where X and/or Y are loaded: https://godbolt.org/z/2qMqiF has examples where this would reduce loads and improve folding opportunities.
But getting the XOR variant emitted will be painful as DAGCombiner::unfoldMaskedMerge is fighting against us all the way.
I'm going to refactor this so it only kicks in on XOP or when any of the masks have multiple uses.
Only canonicalize the pattern if the target is XOP (which will use the PCMOV instruction) or the mask has multiple uses.
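Roughly, the gating condition is (illustrative sketch only; the predicate names are made up and this isn't the actual X86ISelLowering.cpp code):

  // Decide whether to canonicalize to the AND/ANDNP bit-select form.
  bool shouldFormAndnpBitSelect(bool SubtargetHasXOP, bool MaskHasMultipleUses) {
    // On XOP the whole bit-select lowers to a single VPCMOV, so it is always
    // worth canonicalizing.
    if (SubtargetHasXOP)
      return true;
    // Otherwise only do it when the mask constant is reused; with a single use
    // the xor-and-xor (masked merge) form can load-fold the mask instead.
    return MaskHasMultipleUses;
  }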
Looks good to me, no further comments.
lib/Target/X86/X86ISelLowering.cpp:30146-30166 (On Diff #181775): I guess this can go separately.