This is an archive of the discontinued LLVM Phabricator instance.

[x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality
ClosedPublic

Authored by spatel on Mar 23 2017, 9:17 AM.

Details

Summary

This is the payoff for D31156 - if a target has efficient comparison instructions for vector-sized equality, we can replace memcmp calls with inline code that is both smaller and faster.

Seems like we're missing a load folding opportunity on the first test, but that's a separate problem.

I can enable the 32-byte case for AVX2 as an immediate follow-up, but I want to make sure this part looks ok before adding that.

Diff Detail

Event Timeline

spatel created this revision.Mar 23 2017, 9:17 AM
efriedma added inline comments.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6098

What's the point of performing the load in a vector type if you're going to immediately bitcast the result to an integer type? IIRC DAGCombine will fold this away.

test/CodeGen/X86/memcmp.ll
104

What's the performance of this compared to using integer registers? (movq+xorq+movq+xorq+orq).

spatel added inline comments.Mar 23 2017, 1:40 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6098

I actually had it loading i128 to start, but I saw 2 problems:

  1. The i128 loads+bitcasts weren't converted to vector loads directly. Legalization for x86-64 split this into i64 loads, and then we had to rely on the combiner to merge the loads. At the least, I think this would be slower to compile since it caused more nodes to be created and folded. At worst, we might not put the loads back together properly and that would lead to poor code. It wasn't clear to me that I could add a generic combine to do that either since some targets might not want that.
  2. It wasn't honest to use i128 loads and bypass the isTypeLegal() check. We could make the TLI hook more specialized to account for that - have it confirm that loads of a given type/size are fast, so it's truly just a memcmp hook. But given the first problem, I got scared away.
test/CodeGen/X86/memcmp.ll
104

Hmm...didn't consider that option since movmsk has been fast for a long time and scalar always needs more ops. We'd need to separate x86-32 from x86-64 too. I'll try to get some real numbers.

spatel added inline comments.Mar 23 2017, 2:32 PM
test/CodeGen/X86/memcmp.ll
104

I benchmarked the 2 sequences shown below and the libcall. On Haswell with macOS, I'm seeing more wobble in these numbers than I can explain, but:

memcmp : 34485936 cycles for 1048576 iterations (32.89 cycles/iter).
vec cmp : 5245888 cycles for 1048576 iterations (5.00 cycles/iter).
xor cmp : 5247940 cycles for 1048576 iterations (5.00 cycles/iter).

On Ubuntu with AMD Jaguar:

memcmp : 21150343 cycles for 1048576 iterations (20.17 cycles/iter).
vec cmp : 9988395 cycles for 1048576 iterations (9.53 cycles/iter).
xor cmp : 9471849 cycles for 1048576 iterations (9.03 cycles/iter).

.align  6, 0x90
.global _cmp16vec
_cmp16vec:
movdqu (%rsi), %xmm0
movdqu (%rdi), %xmm1
pcmpeqb %xmm0, %xmm1
pmovmskb %xmm1, %eax
cmpl $65535, %eax
setne %al
movzbl  %al, %eax
retq

.align  6, 0x90
.global _cmp16scalar
_cmp16scalar:
movq  (%rsi), %rax
movq  8(%rsi), %rcx
xorq  (%rdi), %rax
xorq  8(%rdi), %rcx
orq %rax, %rcx
setne %al
movzbl  %al, %eax
retq
spatel added inline comments.Mar 23 2017, 3:36 PM
test/CodeGen/X86/memcmp.ll
104

(clang produces the xor sequence if you just write int x(__int128_t*x, __int128_t*y) { return *x == *y; }.)

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6098

If DAGCombine doesn't fold it away, this is fine, I guess. Maybe let the target specify the type to use, in case some target wants to use a type that isn't <4 x i32>?

spatel added inline comments.Mar 24 2017, 9:56 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6098

Yes - that would be better. We can cycle through the possible simple types (including i128), and the target can let us know what works.

Also, your example using "__int128_t" probably explains why we saw/expected different things after this step in the DAG. If the loads are aligned, then we will legalize these to v16i8 loads for an SSE2 target, but not if they are unaligned as I was seeing in my experiments.

spatel updated this revision to Diff 92974.Mar 24 2017, 10:49 AM

Patch updated:
Check all of the 16-byte simple value types before giving up.

Eli pointed me to D28637 (which I hadn't seen of course!) - a general solution for memcmp transformation. Not sure if this specialization still makes sense given that patch, but since I already made the edits, I'll post it.

Not sure if this specialization still makes sense given that patch, but since I already made the edits, I'll post it.

Even with that patch, we probably still want a similar target hook. Might as well finish/merge this now, then make sure we continue to generate the same efficient code when x86 transitions to the new memcmp lowering.

lib/Target/X86/X86ISelLowering.h
819

It probably makes sense to make this take a size in bytes, and return a VT, rather than calling this with every possible VT.

spatel marked an inline comment as done.Mar 24 2017, 1:42 PM
spatel added inline comments.
lib/Target/X86/X86ISelLowering.h
819

Yep - that makes the patch simpler.

spatel updated this revision to Diff 93004.Mar 24 2017, 2:05 PM
spatel marked an inline comment as done.

Patch updated:
Have the TLI hook return the preferred operand (load) type for a given bitwidth, so we don't have to cycle through all of those when transforming the memcmp().

I'm using EVT instead of MVT in the hook anticipating that we extend this to 256-bit types for AVX2. In that case, we'd use i256 which isn't an MVT / simple type, so we'd have to switch it at that point unless I'm misunderstanding how these things work.

spatel updated this revision to Diff 93010.Mar 24 2017, 2:26 PM

Patch updated:
On 2nd thought, that EVT/MVT argument makes no sense. The returned type from the hook is always going to be an MVT because it will be a supported type in order to be fast. Using MVT makes the code a bit cleaner since we don't have to pass a context around for those.

efriedma added inline comments.Mar 24 2017, 2:44 PM
lib/Target/X86/X86ISelLowering.cpp
4646

Maybe check isTypeLegal(MVT::v16i8) instead? hasSSE2() doesn't mean what you want it to.

4650

Maybe also 64-bit types (on a 32-bit target).

test/CodeGen/X86/memcmp.ll
2

Could you regenerate this test so it also compiles for a 32-bit target?

spatel updated this revision to Diff 93019.Mar 24 2017, 3:45 PM
spatel marked 3 inline comments as done.

Patch updated:

  1. Added 32-bit target testing in rL298744
  2. Don't use hasSSE2() in the x86 override - that won't work if we're in soft-float mode (nice catch!).
  3. Add TODO comment to handle 64-bit type on x86 32-bit target.
efriedma accepted this revision.Mar 24 2017, 4:32 PM

LGTM.

This revision is now accepted and ready to land.Mar 24 2017, 4:33 PM
This revision was automatically updated to reflect the committed changes.