This is an archive of the discontinued LLVM Phabricator instance.

[X86] Allow up to 4 loads per inline memcmp()
AbandonedPublic

Authored by davezarzycki on Oct 16 2019, 8:28 AM.

Details

Summary

This effectively relands r308322 / D35067, but sidesteps the PR33914 regression by increasing the load count only for memcmp() calls where the user cares solely about equality (not which operand is greater or lesser).
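
To illustrate the distinction, here is a hedged C++ example (not code from the patch; the function names are hypothetical): only the first kind of call qualifies for the higher load count, while the second keeps the previous limit.

    #include <cstring>

    // Equality-only use: the result is only tested against zero, so the
    // expansion may use up to four load pairs under this patch.
    bool sameHeader(const char *a, const char *b) {
      return std::memcmp(a, b, 32) == 0;
    }

    // Tristate use: the sign of the result matters (greater/lesser/equal),
    // so the expansion keeps the previous two-load-pair limit.
    int orderHeaders(const char *a, const char *b) {
      return std::memcmp(a, b, 32);
    }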

This patch also generalizes combineVectorSizedSetCCEquality() to handle nontrivial memcmp() expansion pass results.

Diff Detail

Event Timeline

davezarzycki created this revision. Oct 16 2019, 8:28 AM
spatel added a subscriber: courbet. Oct 18 2019, 5:17 AM

Can we make the x86 change to combineVectorSizedSetCCEquality() independently and before the change to TargetLowering?

Given that the previous attempt was reverted purely because of a perf regression, it would be good to show some perf data here in the proposal, whether a micro-benchmark or something more substantial. cc'ing @courbet in case there's already a test harness in place for that.

I haven't looked at this in a while, so I wonder if we now have the infrastructure within memcmp expansion to create the partial vector code with 'ptest' shown here:
https://bugs.llvm.org/show_bug.cgi?id=33914

Also worth mentioning: one of the suspects for the regression in PR33914 was a trailing cmov. That's gone now*, so we might want to implement the simpler fix (expand everything to 4) and re-check perf.

*Replaced with x86 hackery:
setae %al
addl %eax, %eax
decl %eax
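
For readers decoding that sequence, here is a rough C++ equivalent (my own illustration; the helper name is hypothetical, and it assumes the equal case was already handled on another path): it turns the unsigned comparison result into +1 or -1 without a branch or cmov.

    // setae %al            -> ae = (a >= b) ? 1 : 0
    // addl %eax, %eax      -> eax = 2 * ae
    // decl %eax            -> eax = 2 * ae - 1, i.e. +1 or -1
    int signFromCompare(unsigned long long a, unsigned long long b) {
      int ae = (a >= b) ? 1 : 0;
      return 2 * ae - 1;
    }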

Hi @spatel,

Thanks for the feedback, and yes, we can separate the changes. A few thoughts:

  1. The inlined memcmp is much smarter than the Glibc memcmp code these days, at least for pure equality comparisons. In particular, the compiler's overlapping load optimization is really nice (see D55263, and the sketch after this list).
  2. This change proposal intentionally sidesteps more complex memcmps where the return result is tristate (greater, lesser, or equal), not binary (equal versus not). The tristate memcmp is what regressed in PR33914.
  3. It would be easy to contrive a microbenchmark that makes any libc memcmp look very bad and the inlined memcmp look very good. This would be fun, but not informative or actionable.
  4. If I were to design a somewhat interesting and quasi-realistic micro-benchmark, I might create a carefully crafted test that hammers on an llvm::StringSwitch where the cases need more than two load pairs to be inlined.
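
For anyone unfamiliar with D55263, here is a minimal sketch of the overlapping-load idea for a 15-byte equality compare, written by hand to mirror what the expansion emits (an illustration only; the function name is hypothetical):

    #include <cstdint>
    #include <cstring>

    // Two 8-byte loads per operand; the second pair overlaps the first by
    // one byte, so no scalar tail loop is needed for the odd size.
    bool equal15(const char *a, const char *b) {
      uint64_t a0, a1, b0, b1;
      std::memcpy(&a0, a, 8);
      std::memcpy(&b0, b, 8);
      std::memcpy(&a1, a + 7, 8);  // overlaps byte 7 of the first load
      std::memcpy(&b1, b + 7, 8);
      return ((a0 ^ b0) | (a1 ^ b1)) == 0;
    }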

All of that being said, and to be totally honest, I'd like to observe two things:

  1. The sizes of inlined memcmps tend to follow a log-normal distribution with a small mean/median/variance. In other words, the vast majority of inlined memcmps (especially on AVX or AVX512 CPUs) don't need more than two load pairs.
  2. That being said, the specialization that a higher max load pair count allows matters more on simpler CPUs with smaller vectors (if at all) and fewer micro-architectural tricks to mask the cost of libc's dynamic dispatch.

Therefore, I would argue that the max load pair count should be derived, not fixed. For example, I think the following pseudocode would yield reasonable results across the semi-recent history of Intel's product line: 2 * CACHELINE_SIZE / PREFERRED_VECTOR_SIZE
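
Spelled out as code (a sketch of the formula above only; it assumes a 64-byte cache line, and the function name is hypothetical):

    // Derived load-pair budget: two cache lines' worth of preferred-width loads.
    constexpr unsigned maxLoadPairs(unsigned CacheLineBytes,
                                    unsigned PreferredVectorBytes) {
      return 2 * CacheLineBytes / PreferredVectorBytes;
    }

    static_assert(maxLoadPairs(64, 64) == 2, "AVX-512: 64-byte vectors");
    static_assert(maxLoadPairs(64, 32) == 4, "AVX2: 32-byte vectors");
    static_assert(maxLoadPairs(64, 16) == 8, "SSE2: 16-byte vectors");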

I'd further argue that the compiler shouldn't assume that "max load pairs per block" being less than "max load pairs" is predictable by the branch predictor, but that's a separate discussion.

Your thoughts would be appreciated. Thanks!

I agree that a derived setting would be better than hard-coding. Exactly what that formula should be, I don't know...
Lazy question (can't tell from the test diffs): are we ignoring -mprefer-vector-width in these expansions? If so, we're almost certainly going to create fallout.
There's a lot to unravel for memcmp, so let's break this down into possible patches/steps:

  1. Add a bunch of x86 tests (all of the new tests here can be committed with baseline codegen, so we'll just see diffs as we enable more expansions). Include some -prefer-vector-width override RUNs.
  2. Enhance the x86 lowering.
  3. Change/adjust the MaxLoads settings (probably fine to start, as shown here, with a simple tweak of the hardcoded value; a sketch of the hook involved follows this list).
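
To make step 3 concrete, here is a rough sketch written against TargetTransformInfo::MemCmpExpansionOptions as I understand it. The real hook is X86TTIImpl::enableMemCmpExpansion(); this free function, its parameters, and the specific values are illustrative placeholders, not the patch itself.

    #include "llvm/Analysis/TargetTransformInfo.h"
    using namespace llvm;

    TargetTransformInfo::MemCmpExpansionOptions
    memcmpOptionsSketch(bool OptSize, bool IsZeroCmp,
                        bool HasAVX512, bool HasAVX2, bool HasSSE2) {
      TargetTransformInfo::MemCmpExpansionOptions Options;
      // Give equality-only compares a larger load budget, per this patch's intent.
      Options.MaxNumLoads = (IsZeroCmp && !OptSize) ? 4 : 2;
      Options.NumLoadsPerBlock = Options.MaxNumLoads;
      if (IsZeroCmp) {
        // Vector loads only help when we just need equal/not-equal.
        if (HasAVX512) Options.LoadSizes.push_back(64);
        if (HasAVX2)   Options.LoadSizes.push_back(32);
        if (HasSSE2)   Options.LoadSizes.push_back(16);
        Options.AllowOverlappingLoads = true;
      }
      Options.LoadSizes.append({8, 4, 2, 1});
      return Options;
    }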

The staging sounds fine. As an aside, this patch is exposing a bug in EVEX address/displacement generation:

length256_eq:
    vmovdqu64 -128(%rdi), %zmm0
    vmovdqu64 -64(%rdi), %zmm1
    vmovdqu64 (%rdi), %zmm2
    vmovdqu64 64(%rdi), %zmm3

The above code should be:

length256_eq:
    vmovdqu64 (%rdi), %zmm0
    vmovdqu64 64(%rdi), %zmm1
    vmovdqu64 128(%rdi), %zmm2
    vmovdqu64 192(%rdi), %zmm3

Any tips on how to debug this?

Oh, and the memcmp() expansion only honors -mprefer-vector-width if the width is 256 or 512. I don't think any part of the X86 code gen actually honors 128 when 256 or 512 is possible.

Nothing disables 256-bit types the way we do for 512. But X86TargetLowering::getOptimalMemOpType() honors -mprefer-vector-width=128

craig.topper added a comment. Edited Oct 19 2019, 9:41 AM

The negative offsets on length256_eq are coming from the memcmp expansion IR.

*** IR Dump After Expand memcmp() to load/stores ***

define i1 @length256_eq(i8* %x, i8* %y) #0 {
  %1 = bitcast i8* %x to i512*
  %2 = bitcast i8* %y to i512*
  %3 = load i512, i512* %1
  %4 = load i512, i512* %2
  %5 = xor i512 %3, %4
  %6 = getelementptr i8, i8* %x, i8 64
  %7 = bitcast i8* %6 to i512*
  %8 = getelementptr i8, i8* %y, i8 64
  %9 = bitcast i8* %8 to i512*
  %10 = load i512, i512* %7
  %11 = load i512, i512* %9
  %12 = xor i512 %10, %11
  %13 = getelementptr i8, i8* %x, i8 -128
  %14 = bitcast i8* %13 to i512*
  %15 = getelementptr i8, i8* %y, i8 -128
  %16 = bitcast i8* %15 to i512*
  %17 = load i512, i512* %14
  %18 = load i512, i512* %16
  %19 = xor i512 %17, %18
  %20 = getelementptr i8, i8* %x, i8 -64
  %21 = bitcast i8* %20 to i512*
  %22 = getelementptr i8, i8* %y, i8 -64
  %23 = bitcast i8* %22 to i512*
  %24 = load i512, i512* %21
  %25 = load i512, i512* %23
  %26 = xor i512 %24, %25
  %27 = or i512 %5, %12
  %28 = or i512 %19, %26
  %29 = or i512 %27, %28
  %30 = icmp ne i512 %29, 0
  %31 = zext i1 %30 to i32
  %cmp = icmp ne i32 %31, 0
  ret i1 %cmp
}

This seems to be because the GEPs use only i8 as their index type, so 128 was aliased to -128, and 192 to -64.
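
The same wraparound in plain C++, just to spell out the aliasing (a standalone illustration, not part of the fix):

    #include <cstdint>
    #include <cstdio>

    int main() {
      // With an i8 index type, the constant offsets wrap around:
      std::printf("%d %d\n", (int)(int8_t)128, (int)(int8_t)192);  // prints: -128 -64
    }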

This seems to fix it. I haven't tested to see if it affects any existing tests.

diff --git a/llvm/lib/CodeGen/ExpandMemCmp.cpp b/llvm/lib/CodeGen/ExpandMemCmp.cpp
index 9916f2de041..0539db58a61 100644
--- a/llvm/lib/CodeGen/ExpandMemCmp.cpp
+++ b/llvm/lib/CodeGen/ExpandMemCmp.cpp
@@ -264,9 +264,9 @@ Value *MemCmpExpansion::getPtrToElementAtOffset(Value *Source,
                                                 uint64_t OffsetBytes) {
   if (OffsetBytes > 0) {
     auto *ByteType = Type::getInt8Ty(CI->getContext());
-    Source = Builder.CreateGEP(
+    Source = Builder.CreateConstGEP1_64(
         ByteType, Builder.CreateBitCast(Source, ByteType->getPointerTo()),
-        ConstantInt::get(ByteType, OffsetBytes));
+        OffsetBytes);
   }
   return Builder.CreateBitCast(Source, LoadSizeType->getPointerTo());
 }

@craig.topper – That does seem to fix the bug. Thanks!

Just FYI everybody, I built LLVM+clang+lld with this change, and I don't see any evidence of more than two AVX512 load pairs being generated outside of one LLVM unit test (which failed until Craig's patch). And yes, one could argue that LLVM/clang/lld aren't representative of normal code, but let's pause on that. I just wanted to test a large source base.

Moreover, the ratios are consistent with my assertion that inline memcmps have a log-normal distribution (with a small mean/median/variance). For example, clang has ~3300 XMM load pairs, ~230 YMM load pairs, and ~40 ZMM load pairs. These values are approximate because the script I wrote might have missed something.

Some testing results.

I built llvm+clang twice, both with core2 as the target CPU: once without this change and once with it. I inspected the 4-load-pair clang's assembly to verify that at least some memcmps generated three or more XMM load pairs; that being said, more than two XMM load pairs was uncommon. I then ran perf stat against clang while it compiled X86ISelLowering.cpp (which takes about 37 seconds on my Xeon 8168 with turbo disabled).

In terms of "wall clock" performance, allowing up to four load pairs is lost in the noise. (At best, there might be a 0.082% difference.) The 2-load-pair clang required 0.027% more instructions to execute versus the 4-load-pair clang, and almost 0.03% more branches. Both of these seem given the dynamic overhead of Libc's memcmp().

Separately, I started writing a microbenchmark that used llvm::StringSwitch, but it didn't feel right. Two (potentially overlapping) XMM registers can cover all values up to 32 bytes, which is big enough for the majority of real-world scenarios.

Overall, I've changed my mind about this proposal. I think the time and place for 4 (or more) load pairs was in the pre-vector (and therefore pre-64-bit) era, where going from 2 scalar load pairs to 4 scalar load pairs was a bigger win because the load sizes were so tiny.

I suppose we could enable four load pairs on pre-SSE machines if people care. Otherwise, unless there are objections, I'll close this proposal in a few days.

Thanks for running the experiments. I don't have a motivating case to change the current setting, so no objection from me. Adding Clement and Guillaume as reviewers in case they have data/thoughts.

I don't remember cases where we had very large constant compares (though we do have quite a lot of small ones). I'll run our internal benchmarks with this change.

I've run our benchmarks, and I see no improvement from the change.

Thanks. I think we're almost ready to close this. Do your benchmarks test pre-SSE2 CPUs, in particular 32-bit CPUs? Otherwise, as long as SSE vector registers are available, two load pairs cover the majority of inline memcmp scenarios.

No, we only have SSE2 and above.

davezarzycki abandoned this revision. Nov 12 2019, 1:12 AM

I think we reasoned our way out of this.