This is an archive of the discontinued LLVM Phabricator instance.

[libc] Improve memcmp latency and codegen
ClosedPublic

Authored by gchatelet on Apr 19 2023, 8:22 AM.

Details

Summary

This is based on ideas from @nafi to:

  • use a branchless version of 'cmp' for 'uint32_t',
  • completely resolve the lexicographic comparison through vector operations when wide types are available. We also get rid of byte reloads and serializing '__builtin_ctzll'.

I did not include the suggestion to replace comparisons of 'uint16_t'
with two 'uint8_t' as it did not seem to help the codegen. This can
be revisited in subsequent patches.

The code has been rewritten to reduce nested function calls, making the
inliner's job easier and preventing harmful code duplication.
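
As a rough illustration of the second bullet above (a hedged sketch of mine, not the patch's actual code): loading both words as big-endian makes integer ordering match lexicographic byte ordering, so a single compare resolves the result without per-byte reloads or '__builtin_ctzll'. The helper names below are illustrative.

#include <cstdint>
#include <cstring>

// Illustrative helper: load 8 bytes as a big-endian integer (assumes a
// little-endian host, hence the byte swap).
static uint64_t load_be64(const char *p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  return __builtin_bswap64(v);
}

// Lexicographic comparison of 8 bytes: no per-byte reload, no ctz.
int cmp8(const char *p1, const char *p2) {
  const uint64_t a = load_be64(p1);
  const uint64_t b = load_be64(p2);
  if (a == b)
    return 0;
  return a < b ? -1 : 1;
}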

Diff Detail

Event Timeline

gchatelet created this revision.Apr 19 2023, 8:22 AM
Herald added projects: Restricted Project, Restricted Project.Apr 19 2023, 8:22 AM
gchatelet requested review of this revision.Apr 19 2023, 8:22 AM
gchatelet added inline comments.Apr 19 2023, 8:26 AM
libc/src/string/CMakeLists.txt
464

I'll submit this as a separate patch.

gchatelet updated this revision to Diff 515724.Apr 21 2023, 7:23 AM
  • rebase
  • Simplifying uint32_t neq<uint64_t>

@nafi3000
I've created a trimmed-down version of this code to play with the codegen: https://godbolt.org/z/rdEG5nY1q
I set the default compile options so that you can compare with the code we discussed earlier, i.e. -O3 -march=haswell -mprefer-vector-width=128 -mno-avx2.

Let me know what you think.
Also, Phabricator is sometimes confusing: if you've made inline comments you need to hit the Submit button at the bottom of the page, otherwise I won't see them.
Ping me offline if needed.

nafi3000 added inline comments.May 2 2023, 11:32 AM
libc/src/string/memory_utils/bcmp_implementations.h
142

OOC, how about using head_tail (2 loads) instead of a BcmpSequence of 3 loads? E.g.

generic::Bcmp<uint32_t>::head_tail(p1, p2, 7)

libc/src/string/memory_utils/op_generic.h
608

OOC, have you ever tried other values? E.g. how about:

return a > b ? 0x7fffffff : 0x80000000;

or if it does not compile due to the MemcmpReturnType, then:

return a > b ? static_cast<int32_t>(0x7fffffff) : static_cast<int32_t>(0x80000000);

-1 and 1 are 2 values apart.
0x7fffffff and 0x80000000 are 1 value apart.

assembly diff: https://www.diffchecker.com/LMVfxJ1D/

xor eax, eax	
cmp rcx, r8	
sbb eax, eax	
or eax, 1

vs

cmp r8, rcx
mov eax, -2147483648
sbb eax, 0

In theory... the former should take 3 cycles (the or waits for the sbb, and the sbb waits for the cmp) while the latter should take 2 cycles (the cmp and mov happen in parallel, with the sbb depending on the cmp), right?

libc/src/string/memory_utils/op_x86.h
70–71

[optional nit] Maybe not in this diff, but eventually we can programmatically generate the sequences here and at lines 129 and 160 below.

202–203

Would it make sense to factor out this part to another function?
This is used here and for cmp<uint32_t>.

233

Ditto, -1 : 1 vs 0x80000000 : 0x7fffffff

libc/src/string/memory_utils/x86_64/memcmp_implementations.h
75

Ditto. Similar to the bcmp comment, how about using head_tail (2 loads) instead of MemcmpSequence of 3 loads? E.g.

generic::Memcmp<uint32_t>::head_tail(p1, p2, 7)

x86 asm diff:
https://www.diffchecker.com/XQNu3lGN/

libc/test/src/string/memory_utils/op_tests.cpp
220

Do we need to add generic::BcmpSequence<uint32_t, uint8_t> and generic::BcmpSequence<uint32_t, uint16_t> here too? I am interpreting the above list as:
8, 1, 2, 4, 1+1, 1+1+1, 2+1, 4+2+1

nafi3000 added inline comments.May 3 2023, 10:04 AM
libc/src/string/memory_utils/op_generic.h
407–417

I wonder if it is better to use cmp<T> only for the last comparison. The motivation is that for non-last compare blocks we need to check the comparison result anyway (e.g. line 470 above) to decide whether to load and compare the next block in the sequence. Isn't it better to compute this decision (0 or non-0) as early as possible instead of computing the full cmp result (0, <0 or >0)?

E.g.

if constexpr (sizeof...(TS) == 0) {
  if constexpr (cmp_is_expensive<T>::value) {
    if (eq<T>(p1, p2, 0))
      return MemcmpReturnType::ZERO();
    return cmp_neq<T>(p1, p2, 0);
  } else {
    return cmp<T>(p1, p2, 0);
  }
} else {
  if (!eq<T>(p1, p2, 0))
    return cmp_neq<T>(p1, p2, 0);
  return MemcmpSequence<TS...>::block(p1 + sizeof(T), p2 + sizeof(T));
}

And, for the last block, I wonder if we can invariably call cmp<T> instead. Which is better would depend on the data. E.g. for __m512i, cmp<T> is faster if there is at least 1 byte mismatch in the last 64 bytes.

gchatelet updated this revision to Diff 520983.May 10 2023, 7:03 AM
gchatelet marked 7 inline comments as done.
  • Address most comments
gchatelet added inline comments.May 10 2023, 7:03 AM
libc/src/string/memory_utils/bcmp_implementations.h
142

Yeah, there are a bunch of options here. Usually I want to use head_tail to cover a range of sizes, as it clearly diminishes the overall code size (it covers all sizes from 5 to 8 with only two loads per pointer).
Depending on how often those sizes appear in the size distribution, it might be useful to special-case the code.
I'm tempted to keep the previous logic to prevent regressions and make this a separate change. WDYT?
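
For illustration, here is a hedged sketch of the head_tail idea (my own snippet, not the library's generic::Bcmp helpers): two possibly-overlapping 4-byte loads per pointer cover every size in [4, 8] with a single code path.

#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative unaligned 32-bit load.
static uint32_t load32(const char *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Non-zero iff the two buffers differ; valid for 4 <= count <= 8.
uint32_t bcmp_head_tail_u32(const char *p1, const char *p2, std::size_t count) {
  const uint32_t head = load32(p1) ^ load32(p2);
  const uint32_t tail = load32(p1 + count - 4) ^ load32(p2 + count - 4);
  return head | tail;
}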

libc/src/string/memory_utils/op_generic.h
608

Nice one. Picking other values was on my TODO list but I never thought it through.

I had a look at the codegen for armv8: it uses cinv instead of cneg, but it seems to be neutral in terms of performance.
https://godbolt.org/z/Y9aGq5sPd

For x86 in theory this should be better yes 👍.
https://godbolt.org/z/69Gefhqef

For RISC-V it seems it's worse as it generates a branch.
https://godbolt.org/z/bvqMeMjMP
This seems to be in line with what I found on Stack Overflow.

Now I tried it on the full implementation and the additional branch seems to be outlined, leading to the same code size. I don't know the impact on speed, but since we don't yet have an optimized version of memcmp for RISC-V we can happily revisit the function later on.
I've created a separate function so we can keep track of the rationale.

libc/src/string/memory_utils/op_x86.h
70–71

AFAICT we can't do it with the Intel intrinsics, as they are real functions expecting a certain number of arguments.
It may be possible to generate them with the GCC and Clang vector extensions and then convert them back to Intel types, though.

I gave it a try but it's brittle on clang and fails on GCC
https://godbolt.org/z/Ms7fW5nP3

Not sure it's worth it.

202–203

Yes, I've done so for the uint64_t so let's factor this out for uint32_t as well.

libc/test/src/string/memory_utils/op_tests.cpp
220

Done. More coverage doesn't hurt : )

gchatelet updated this revision to Diff 520990.May 10 2023, 7:50 AM
  • Rebase
  • Simplifying uint32_t neq<uint64_t>
  • Address most comments
  • Fix bazel build
lntue added a subscriber: lntue.May 10 2023, 9:16 AM
lntue added inline comments.
libc/src/string/memory_utils/bcmp_implementations.h
92

Currently we only check for SSE4.2 in our CMake build: https://github.com/llvm/llvm-project/blob/main/libc/cmake/modules/LLVMLibCCheckCpuFeatures.cmake#L9

Do you want to change that, or add SSE4.1 to the list instead?

gchatelet marked an inline comment as done.May 11 2023, 4:48 AM
gchatelet added inline comments.
libc/src/string/memory_utils/bcmp_implementations.h
92

Technically this code only needs SSE4.1 but I don't think it's worth discriminating between the two.
https://en.wikipedia.org/wiki/SSE4#SSE4_subsets

Looking at the Steam hardware survey (click the Other Settings line at the bottom of the page), the share of CPUs having SSE4.1 but not SSE4.2 is about 0.23%.

  • SSE4.1 : 99.36%
  • SSE4.2 : 99.13%

So we can basically only discriminate between SSE2 and SSE4.2 and call it a day : )

libc/src/string/memory_utils/op_generic.h
407–417

The current benchmark works for all memory functions and is only able to assess the throughput of functions under a particular size distribution.
As-is, it is not a good tool to evaluate functions that may return early (bcmp and memcmp) and for which latency is important.

I'll work on adding a latency benchmark based on the one you provided me earlier. Once it's done I think we will be in a better position to decide which strategy is better.

SGTY?

gchatelet updated this revision to Diff 522959.May 17 2023, 2:05 AM
gchatelet marked an inline comment as done.
  • rebase and merge RISCV changes
gchatelet updated this revision to Diff 524646.May 23 2023, 4:04 AM
  • rebase and add a TODO to explore more optimization once we have proper latency benchmarking.

@nafi3000 I've been busy and haven't had time to work on the benchmark.
I will land this change as-is if you accept this revision (Menu: Add Action... -> Accept Revision). We can iterate on this through subsequent patches. WDYT?

nafi3000 accepted this revision.Jun 2 2023, 7:33 PM
nafi3000 added inline comments.
libc/src/string/memory_utils/bcmp_implementations.h
142

Separate change SGTM.

libc/src/string/memory_utils/op_generic.h
407–417

Sounds good.

libc/src/string/memory_utils/op_x86.h
70–71

We can use _mm_load_si128 instead of _mm_set_epi8. The Snappy code has an example:
https://github.com/google/snappy/blob/main/snappy.cc
Search for pattern_generation_masks.

Anyway, this can be addressed in a separate diff. And like you mentioned, it may not be worth it. In Snappy we actually need an array of such shuffle masks, but in this case we just need one.
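
For reference, a minimal sketch of that idea (the mask values and names below are placeholders, not Snappy's or llvm-libc's actual constants): keep the byte pattern as an aligned constant in memory and load it with _mm_load_si128, instead of materializing it with _mm_set_epi8 at each call site.

#include <immintrin.h>
#include <cstdint>

// Hypothetical 16-byte shuffle mask kept in static storage.
alignas(16) static constexpr uint8_t kByteReverseMask[16] = {
    15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};

static inline __m128i byte_reverse(__m128i v) {
  // One aligned load of the precomputed mask, then a single pshufb (SSSE3).
  const __m128i mask =
      _mm_load_si128(reinterpret_cast<const __m128i *>(kByteReverseMask));
  return _mm_shuffle_epi8(v, mask);
}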

This revision is now accepted and ready to land.Jun 2 2023, 7:33 PM
This revision was automatically updated to reflect the committed changes.
gchatelet reopened this revision.Jun 5 2023, 4:43 AM

Reopening to fix aarch64 and riscv

This revision is now accepted and ready to land.Jun 5 2023, 4:43 AM
gchatelet updated this revision to Diff 529308.Jun 7 2023, 8:07 AM
  • Fix aarch64 and RISCV implementations
gchatelet updated this revision to Diff 529552.Jun 8 2023, 4:16 AM
  • Specialize types per architecture
gchatelet updated this revision to Diff 529554.Jun 8 2023, 4:30 AM
  • Forgot to add libc/src/string/memory_utils/aarch64/memcmp_implementations.h
gchatelet updated this revision to Diff 529560.Jun 8 2023, 5:05 AM
  • Add type specialization for RISCV

I still need to check that the ARM platform is not impacted by this patch. I'll land it once it's done.

gchatelet updated this revision to Diff 529581.Jun 8 2023, 6:56 AM
  • Add missing namespace for RISCV
  • Also include riscv32
lntue added inline comments.Jun 8 2023, 7:30 AM
libc/src/string/CMakeLists.txt
573

If you drop the requirement to AVX, should the compile option be -march=sandybridge instead?

gchatelet updated this revision to Diff 529858.Jun 9 2023, 2:01 AM
  • Disable non uint8_t type tests for ARM platform
gchatelet updated this revision to Diff 529866.Jun 9 2023, 2:26 AM
gchatelet marked an inline comment as done.
  • Use sandybridge with AVX

This seems to be good to go. @nafi3000 do you want to have a final look before I submit?

nafi3000 accepted this revision.Jun 9 2023, 10:11 PM
nafi3000 added inline comments.
libc/src/string/memory_utils/utils.h
157–159

nit: s/uint64_t/int64_t/ and s/uint32_t/int32_t/ in the comments.

160

For the explanation, please consider whether we can add some version of the following points:

For the int64_t to int32_t conversion we want the following properties:
- int32_t[31:31] == 1 iff diff < 0
- int32_t[31:0] == 0 iff diff == 0

We also observe that:
- When diff < 0: diff[63:32] == 0xffffffff and diff[31:0] != 0
- When diff > 0: diff[63:32] == 0 and diff[31:0] != 0
- When diff == 0: diff[63:32] == 0 and diff[31:0] == 0
- https://godbolt.org/z/8W7qWP6e5
- This implies that we need only look at diff[32:32] to determine the sign bit of the returned int32_t.

So, we do the following:
- int32_t[31:31] = diff[32:32]
- int32_t[30:0] = diff[31:0] == 0 ? 0 : non-0.

And, we can achieve the above by the expression below. We could have also used (diff64 >> 1) | (diff64 & 0x1) but (diff64 & 0xFFFF) is faster than (diff64 & 0x1). https://godbolt.org/z/j3b569rW1

We can also add all these in a separate diff.
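
For concreteness, a hedged sketch of one expression that satisfies those properties (the exact code that landed in utils.h may differ):

#include <cstdint>

// diff fits in [-(2^32 - 1), 2^32 - 1], so bit 32 carries the sign information.
// (diff >> 1) moves diff[32] into bit 31 and diff[31:1] into bits [30:0];
// OR-ing with (diff & 0xFFFF) makes sure diff[0] is not lost, so bits [30:0]
// are non-zero iff diff[31:0] is non-zero.
int32_t cmp_uint32_t(uint32_t a, uint32_t b) {
  const int64_t diff = static_cast<int64_t>(a) - static_cast<int64_t>(b);
  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
}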

gchatelet marked 3 inline comments as done.
  • Fix typos and added explanation for int64_t to int32_t conversion in cmp_uint32_t
This revision was landed with ongoing or failed builds.Jun 12 2023, 12:56 AM
This revision was automatically updated to reflect the committed changes.
gchatelet added inline comments.Jun 12 2023, 6:22 AM
libc/src/string/memory_utils/utils.h
160

The explanation is fantastic, I copied it verbatim.

This revision is now accepted and ready to land.Jun 12 2023, 6:23 AM
gchatelet updated this revision to Diff 530488.Jun 12 2023, 6:32 AM
  • Prevent pulling the limits.h header which turns out to define PTHREAD_STACK_MIN on aarch64
gchatelet updated this revision to Diff 530492.Jun 12 2023, 6:44 AM
  • Prevent pulling the limits.h header which turns out to define PTHREAD_STACK_MIN on aarch64
This revision was landed with ongoing or failed builds.Jun 12 2023, 6:47 AM
This revision was automatically updated to reflect the committed changes.
gchatelet reopened this revision.Jun 27 2023, 2:40 AM

Upon investigation the patch seems correct, but some libraries need to be updated to conform to the memcmp semantics.
One example is sqlite3, which reverses ordering by negating the result of memcmp. Since signed integer ranges are not symmetric (e.g., int8_t ∈ [-128, 127]), negating does not flip the sign when the value is INT_MIN (e.g., godbolt).
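
To make the failure mode concrete, here is a hedged illustration (my own snippet, not sqlite3's actual code):

#include <cstddef>
#include <cstring>

// Reversing an ordering by negating memcmp's result is only safe if memcmp
// never returns INT_MIN: -INT_MIN overflows (undefined behavior) and in
// practice the sign does not flip.
int reversed_compare(const void *a, const void *b, std::size_t n) {
  return -std::memcmp(a, b, n);
}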

This revision is now accepted and ready to land.Jun 27 2023, 2:40 AM
gchatelet updated this revision to Diff 535363.Jun 28 2023, 6:20 AM
  • use -5/5 instead of INT_MIN/INT_MAX for uint64 not equal comparison
gchatelet updated this revision to Diff 535369.Jun 28 2023, 6:33 AM
  • modify comment
lntue added inline comments.Jun 28 2023, 7:00 AM
libc/src/string/memory_utils/utils.h
199–202

I wonder what the tradeoffs are between this and what is generated for 1 and -1? If this is better, then the compiler should just use this for 1 and -1 as well, right?

gchatelet marked an inline comment as done.Jun 28 2023, 7:10 AM
gchatelet added inline comments.
libc/src/string/memory_utils/utils.h
199–202

I wonder what the tradeoffs are between this and what is generated for 1 and -1? If this is better, then the compiler should just use this for 1 and -1 as well, right?

x86 does not have a conditional negate, and the codegen for returning 1 and -1 has higher latency.

xor     eax, eax
cmp     rdi, rsi <- serializing
sbb     eax, eax <- dep on previous instruction
or      eax, 1   <- dep on previous instruction

I think the tradeoff is around register pressure: in the -1 / 1 case we just need eax, at the expense of a longer dependency chain.
In the -5 / 5 case we need ecx on top of eax, but the dependency chain is shorter and latency is reduced. Since latency matters for memcmp it makes more sense to use this construct.

Now TBH I haven't measured that the overall generated code is better but I'll run a few tests before landing.

https://godbolt.org/z/Gqahv7r7e
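
For reference, a hedged sketch of the -5 / 5 variant (the exact constants and helper signature in utils.h may differ):

#include <cstdint>

// Negative if a < b, positive otherwise; callers must not rely on the
// magnitude. Both constants are far from INT_MIN, so negating the result
// (see the sqlite3 note above) stays well-defined, and on x86-64 this
// compiles to cmp + two movs + cmov with no sbb dependency chain.
int32_t cmp_neq_uint64_t(uint64_t a, uint64_t b) {
  return a < b ? -5 : 5;
}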

xbolva00 added inline comments.
libc/src/string/memory_utils/utils.h
195

So they have UB in their codebases. They should really fix that instead of relying on workarounds like this one.

gchatelet marked 2 inline comments as done.Jun 28 2023, 7:49 AM
gchatelet added inline comments.
libc/src/string/memory_utils/utils.h
195

So they have UB in their codebases. They should really fix that instead of relying on workarounds like this one.

Yeah, I agree. I've been pushing for this, but we have many instances of this bug (not only in sqlite3.c) and they're quite painful to chase down. They usually show up quite far away from the actual memcmp call. Fixing all of them will take time, but we'll release the optimized version eventually, /me hopes.

nafi3000 accepted this revision.Jun 30 2023, 1:54 AM
nafi3000 added inline comments.
libc/src/string/memory_utils/utils.h
199–202

The compiler could have also used edi or esi instead of ecx. Would that cause slightly lower register pressure? E.g. why is it not doing something like:

cmp rdi, rsi
mov edi, -5
mov eax, 5
cmovb eax, edi
gchatelet marked an inline comment as done.Jun 30 2023, 5:04 AM
gchatelet added inline comments.
libc/src/string/memory_utils/utils.h
199–202

The compiler could have also used edi or esi instead of ecx. Would that cause slightly lower register pressure? E.g. why is it not doing something like:

cmp rdi, rsi
mov edi, -5
mov eax, 5
cmovb eax, edi

Not exactly sure why; it may first use the available registers (greedy algorithm) and only try extra hard to reuse them when necessary.

gchatelet updated this revision to Diff 536196.Jun 30 2023, 5:58 AM
  • rebase for reland
This revision was landed with ongoing or failed builds.Jun 30 2023, 6:01 AM
This revision was automatically updated to reflect the committed changes.