This is an archive of the discontinued LLVM Phabricator instance.

[libc] Improve memcmp latency and codegen
ClosedPublic

Authored by gchatelet on Apr 19 2023, 8:22 AM.

Details

Summary

This is based on ideas from @nafi to:

  • use a branchless version of 'cmp' for 'uint32_t',
  • completely resolve the lexicographic comparison through vector operations when wide types are available. We also get rid of byte reloads and serializing '__builtin_ctzll'.

I did not include the suggestion to replace comparisons of 'uint16_t'
with two 'uint8_t' comparisons, as it did not seem to help the codegen. This can
be revisited in subsequent patches.

The code has been rewritten to reduce nested function calls, making the
inliner's job easier and preventing harmful code duplication.
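
As a rough illustration of the first bullet above (not the exact code in this patch; load_be_u32 and cmp_u32_branchless are hypothetical names), a branchless 'cmp' for 'uint32_t' can be sketched as:

#include <cstdint>
#include <cstring>

// Load 4 bytes from each pointer, byte-swap so integer order matches memory
// (lexicographic) order on little-endian targets, then derive the ordering
// without a branch. __builtin_bswap32 assumes GCC/clang.
static inline uint32_t load_be_u32(const char *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v)); // safe unaligned load
  return __builtin_bswap32(v);
}

static inline int cmp_u32_branchless(const char *p1, const char *p2) {
  const uint32_t a = load_be_u32(p1);
  const uint32_t b = load_be_u32(p2);
  // (a > b) - (a < b) typically compiles to flag-based selects, not branches.
  return static_cast<int>(a > b) - static_cast<int>(a < b);
}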

Diff Detail

Event Timeline

gchatelet created this revision.Apr 19 2023, 8:22 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 19 2023, 8:22 AM
gchatelet requested review of this revision.Apr 19 2023, 8:22 AM
gchatelet added inline comments.Apr 19 2023, 8:26 AM
libc/src/string/CMakeLists.txt
464

I'll submit this as a separate patch.

gchatelet updated this revision to Diff 515724.Apr 21 2023, 7:23 AM
  • rebase
  • Simplifying uint32_t neq<uint64_t>

@nafi3000
I've created a trimmed-down version of this code to play with the codegen: https://godbolt.org/z/rdEG5nY1q
I defaulted the compile options so that you can compare with the code we discussed earlier, i.e. -O3 -march=haswell -mprefer-vector-width=128 -mno-avx2

Let me know what you think.
Also, Phabricator is sometimes confusing: if you've made inline comments you need to hit the Submit button at the bottom of the page, otherwise I won't see them.
Ping me offline if needed.

nafi3000 added inline comments.May 2 2023, 11:32 AM
libc/src/string/memory_utils/bcmp_implementations.h
142

OOC, how about using head_tail (2 loads) instead of BcmpSequence of 3 loads? E.g.

generic::Bcmp<uint32_t>::head_tail(p1, p2, 7)
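
To make the suggestion concrete, a plain-C++ sketch of the head_tail idea for sizes 5 to 8 (illustrative names, not the op_generic implementation) could look like:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Compare the first 4 and the last 4 bytes; the two windows overlap when
// count < 8, so together they cover every byte for count in [5, 8].
static inline uint32_t load_u32(const char *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

static inline int bcmp_head_tail_u32(const char *p1, const char *p2, std::size_t count) {
  const uint32_t head = load_u32(p1) ^ load_u32(p2);
  const uint32_t tail = load_u32(p1 + count - 4) ^ load_u32(p2 + count - 4);
  return (head | tail) != 0; // bcmp only reports equal / not equal
}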

libc/src/string/memory_utils/op_generic.h
608

OOC, have you ever tried other values? E.g. how about:

return a > b ? 0x7fffffff : 0x80000000;

or if it does not compile due to the MemcmpReturnType, then:

return a > b ? static_cast<int32_t>(0x7fffffff) : static_cast<int32_t>(0x80000000);

-1 and 1 are 2 values apart.
0x7fffffff and 0x80000000 are 1 value apart.

assembly diff: https://www.diffchecker.com/LMVfxJ1D/

xor eax, eax	
cmp rcx, r8	
sbb eax, eax	
or eax, 1

vs

cmp r8, rcx
mov eax, -2147483648
sbb eax, 0

In theory... the former should take 3 cycles (the or waits for the sbb, which waits for the cmp) while the latter should take 2 cycles (the cmp and mov can happen in parallel, with the sbb happening after the cmp), right?
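
For reference, the two assembly sequences roughly correspond to the following C++ (a sketch; exact codegen depends on the compiler and flags, function names are illustrative):

#include <cstdint>

int32_t cmp_neq_classic(uint64_t a, uint64_t b) {
  return a < b ? -1 : 1; // xor/cmp/sbb/or: the or depends on the sbb
}

int32_t cmp_neq_adjacent(uint64_t a, uint64_t b) {
  // 0x80000000 and 0x7fffffff are one apart, so the selection folds into
  // cmp + sbb on a constant, with the mov issued in parallel.
  return a > b ? static_cast<int32_t>(0x7fffffff) : static_cast<int32_t>(0x80000000);
}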

libc/src/string/memory_utils/op_x86.h
69–70

[optional nit] Maybe not in this diff, but eventually we can programmatically generate the sequences here and at lines 129 and 160 below.

201–202

Would it make sense to factor out this part to another function?
This is used here and for cmp<uint32_t>.

232

Ditto, -1 : 1 vs 0x80000000 : 0x7fffffff

libc/src/string/memory_utils/x86_64/memcmp_implementations.h
75

Ditto. Similar to the bcmp comment, how about using head_tail (2 loads) instead of MemcmpSequence of 3 loads? E.g.

generic::Memcmp<uint32_t>::head_tail(p1, p2, 7)

x86 asm diff:
https://www.diffchecker.com/XQNu3lGN/
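
For reference, a plain-C++ sketch of the memcmp flavor of head_tail for sizes 5 to 8 (little-endian assumed, names illustrative): the head decides the ordering when it differs, otherwise the tail does.

#include <cstddef>
#include <cstdint>
#include <cstring>

static inline uint32_t load_be_u32(const char *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return __builtin_bswap32(v); // match memory order on little-endian targets
}

static inline int memcmp_head_tail_u32(const char *p1, const char *p2, std::size_t count) {
  // Valid for count in [5, 8]: the head covers bytes [0, 4) and the tail
  // covers bytes [count - 4, count), so every byte is compared.
  const uint32_t a_head = load_be_u32(p1);
  const uint32_t b_head = load_be_u32(p2);
  if (a_head != b_head)
    return a_head < b_head ? -1 : 1;
  const uint32_t a_tail = load_be_u32(p1 + count - 4);
  const uint32_t b_tail = load_be_u32(p2 + count - 4);
  return static_cast<int>(a_tail > b_tail) - static_cast<int>(a_tail < b_tail);
}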

libc/test/src/string/memory_utils/op_tests.cpp
220

Do we need to add generic::BcmpSequence<uint32_t, uint8_t> and generic::BcmpSequence<uint32_t, uint16_t> here too? I am interpreting the above list as:
8, 1, 2, 4, 1+1, 1+1+1, 2+1, 4+2+1

nafi3000 added inline comments.May 3 2023, 10:04 AM
libc/src/string/memory_utils/op_generic.h
407–417

I wonder if it is better to use cmp<T> only for the last comparison. Motivation is that for non-last compare blocks we need to check the comparison result anyway (e.g. line 470 above) to decide whether to load and compare the next block in the sequence. Isn't it better to compute this decision (0 or non-0) as early as possible instead of computing the full cmp result (0, <0 or >0)?

E.g.

if constexpr (sizeof...(TS) == 0) {
  if constexpr (cmp_is_expensive<T>::value) {
    if (eq<T>(p1, p2, 0))
      return MemcmpReturnType::ZERO();
    return cmp_neq<T>(p1, p2, 0);
  } else {
    return cmp<T>(p1, p2, 0);
  }
} else {
  if (!eq<T>(p1, p2, 0))
    return cmp_neq<T>(p1, p2, 0);
  return MemcmpSequence<TS...>::block(p1 + sizeof(T), p2 + sizeof(T));
}

And, for the last block, I wonder if we can invariably call cmp<T> instead. What is better would depend on data. E.g. for __m512i, cmp<T> is faster if there is at least 1 byte mismatch in the last 64 bytes.

gchatelet updated this revision to Diff 520983.May 10 2023, 7:03 AM
gchatelet marked 7 inline comments as done.
  • Address most comments
gchatelet added inline comments.May 10 2023, 7:03 AM
libc/src/string/memory_utils/bcmp_implementations.h
142

Yeah, there are a bunch of options here. Usually I want to use head_tail to cover a range of sizes as it clearly reduces the overall code size (all sizes from 5 to 8 with only two loads per pointer).
Depending on how often those sizes appear in the size distribution, it might be useful to special-case the code.
I'm tempted to keep the previous logic to prevent regressions and make this a separate change. WDYT?

libc/src/string/memory_utils/op_generic.h
608

Nice one. Picking other values was on my TODO list but I never thought it through.

I had a look at the codegen for armv8: it uses cinv instead of cneg, but it seems to be neutral in terms of performance.
https://godbolt.org/z/Y9aGq5sPd

For x86, in theory this should indeed be better 👍.
https://godbolt.org/z/69Gefhqef

For RISC-V it seems to be worse, as it generates a branch.
https://godbolt.org/z/bvqMeMjMP
This seems to be in line with what I found on Stack Overflow.

Now, I tried it on the full implementation and the additional branch seems to be outlined, leading to the same code size. I don't know the impact on speed, but since we don't yet have an optimized version of memcmp for RISC-V we can happily revisit the function later on.
I've created a separate function so we can keep track of the rationale.

libc/src/string/memory_utils/op_x86.h
69–70

AFAICT we can't do it with the Intel intrinsics as they are real functions expecting a fixed number of arguments.
It may be possible to generate them with GCC and clang vector extensions and then convert them back to Intel types, though.

I gave it a try but it's brittle on clang and fails on GCC
https://godbolt.org/z/Ms7fW5nP3

Not sure it's worth it.

201–202

Yes, I've done so for the uint64_t so let's factor this out for uint32_t as well.

libc/test/src/string/memory_utils/op_tests.cpp
220

Done. More coverage doesn't hurt : )

gchatelet updated this revision to Diff 520990.May 10 2023, 7:50 AM
  • Rebase
  • Simplifying uint32_t neq<uint64_t>
  • Address most comments
  • Fix bazel build
lntue added a subscriber: lntue.May 10 2023, 9:16 AM
lntue added inline comments.
libc/src/string/memory_utils/bcmp_implementations.h
92

Currently we only check for SSE4.2 in our CMake build https://github.com/llvm/llvm-project/blob/main/libc/cmake/modules/LLVMLibCCheckCpuFeatures.cmake#L9

Do you want to change or add SSE4.1 to the list instead?

gchatelet marked an inline comment as done.May 11 2023, 4:48 AM
gchatelet added inline comments.
libc/src/string/memory_utils/bcmp_implementations.h
92

Technically this code only needs SSE4.1 but I don't think it's worth discriminating between the two.
https://en.wikipedia.org/wiki/SSE4#SSE4_subsets

Looking at the Steam hardware survey (click the Other Settings line at the bottom of the page), the share of CPUs having SSE4.1 but not SSE4.2 is about 0.23%.

  • SSE4.1 : 99.36%
  • SSE4.2 : 99.13%

So we can basically only discriminate between SSE2 and SSE4.2 and call it a day : )

libc/src/string/memory_utils/op_generic.h
407–417

The current benchmark works for all memory functions but is only able to assess the throughput of functions under a particular size distribution.
As-is, it is not a good tool to evaluate functions that may return early (bcmp and memcmp) and for which latency is important.

I'll work on adding a latency benchmark based on the one you provided me earlier. Once it's done I think we will be in a better position to decide which strategy is better.

SGTY?

gchatelet updated this revision to Diff 522959.May 17 2023, 2:05 AM
gchatelet marked an inline comment as done.
  • rebase and merge RISCV changes
gchatelet updated this revision to Diff 524646.May 23 2023, 4:04 AM
  • rebase and add a TODO to explore more optimization once we have proper latency benchmarking.

@nafi3000 I've been busy and haven't had time to work on the benchmark yet.
I will land this change as-is if you accept this revision (Menu : Add Action... -> Accept Revision). We can iterate on this through subsequent patches. WDYT?

nafi3000 accepted this revision.Jun 2 2023, 7:33 PM
nafi3000 added inline comments.
libc/src/string/memory_utils/bcmp_implementations.h
142

Separate change SGTM.

libc/src/string/memory_utils/op_generic.h
407–417

Sounds good.

libc/src/string/memory_utils/op_x86.h
69–70

We can use _mm_load_si128 instead of _mm_set_epi8. Snappy code has some example:
https://github.com/google/snappy/blob/main/snappy.cc
Search for pattern_generation_masks.

Anyway, this can be addressed in a separate diff. And like you mentioned, it may not be worth it. In snappy we actually need an array of such shuffle masks, but in this case we just need one.
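
A minimal sketch of that approach for a single mask constant (names are illustrative, not the snappy or libc code):

#include <immintrin.h>
#include <cstdint>

// Keep the shuffle mask as a static, 16-byte-aligned constant and load it
// with _mm_load_si128 instead of spelling it out with _mm_set_epi8.
alignas(16) static constexpr uint8_t kFirst8BytesMask[16] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};

static inline __m128i first_8_bytes_mask() {
  return _mm_load_si128(reinterpret_cast<const __m128i *>(kFirst8BytesMask));
}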

This revision is now accepted and ready to land.Jun 2 2023, 7:33 PM
This revision was automatically updated to reflect the committed changes.
gchatelet reopened this revision.Jun 5 2023, 4:43 AM

Reopening to fix aarch64 and riscv

This revision is now accepted and ready to land.Jun 5 2023, 4:43 AM
gchatelet updated this revision to Diff 529308.Jun 7 2023, 8:07 AM
  • Fix aarch64 and RISCV implementations
gchatelet updated this revision to Diff 529552.Jun 8 2023, 4:16 AM
  • Specialize types per architecture
gchatelet updated this revision to Diff 529554.Jun 8 2023, 4:30 AM
  • Forgot to add libc/src/string/memory_utils/aarch64/memcmp_implementations.h
gchatelet updated this revision to Diff 529560.Jun 8 2023, 5:05 AM
  • Add type specialization for RISCV

I still need to check that the ARM platform is not impacted by this patch. I'll land it once it's done.

gchatelet updated this revision to Diff 529581.Jun 8 2023, 6:56 AM
  • Add missing namespace for RISCV
  • Also include riscv32
lntue added inline comments.Jun 8 2023, 7:30 AM
libc/src/string/CMakeLists.txt
573

If you drop the requirement to AVX, should the compile option be -march=sandybridge instead?

gchatelet updated this revision to Diff 529858.Jun 9 2023, 2:01 AM
  • Disable non uint8_t type tests for ARM platform
gchatelet updated this revision to Diff 529866.Jun 9 2023, 2:26 AM
gchatelet marked an inline comment as done.
  • Use sandybridge with AVX

This seems to be good to go. @nafi3000 do you want to have a final look before I submit?

nafi3000 accepted this revision.Jun 9 2023, 10:11 PM
nafi3000 added inline comments.
libc/src/string/memory_utils/utils.h
171–173

nit: s/uint64_t/int64_t/ and s/uint32_t/int32_t/ in the comments.

174

For the explanation, please consider whether we can add some version of the following points:

For the int64_t to int32_t conversion we want the following properties:
- int32_t[31:31] == 1 iff diff < 0
- int32_t[31:0] == 0 iff diff == 0

We also observe that:
- When diff < 0: diff[63:32] == 0xffffffff and diff[31:0] != 0
- When diff > 0: diff[63:32] == 0 and diff[31:0] != 0
- When diff == 0: diff[63:32] == 0 and diff[31:0] == 0
- https://godbolt.org/z/8W7qWP6e5
- This implies that we can look at just diff[32:32] to determine the sign bit for the returned int32_t.

So, we do the following:
- int32_t[31:31] = diff[32:32]
- int32_t[30:0] = diff[31:0] == 0 ? 0 : non-0.

And, we can achieve the above by the expression below. We could have also used (diff64 >> 1) | (diff64 & 0x1) but (diff64 & 0xFFFF) is faster than (diff64 & 0x1). https://godbolt.org/z/j3b569rW1

We can also add all these in a separate diff.
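
Since the diff context is not shown in this archive, here is a sketch of what such a conversion could look like (the exact expression and naming in the patch may differ):

#include <cstdint>

static inline int32_t cmp_uint32_sketch(uint32_t a, uint32_t b) {
  const int64_t diff = static_cast<int64_t>(a) - static_cast<int64_t>(b);
  // (diff >> 1) moves diff[32:32] into bit 31 of the truncated result, giving
  // the correct sign; OR-ing (diff & 0xFFFF) keeps the result non-zero
  // whenever diff[31:0] != 0.
  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
}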

gchatelet marked 3 inline comments as done.
  • Fix typos and add an explanation for the int64_t to int32_t conversion in cmp_uint32_t
This revision was landed with ongoing or failed builds.Jun 12 2023, 12:56 AM
This revision was automatically updated to reflect the committed changes.
gchatelet added inline comments.Jun 12 2023, 6:22 AM
libc/src/string/memory_utils/utils.h
174

The explanation is fantastic, I copied it verbatim.

This revision is now accepted and ready to land.Jun 12 2023, 6:23 AM
gchatelet updated this revision to Diff 530488.Jun 12 2023, 6:32 AM
  • Prevent pulling the limits.h header which turns out to define PTHREAD_STACK_MIN on aarch64
gchatelet updated this revision to Diff 530492.Jun 12 2023, 6:44 AM
  • Prevent pulling the limits.h header which turns out to define PTHREAD_STACK_MIN on aarch64
This revision was landed with ongoing or failed builds.Jun 12 2023, 6:47 AM
This revision was automatically updated to reflect the committed changes.
gchatelet reopened this revision.Jun 27 2023, 2:40 AM

Upon investigation the patch seems correct, but some libraries need to be updated to conform to the memcmp semantics.
One example is sqlite3, which reverses an ordering by negating the result of memcmp. Since signed integer ranges are not symmetric (e.g., int8_t ∈ [-128, 127]), negating the result does not flip its sign when the value is INT_MIN (e.g., godbolt).
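
Hypothetical illustration of the pattern (not actual sqlite3 code):

#include <cstddef>
#include <cstring>

// Reversing a sort order by negating memcmp's result. If memcmp returns
// INT_MIN, the negation overflows a signed int (undefined behavior) and in
// practice still yields INT_MIN, so the "reversed" comparison is wrong.
int reverse_compare(const void *a, const void *b, std::size_t n) {
  return -std::memcmp(a, b, n); // broken when memcmp returns INT_MIN
}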

This revision is now accepted and ready to land.Jun 27 2023, 2:40 AM
gchatelet updated this revision to Diff 535363.Jun 28 2023, 6:20 AM
  • use -5/5 instead of INT_MIN/INT_MAX for uint64 not equal comparison
gchatelet updated this revision to Diff 535369.Jun 28 2023, 6:33 AM
  • modify comment
lntue added inline comments.Jun 28 2023, 7:00 AM
libc/src/string/memory_utils/utils.h
213–216

I wonder what the tradeoffs are between this and what is generated for 1 and -1? If this is better, then the compiler should just use this for 1 and -1 as well, right?

gchatelet marked an inline comment as done.Jun 28 2023, 7:10 AM
gchatelet added inline comments.
libc/src/string/memory_utils/utils.h
213–216

I wonder what the tradeoffs are between this and what is generated for 1 and -1? If this is better, then the compiler should just use this for 1 and -1 as well, right?

x86 does not have a conditional negate, and the codegen for returning 1 and -1 has higher latency.

xor     eax, eax
cmp     rdi, rsi <- serializing
sbb     eax, eax <- dep on previous instruction
or      eax, 1   <- dep on previous instruction

I think the tradeoff is around register pressure: in the -1 / 1 case we just need eax, at the expense of a longer dependency chain.
In the -5 / 5 case we need ecx on top of eax, but the dependency chain is shorter so latency is reduced. Since latency matters for memcmp it makes more sense to use this construct.

Now, TBH, I haven't measured whether the overall generated code is better, but I'll run a few tests before landing.

https://godbolt.org/z/Gqahv7r7e
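
For context, a sketch of the construct under discussion (naming illustrative; this mirrors the -5 / 5 change in this revision):

#include <cstdint>

static inline int32_t cmp_neq_uint64_sketch(uint64_t a, uint64_t b) {
  // With -1 / 1 the compiler pattern-matches the xor/cmp/sbb/or idiom; with
  // -5 / 5 it materializes both constants and selects with cmov, shortening
  // the dependency chain. Avoiding INT_MIN also keeps the result safe for
  // callers that negate memcmp's return value.
  return a < b ? -5 : 5;
}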

xbolva00 added inline comments.
libc/src/string/memory_utils/utils.h
209

So they have UB in their codebases. They should really fix it instead of relying on workarounds like this one.

gchatelet marked 2 inline comments as done.Jun 28 2023, 7:49 AM
gchatelet added inline comments.
libc/src/string/memory_utils/utils.h
209

So they have UB in their codebases. They should really fix it instead of relying on workarounds like this one.

Yeah, I agree. I've been pushing for this, but we have many instances of this bug (not only in sqlite3.c) and they're quite painful to chase down. They usually show up quite far away from the actual memcmp call. Fixing all of them will take time, but we'll release the optimized version eventually, I hope.

nafi3000 accepted this revision.Jun 30 2023, 1:54 AM
nafi3000 added inline comments.
libc/src/string/memory_utils/utils.h
213–216

The compiler could have also used edi or esi instead of ecx. Would that cause slightly lower register pressure? E.g. why is it not doing something like:

cmp rdi, rsi
mov edi, -5
mov eax, 5
cmovb eax, edi
gchatelet marked an inline comment as done.Jun 30 2023, 5:04 AM
gchatelet added inline comments.
libc/src/string/memory_utils/utils.h
213–216

The compiler could have also used edi or esi instead of ecx. Would that cause slightly lower register pressure? E.g. why is it not doing something like:

cmp rdi, rsi
mov edi, -5
mov eax, 5
cmovb eax, edi

Not exactly sure why; it may first use whatever registers are available (greedy algorithm) and only try harder to reuse them when necessary?

gchatelet updated this revision to Diff 536196.Jun 30 2023, 5:58 AM
  • rebase for reland
This revision was landed with ongoing or failed builds.Jun 30 2023, 6:01 AM
This revision was automatically updated to reflect the committed changes.