This is an archive of the discontinued LLVM Phabricator instance.

Differential D107848

[libc] Add optimized memset for AArch64
ClosedPublic

Authored by avieira on Aug 10 2021, 10:56 AM.

Details

Reviewers

gchatelet
sivachandra

Commits

rG8b87c3d57367: [libc] Add optimized memset for AArch64

Summary

Hi,

This is an optimized version of memset for AArch64, improving on the general implementation.

I believe there is still room for improvement in the generated code, but I suggest addressing that in follow-ups.

Things I'd like to look at are:

a different Bump for single operands like SplatSet: if you only need to align a single pointer, I believe it is simpler to just mask away the bottom bits;
introducing a DoLoop that uses a do-while loop rather than a for loop, for situations where we know there is at least one iteration; ideally the compiler would figure this out through value-range analysis, but it currently does not, so maybe we can help it out;
changing Chained to do the Tail last, thus creating chains of stores and loads that are contiguous, as I suspect that could help with fetching behaviour in loops;
trying to get the non-dc-zva loop in memset to use stores with post-increment, as that could help reduce the loop to two stores, a cmp and a branch. This will probably require compiler changes though.
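To make the first two follow-up ideas concrete, here is a minimal, portable sketch. It is not code from this patch: `align_down` and `fill_at_least_one_block` are hypothetical names, the 16-byte block size is an assumption, and `std::memset` stands in for a single vector store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: align a single pointer down to a power-of-two boundary
// by masking away its bottom bits.
template <std::size_t Alignment>
inline char *align_down(char *ptr) {
  static_assert((Alignment & (Alignment - 1)) == 0, "power of two required");
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
  return ptr - (addr & (Alignment - 1));
}

// Hypothetical do-while fill. The caller guarantees at least one full block,
// so the loop body can run before the first comparison.
inline void fill_at_least_one_block(char *dst, unsigned char value,
                                    std::size_t count) {
  constexpr std::size_t kBlock = 16; // assumed block size
  assert(count >= kBlock);
  char *const last = dst + (count - kBlock); // start of the final block
  do {
    std::memset(dst, value, kBlock); // stand-in for one 16-byte store
    dst += kBlock;
  } while (dst < last);
  std::memset(last, value, kBlock); // tail store, may overlap the loop's output
}
```

The tail store overlaps bytes that were already written whenever `count` is not a multiple of the block size, which avoids a byte-by-byte remainder loop.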

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

avieira created this revision. Aug 10 2021, 10:56 AM

Herald added subscribers: ecnelises, tschuett, kristof.beyls, mgorny. Aug 10 2021, 10:56 AM

avieira requested review of this revision. Aug 10 2021, 10:56 AM

Harbormaster completed remote builds in B118939: Diff 365540. Aug 10 2021, 11:57 AM

Matt added a subscriber: Matt. Aug 10 2021, 1:24 PM

@gchatelet is on vacation and will be back early September. I would prefer that he reviews this patch. Is it OK to wait?

Yeah no problem!

Thx for the patch Andre!

Quick questions:

Are dc zva and mrs broadly available on aarch64 or should they be guarded by some #define?
Could we use compiler builtins instead?

PS: I'll be available this week (then on vacation again).

libc/src/string/aarch64/memset.cpp
18–29	Hmm the unaligned `Tail<_64>` is preventing the reuse of the `Loop` logic... I'll update the code to separate the looping element from the tail element.
libc/src/string/memory_utils/elements_aarch64.h
22–42	if `__ARM_NEON` is undefined, the `aarch64_memset` namespace will not exist which will make the code fail to compile in `libc/src/string/aarch64/memset.cpp`.
38–39	maybe `using _32 = Repeated<_16, 2>; using _64 = Repeated<_16, 4>;`

gchatelet added inline comments. Aug 17 2021, 1:56 AM

libc/src/string/aarch64/memset.cpp
18–29	I've submitted rG8e4efad9917ce0b7d1751c34a8d6907e610050e6. You can now write:

    struct ZVA {
      static constexpr size_t kSize = 64;
      static void SplatSet(char *dst, const unsigned char value, size_t size) {
        assert(value == 0);
        asm("dc zva, %[dst]" : : [dst] "r"(dst) : "memory");
      }
    };

And use `SplatSet<Align<_64, Arg::_1>::Then<Loop<ZVA, _64>>>(dst, 0, count);`
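As a rough illustration of what such an align-then-loop composition might do, the sketch below writes an unaligned head block, bumps the pointer to the next 64-byte boundary, runs an aligned-only element over the full blocks, and finishes with an unaligned tail block. This is an assumption about the intent, not the memory_utils implementation: `ZeroBlock64`, `Store64` and `align_then_loop` are hypothetical names, and `std::memset` stands in for `dc zva` and for the vector stores so that the sketch stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the ZVA element: zeroes one naturally aligned 64-byte block.
// Real code would issue `dc zva`; std::memset keeps the sketch portable.
struct ZeroBlock64 {
  static constexpr std::size_t kSize = 64;
  static void SplatSet(char *dst, unsigned char value) {
    assert(value == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % kSize == 0);
    std::memset(dst, 0, kSize);
  }
};

// Stand-in for a plain 64-byte store with no alignment requirement.
struct Store64 {
  static constexpr std::size_t kSize = 64;
  static void SplatSet(char *dst, unsigned char value) {
    std::memset(dst, value, kSize);
  }
};

// Hypothetical composition: unaligned head, aligned loop, unaligned tail.
template <typename AlignedBlock, typename Unaligned>
void align_then_loop(char *dst, unsigned char value, std::size_t count) {
  static_assert(AlignedBlock::kSize == Unaligned::kSize, "sizes must match");
  constexpr std::size_t kSize = AlignedBlock::kSize;
  assert(count >= kSize);
  char *const end = dst + count;
  Unaligned::SplatSet(dst, value); // head, covers up to the next boundary
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(dst);
  dst += kSize - (addr & (kSize - 1)); // bump to the next 64-byte boundary
  while (static_cast<std::size_t>(end - dst) >= kSize) {
    AlignedBlock::SplatSet(dst, value); // aligned full blocks only
    dst += kSize;
  }
  Unaligned::SplatSet(end - kSize, value); // tail, may overlap earlier stores
}
```

A call such as `align_then_loop<ZeroBlock64, Store64>(dst, 0, count)` requires `count >= 64`; smaller sizes would need a separate path.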

> Are dc zva and mrs broadly available on aarch64 or should they be guarded by some #define?

They are available for any AArch64 ISA, so no define is required since this is for AArch64 only.

> Could we use compiler builtins instead?

There are currently no compiler builtins for these.
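For readers unfamiliar with the `mrs` read, the sketch below shows how the DCZID_EL0 value can be decoded. In that register, bit 4 (DZP) reports that `dc zva` is prohibited, and bits [3:0] (BS) give log2 of the block size in 4-byte words. The names `ZvaInfo`, `decode_dczid`, `read_dczid` and `can_use_64_byte_zva` are hypothetical and are not part of the patch; the register read itself is guarded so that the decoding logic compiles on any host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decoded view of DCZID_EL0 (hypothetical struct name).
struct ZvaInfo {
  bool permitted;          // true when the DZP bit (bit 4) is clear
  std::size_t block_bytes; // 4 << BS, where BS is bits [3:0]
};

inline ZvaInfo decode_dczid(std::uint64_t dczid) {
  const bool permitted = (dczid & 0x10) == 0;
  const std::size_t block_bytes = std::size_t(4) << (dczid & 0xF);
  return {permitted, block_bytes};
}

#if defined(__aarch64__)
// Only meaningful on AArch64; uses GCC/Clang extended inline assembly.
inline std::uint64_t read_dczid() {
  std::uint64_t value;
  asm("mrs %[value], dczid_el0" : [value] "=r"(value));
  return value;
}
#endif

// Same condition as testing that the low five bits equal 4:
// zeroing is permitted and the block size is exactly 64 bytes.
inline bool can_use_64_byte_zva(std::uint64_t dczid) {
  const ZvaInfo info = decode_dczid(dczid);
  return info.permitted && info.block_bytes == 64;
}
```

Checking for exactly 64 bytes keeps the zeroing loop simple, at the cost of falling back to ordinary stores on cores that report a different block size.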

Harbormaster completed remote builds in B124399: Diff 373229. Sep 17 2021, 8:56 AM

gchatelet accepted this revision. Sep 21 2021, 7:29 AM

This revision is now accepted and ready to land. Sep 21 2021, 7:29 AM

Closed by commit rG8b87c3d57367: [libc] Add optimized memset for AArch64 (authored by avieira). Sep 23 2021, 1:22 AM

This revision was automatically updated to reflect the committed changes.

avieira added a commit: rG8b87c3d57367: [libc] Add optimized memset for AArch64.

Herald added a project: Restricted Project. Sep 23 2021, 1:22 AM

Herald added a subscriber: libc-commits.

Revision Contents

Path	Size
libc/src/string/CMakeLists.txt	9 lines
libc/src/string/aarch64/memset.cpp	49 lines
libc/src/string/memory_utils/elements_aarch64.h	48 lines

Diff 374468

libc/src/string/CMakeLists.txt

[... 335 lines not shown ...]
 endif()

 # ------------------------------------------------------------------------------
 # memset
 # ------------------------------------------------------------------------------

 function(add_memset memset_name)
   add_implementation(memset ${memset_name}
-    SRCS ${LIBC_SOURCE_DIR}/src/string/memset.cpp
+    SRCS ${MEMSET_SRC}
     HDRS ${LIBC_SOURCE_DIR}/src/string/memset.h
     DEPENDS
       .memory_utils.memory_utils
       libc.include.string
     COMPILE_OPTIONS
       -fno-builtin-memset
     ${ARGN}
   )
 endfunction()

 if(${LIBC_TARGET_ARCHITECTURE_IS_X86})
+  set(MEMSET_SRC ${LIBC_SOURCE_DIR}/src/string/memset.cpp)
   add_memset(memset_x86_64_opt_sse2 COMPILE_OPTIONS -march=k8 REQUIRE SSE2)
   add_memset(memset_x86_64_opt_sse4 COMPILE_OPTIONS -march=nehalem REQUIRE SSE4_2)
   add_memset(memset_x86_64_opt_avx2 COMPILE_OPTIONS -march=haswell REQUIRE AVX2)
   add_memset(memset_x86_64_opt_avx512 COMPILE_OPTIONS -march=skylake-avx512 REQUIRE AVX512F)
   add_memset(memset_opt_host COMPILE_OPTIONS ${LIBC_COMPILE_OPTIONS_NATIVE})
   add_memset(memset)
+elseif(${LIBC_TARGET_ARCHITECTURE_IS_AARCH64})
+  set(MEMSET_SRC ${LIBC_SOURCE_DIR}/src/string/aarch64/memset.cpp)
+  add_memset(memset_opt_host COMPILE_OPTIONS ${LIBC_COMPILE_OPTIONS_NATIVE}
+                             COMPILE_OPTIONS "SHELL:-mllvm --tail-merge-threshold=0")
+  add_memset(memset COMPILE_OPTIONS "SHELL:-mllvm --tail-merge-threshold=0")
 else()
+  set(MEMSET_SRC ${LIBC_SOURCE_DIR}/src/string/memset.cpp)
   add_memset(memset_opt_host COMPILE_OPTIONS ${LIBC_COMPILE_OPTIONS_NATIVE})
   add_memset(memset)
 endif()

libc/src/string/aarch64/memset.cpp

This file was added.

//===-- Implementation of memset ------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "src/string/memset.h"
#include "src/__support/common.h"
#include "src/string/memory_utils/memset_utils.h"

namespace __llvm_libc {

using namespace __llvm_libc::aarch64_memset;

inline static void AArch64Memset(char *dst, int value, size_t count) {
  if (count == 0)
    return;
  if (count <= 3) {
    SplatSet<_1>(dst, value);
    if (count > 1)
      SplatSet<Tail<_2>>(dst, value, count);
    return;
  }
  if (count <= 8)
    return SplatSet<HeadTail<_4>>(dst, value, count);
  if (count <= 16)
    return SplatSet<HeadTail<_8>>(dst, value, count);
  if (count <= 32)
    return SplatSet<HeadTail<_16>>(dst, value, count);
  if (count <= 96) {
    SplatSet<_32>(dst, value);
    if (count <= 64)
      return SplatSet<Tail<_32>>(dst, value, count);
    SplatSet<Skip<32>::Then<_32>>(dst, value);
    SplatSet<Tail<_32>>(dst, value, count);
    return;
  }
  if (count < 448 || value != 0 || !AArch64ZVA(dst, count))
    return SplatSet<Align<_16, Arg::_1>::Then<Loop<_64>>>(dst, value, count);
}

LLVM_LIBC_FUNCTION(void *, memset, (void *dst, int value, size_t count)) {
  AArch64Memset((char *)dst, value, count);
  return dst;
}

} // namespace __llvm_libc
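The small-size branches above rely on overlapping head and tail stores: for a size between N and 2N bytes, one N-byte store at the start and one N-byte store ending at the last byte cover the whole range without a loop. The sketch below reproduces that idea in portable C++ for sizes up to 32 bytes. It is an illustration only: `store_block`, `head_tail_set` and `small_memset` are hypothetical names, and `std::memcpy` of a small filled buffer stands in for a single scalar or vector store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Store one N-byte block filled with `value` at dst (stand-in for a single
// scalar or vector store).
template <std::size_t N>
inline void store_block(char *dst, unsigned char value) {
  unsigned char block[N];
  for (std::size_t i = 0; i < N; ++i)
    block[i] = value;
  std::memcpy(dst, block, N);
}

// Hypothetical head/tail fill for N <= count <= 2 * N: the two stores overlap
// in the middle whenever count < 2 * N.
template <std::size_t N>
inline void head_tail_set(char *dst, unsigned char value, std::size_t count) {
  assert(count >= N && count <= 2 * N);
  store_block<N>(dst, value);             // head
  store_block<N>(dst + count - N, value); // tail
}

// Size dispatch mirroring the thresholds used for counts up to 32 bytes.
inline void small_memset(char *dst, unsigned char value, std::size_t count) {
  assert(count <= 32);
  if (count == 0)
    return;
  if (count <= 3) {
    store_block<1>(dst, value);
    if (count > 1)
      store_block<2>(dst + count - 2, value);
    return;
  }
  if (count <= 8)
    return head_tail_set<4>(dst, value, count);
  if (count <= 16)
    return head_tail_set<8>(dst, value, count);
  head_tail_set<16>(dst, value, count);
}
```

Each branch issues at most two stores, so the cost for small sizes is a few compares and branches rather than a per-byte loop.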

libc/src/string/memory_utils/elements_aarch64.h

[... 12 lines not shown ...]
 #include <stddef.h> // size_t
 #include <stdint.h> // uint8_t, uint16_t, uint32_t, uint64_t

 #ifdef __ARM_NEON
 #include <arm_neon.h>
 #endif

 namespace __llvm_libc {
+namespace aarch64_memset {
+#ifdef __ARM_NEON
+struct Splat8 {
+  static constexpr size_t kSize = 8;
+  static void SplatSet(char *dst, const unsigned char value) {
+    vst1_u8((uint8_t *)dst, vdup_n_u8(value));
+  }
+};
+
+struct Splat16 {
+  static constexpr size_t kSize = 16;
+  static void SplatSet(char *dst, const unsigned char value) {
+    vst1q_u8((uint8_t *)dst, vdupq_n_u8(value));
+  }
+};
+
+using _8 = Splat8;
+using _16 = Splat16;
+#else
+using _8 = __llvm_libc::scalar::_8;
+using _16 = Repeated<_8, 2>;
+#endif // __ARM_NEON
+
+using _1 = __llvm_libc::scalar::_1;
+using _2 = __llvm_libc::scalar::_2;
+using _3 = __llvm_libc::scalar::_3;
+using _4 = __llvm_libc::scalar::_4;
+using _32 = Chained<_16, _16>;
+using _64 = Chained<_32, _32>;
+
+struct ZVA {
+  static constexpr size_t kSize = 64;
+  static void SplatSet(char *dst, const unsigned char value) {
+    asm("dc zva, %[dst]" : : [dst] "r"(dst) : "memory");
+  }
+};
+
+inline static bool AArch64ZVA(char *dst, size_t count) {
+  uint64_t zva_val;
+  asm("mrs %[zva_val], dczid_el0" : [zva_val] "=r"(zva_val));
+  if ((zva_val & 31) != 4)
+    return false;
+  SplatSet<Align<_64, Arg::_1>::Then<Loop<ZVA, _64>>>(dst, 0, count);
+  return true;
+}
+
+} // namespace aarch64_memset
+
 namespace aarch64 {

 using _1 = __llvm_libc::scalar::_1;
 using _2 = __llvm_libc::scalar::_2;
 using _3 = __llvm_libc::scalar::_3;
 using _4 = __llvm_libc::scalar::_4;
 using _8 = __llvm_libc::scalar::_8;
 using _16 = __llvm_libc::scalar::_16;
[... 40 lines not shown ...]