This is an archive of the discontinued LLVM Phabricator instance.

[LIBC] Add optimized memcpy routine for AArch64
ClosedPublic

Authored by avieira on Nov 27 2020, 9:12 AM.

Details

Reviewers
sivachandra
dxf
gchatelet
Group Reviewers
Restricted Project
Summary

Hi all,

This is an optimized memcpy routine for AArch64 using lessons learned from Arm's Optimized Routines (AOR) memcpy (the aarch64 Advanced SIMD variant, to be precise).
I benchmarked this on a Neoverse-N1 and it beats the current default implementation for both the small and big configurations.

I am not entirely familiar with how the compile-time libc implementation selection works, so, as you can see from this patch, I've enabled this for 'AArch64', though the community may want to choose different implementations for other AArch64 cores. For instance, I believe the non-Advanced SIMD variant of AOR's memcpy has been found to work better on earlier AArch64 cores.

I'm curious to hear how the maintainers/community feel about this.

Kind Regards,
Andre

Diff Detail

Event Timeline

avieira created this revision.Nov 27 2020, 9:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 27 2020, 9:12 AM
avieira requested review of this revision.Nov 27 2020, 9:12 AM

Thanks for the patch. I am adding @gchatelet as a reviewer as he is the current owner/maintainer of the mem* functions.

My visual scan says the structuring is alright; if libc builds with this patch applied and the tests pass, I guess that is proof enough. But I will let @gchatelet drive the code review here.

gchatelet added inline comments.Dec 1 2020, 4:20 AM
libc/src/string/CMakeLists.txt
84

This change is not needed: it will be handled by the else() clause.
We have a special case for x86 to be able to support 32- and 64-bit architectures with the same code.

libc/src/string/aarch64/memcpy.cpp
38

Please use the likely and unlikely macros from src/__support/common.h.def.
Here and below.
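
For reference, these macros are assumed here (sketch only, not verified against the exact header) to be the usual __builtin_expect wrappers used to mark hot and cold branches in the size dispatch:

    // Assumed shape of the macros in src/__support/common.h.def (sketch only):
    #define likely(x) __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    // Example use: tell the compiler the large-copy path is cold.
    // if (unlikely(count >= 128))
    //   return CopyLargeCopy(dst, src, count); // hypothetical helper name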

40–41

Pairs of CopyBlock used to produce an overlapping copy can use CopyBlockOverlap from src/string/memory_utils/memcpy_utils.h instead.
Simply replace these two lines with CopyBlockOverlap<32>(dst, src, count).
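
For reference, the helper has roughly this shape (a sketch of the idea; memcpy_utils.h is the authoritative version):

    // Sketch of CopyBlockOverlap (assumed shape, not verbatim): copy one block
    // from the front and one block ending exactly at dst + count, so any count
    // in [kBlockSize, 2 * kBlockSize] is covered without branching on the size.
    template <size_t kBlockSize>
    static void CopyBlockOverlap(char *__restrict dst, const char *__restrict src,
                                 size_t count) {
      CopyBlock<kBlockSize>(dst, src);
      CopyBlock<kBlockSize>(dst + count - kBlockSize, src + count - kBlockSize);
    }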

51

ditto CopyBlockOverlap

54

ditto CopyBlockOverlap

67

ditto CopyBlockOverlap

71–103

I think the code here could be replaced with a call to CopyAlignedBlocks<64> from src/string/memory_utils/memcpy_utils.h, which essentially aligns the destination pointer, copies blocks of 64 bytes, and handles the trailing bytes with an overlapping copy.

If it's important that the alignment is 16B instead of 64B, I can change the implementation, but since the code is heavily tested I'd be in favor of reusing it as much as possible.

Note: I believe that the compiler will generate the same code whether using two CopyBlock<32> or a single CopyBlock<64>.
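
For reference, the helper's logic is roughly the following (a sketch of the behavior described above, relying on CopyBlock from the same header; not the verbatim code):

    // Sketch of CopyAlignedBlocks<kBlockSize> (assumed shape): one unaligned
    // block up front, then kBlockSize-aligned blocks on the destination, then
    // an overlapping copy of the final block ending exactly at dst + count.
    template <size_t kBlockSize>
    static void CopyAlignedBlocks(char *__restrict dst, const char *__restrict src,
                                  size_t count) {
      CopyBlock<kBlockSize>(dst, src); // unaligned first block
      // Distance from dst to the next kBlockSize-aligned address.
      size_t offset = kBlockSize - reinterpret_cast<uintptr_t>(dst) % kBlockSize;
      // Copy whole aligned blocks while a full block still fits before the tail.
      for (; offset + kBlockSize < count; offset += kBlockSize)
        CopyBlock<kBlockSize>(dst + offset, src + offset);
      // Trailing bytes: overlapping copy of the last kBlockSize bytes.
      CopyBlock<kBlockSize>(dst + count - kBlockSize, src + count - kBlockSize);
    }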

82

const

94

const

Hi Guillaume,

Thanks for the review! I left some further questions/comments as replies. I agree with all other changes I didn't reply to. I'll update the patch once I've figured out the pending queries.

libc/src/string/CMakeLists.txt
84

I'm confused by this one. If I don't do this, then a call to __llvm_libc::memcpy in the benchmark will resolve to the 'default' src/string/memcpy.cpp implementation rather than the aarch64 one.

libc/src/string/aarch64/memcpy.cpp
71–103

I am looking at this one. The reason I wrote out this code is that we want to align on the source, not the destination. I will have a look at the implementation of CopyAlignedBlocks; maybe the answer is to further template it to allow switching between src and dst alignment?

I am also looking at a potential codegen difference between 2x CopyBlock<32> and CopyBlock<64>; I will get back to you next week on this.

94

Can't make this const as it is updated in the loop below.

gchatelet added inline comments.Dec 7 2020, 6:47 AM
libc/src/string/CMakeLists.txt
84

Oh, my bad, you're right: this code has changed (it used to work differently).

gchatelet added inline comments.Dec 17 2020, 6:02 AM
libc/src/string/aarch64/memcpy.cpp
71–103

I am looking at this one. The reason I wrote out this code is that we want to align on the source, not the destination. I will have a look at the implementation of CopyAlignedBlocks; maybe the answer is to further template it to allow switching between src and dst alignment?

Rereading this comment, it seems there's a mistake in our code: it should be source-aligned as you suggest, not destination-aligned.
I've sent D93457 to fix it.

avieira added inline comments.Dec 17 2020, 7:29 AM
libc/src/string/aarch64/memcpy.cpp
71–103

I see. I was under the impression that some targets might prefer destination alignment, but I have no problem with changing it to source alignment.

I have been playing around with replacing the 2x CopyBlock<32>'s in this part with CopyBlock<64>. I found that earlier compilers generated better code for 2x32 than for 1x64, but later clang versions seem to generate the same code for both.

I'll benchmark the new CopyAlignedBlocks on Neoverse-N1 and come back to you. A first difference that stands out is that I was using 64-byte copies in the loop and after the loop, but aligning to only 16 bytes, as we found that to be sufficient; this means we only need one 16-byte unaligned load. CopyAlignedBlocks, however, requires the alignment and kBlockSize to be the same. Do you think there might be room to change this?

Another potential point of improvement is to allow for better interleaving of loads and stores: AOR (in assembly) does the loads outside the loop, then stores, increments, and loads inside the loop, with a post-loop store. I'll consider the changes above and continue benchmarking to see if there is much difference between the two.
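
To illustrate the pattern I mean (placeholder Block64/Load64/Store64 helpers; not the actual AOR assembly or the code in this patch):

    #include <cstddef> // size_t

    // Placeholder 64-byte block helpers for the sketch.
    struct Block64 { char bytes[64]; };
    static Block64 Load64(const char *p) {
      Block64 b;
      __builtin_memcpy(b.bytes, p, 64);
      return b;
    }
    static void Store64(char *p, const Block64 &b) { __builtin_memcpy(p, b.bytes, 64); }

    // Software-pipelined loop sketch: loads run one iteration ahead of stores.
    // Requires count >= 64; the trailing partial block is finished with an
    // overlapping 64-byte copy from the end of the buffer.
    static void CopyPipelined64(char *__restrict dst, const char *__restrict src,
                                size_t count) {
      Block64 in = Load64(src);          // load outside the loop
      while (count > 128) {
        Store64(dst, in);                // store the block loaded previously
        dst += 64; src += 64; count -= 64;
        in = Load64(src);                // load for the next iteration
      }
      Store64(dst, in);                  // post-loop store
      // Overlapping tail copy covering the remaining (count - 64) bytes.
      Store64(dst + count - 64, Load64(src + count - 64));
    }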

gchatelet added inline comments.Dec 17 2020, 8:42 AM
libc/src/string/aarch64/memcpy.cpp
71–103

I see. I was under the impression that some targets might prefer destination alignment, but I have no problem with changing it to source alignment.

That might happen and we can add a template parameter to the function if it comes up.

I have been playing around with replacing the 2x CopyBlock<32>'s in this part with CopyBlock<64>. I found that earlier compilers generated better code for 2x32 than for 1x64, but later clang versions seem to generate the same code for both.

👍

I'll benchmark the new CopyAlignedBlocks on Neoverse-N1 and come back to you. A first difference that stands out is that I was using 64-byte copies in the loop and after the loop, but aligning to only 16 bytes, as we found that to be sufficient; this means we only need one 16-byte unaligned load. CopyAlignedBlocks, however, requires the alignment and kBlockSize to be the same. Do you think there might be room to change this?

Definitely, nothing is carved in stone, we can adapt the framework as needed.
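
For instance (a purely hypothetical signature, not what memcpy_utils.h currently provides), the alignment could be decoupled from the block size along these lines:

    // Hypothetical sketch: a separate kAlignment parameter so a 64-byte block
    // loop can align (here, the source) to only 16 bytes as you describe.
    template <size_t kBlockSize, size_t kAlignment = kBlockSize>
    static void CopyAlignedBlocks(char *__restrict dst, const char *__restrict src,
                                  size_t count) {
      static_assert(kAlignment <= kBlockSize,
                    "the first block must cover the alignment fixup");
      CopyBlock<kBlockSize>(dst, src); // one unaligned block up front
      // Advance only far enough to reach a kAlignment boundary on the source.
      size_t offset = kAlignment - reinterpret_cast<uintptr_t>(src) % kAlignment;
      for (; offset + kBlockSize < count; offset += kBlockSize)
        CopyBlock<kBlockSize>(dst + offset, src + offset);
      // Overlapping copy of the final block.
      CopyBlock<kBlockSize>(dst + count - kBlockSize, src + count - kBlockSize);
    }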

Another potential point of improvement is to allow for better interleaving of loads and stores: AOR (in assembly) does the loads outside the loop, then stores, increments, and loads inside the loop, with a post-loop store. I'll consider the changes above and continue benchmarking to see if there is much difference between the two.

Interesting! I wonder if this translates to gains on x86 as well.
I'll run tests on my side too.

BTW the benchmarking framework has been updated in D93210. Let me know if you need help to use it.

avieira updated this revision to Diff 317812.Jan 20 2021, 2:14 AM

Hi,

So here is an updated version of the optimized memcpy routine for AArch64. This one basically uses the same structure as the default memcpy, but picks a different block size and alignment for copies > 128.
I also disable tail merging, as I found it was leading to worse code. This new memcpy seems to show improvements across the board for both the sweep and distribution benchmarks.
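
For the record, the overall shape of the dispatch is roughly the following (illustrative thresholds and helper names; the patch itself has the exact values):

    // Rough sketch of the size dispatch (thresholds shown are illustrative).
    static void memcpy_aarch64(char *__restrict dst, const char *__restrict src,
                               size_t count) {
      if (count == 0) return;
      if (count == 1) return CopyBlock<1>(dst, src);
      if (count == 2) return CopyBlock<2>(dst, src);
      if (count == 3) return CopyBlock<3>(dst, src);
      if (count == 4) return CopyBlock<4>(dst, src);
      if (count < 8) return CopyBlockOverlap<4>(dst, src, count);
      if (count < 16) return CopyBlockOverlap<8>(dst, src, count);
      if (count < 32) return CopyBlockOverlap<16>(dst, src, count);
      if (count < 64) return CopyBlockOverlap<32>(dst, src, count);
      if (count < 128) return CopyBlockOverlap<64>(dst, src, count);
      // Copies > 128 bytes: larger blocks / different alignment than the
      // generic implementation (block size and alignment here are placeholders).
      return CopyAlignedBlocks<64>(dst, src, count);
    }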

I am continuing to investigate a better organization of the copies smaller than 128 bytes, as I had before, using the new benchmarks. With the same code I had before, I see an improvement in Uniform1024 (a new uniform distribution I added for sizes 0-1024) and in memcpy distributions A, M, Q and U, but a regression in B, L, S and W. For distribution D the optimized version beats the older version but shows a regression compared to the version in this patch.

I'll spend a few extra cycles trying to see if I can find a sweet spot, but I might leave it like this.

Is this OK for main?

Also I have two patches downstream for:

  1. Uniform1024 distribution, a uniform distribution for sizes 0-1024
  2. Options to define a Sweep 'min size' and 'step'.

Let me know if you are interested in either of these.

Kind regards,
Andre

gchatelet accepted this revision.Jan 29 2021, 8:17 AM

Hey Andre, my apologies for the lag. Answers inlined below.

So here is an updated version of the optimized memcpy routine for AArch64. This one basically uses the same structure as the default memcpy, but picks a different block size and alignment for copies > 128.

👍

I also disable tail merging, as I found it was leading to worse code. This new memcpy seems to show improvements across the board for both the sweep and distribution benchmarks.

Interesting! I never played with it, but it might be worth exploring for all implementations as well. Thanks for the feedback, really appreciate it.

I am continuing to investigate a better organization of the copies smaller than 128 bytes, as I had before, using the new benchmarks. With the same code I had before, I see an improvement in Uniform1024 (a new uniform distribution I added for sizes 0-1024) and in memcpy distributions A, M, Q and U, but a regression in B, L, S and W. For distribution D the optimized version beats the older version but shows a regression compared to the version in this patch.

Yeah, the lettered distributions are likely to swing, since they refer to individual applications that prefer certain sizes. I would like to have an even bigger bucket (1K to 4K?), but then some processors may run out of L1, and hitting L2 would bias the benchmark...
Which processors did you benchmark it on? How likely is it to be representative of all aarch64 CPUs? Can you leave this information in the implementation file for the record?

I'll spend a few extra cycles trying to see if I can find a sweet spot, but I might leave it like this.

Fine by me.

Is this OK for main?

Definitely looks good to me!

Also I have two patches downstream for:

  1. Uniform1024 distribution, a uniform distribution for sizes 0-1024
  2. Options to define a Sweep 'min size' and 'step'.

Let me know if you are interested in either of these.

Yes, why not.
For 2) I believe you had to store more information in the JSON file and adapt the Python script as well, right?

libc/src/string/aarch64/memcpy.cpp
35

Can you add a quick note on which processor(s) the implementation has been tuned for and how representative it is of other aarch64 CPUs? (If you happen to know.)

57

Can you format the file? There's a space missing right after the comma in the template instantiation.

This revision is now accepted and ready to land.Jan 29 2021, 8:17 AM
avieira closed this revision.Feb 4 2021, 4:17 AM
avieira marked an inline comment as done.

I've committed the patch in https://reviews.llvm.org/rG369f7de3135a517a69c45084d4b175f7b0d5e6f5, but it seems that when copying the revision link I may have hit ctrl + x in vim and linked it to D92235 :(