
[LIBC] Add optimized memcpy routine for AArch64
ClosedPublic

Authored by avieira on Nov 27 2020, 9:12 AM.

Details

Reviewers
sivachandra
dxf
gchatelet
Group Reviewers
Restricted Project
Summary

Hi all,

This is an optimized memcpy routine for AArch64 using lessons learned from Arm's Optimized Routines (AOR) memcpy (the AArch64 Advanced SIMD one, to be precise).
I benchmarked this on a Neoverse-N1 and it beats the current default implementation for both small and big configurations.

I am not entirely familiar with how the compile-time libc implementation selection works, so as you can see from this patch I've enabled this for AArch64 as a whole, though the community may want to choose different implementations for other AArch64 cores. I believe the non-Advanced SIMD variant of AOR's memcpy has been found to work better for earlier AArch64 cores, for instance.

I'm curious to hear how the maintainers/community feel about this.

Kind Regards,
Andre

Diff Detail

Event Timeline

avieira created this revision.Nov 27 2020, 9:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 27 2020, 9:12 AM
avieira requested review of this revision.Nov 27 2020, 9:12 AM

Thanks for the patch. I am adding @gchatelet as a reviewer as he is the current owner/maintainer of the mem* functions.

My visual scan says the structuring is alright. Also, if libc builds with this patch applied and the tests pass, I guess that is proof enough that the structuring is alright. But, I will let @gchatelet drive the code review here.

gchatelet added inline comments.Dec 1 2020, 4:20 AM
libc/src/string/CMakeLists.txt
218

This change is not needed: it will be handled by the else() clause.
We have a special case for x86 to be able to support 32 and 64 bits architectures with the same code.

libc/src/string/aarch64/memcpy.cpp
39

Please use the likely and unlikely macros from src/__support/common.h.def.
Here and below.
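For readers following along, the likely/unlikely macros referred to above are thin wrappers around `__builtin_expect`. The sketch below uses hypothetical stand-in names rather than the actual macro names from src/__support/common.h.def:

```cpp
#include <cstddef>

// Hypothetical stand-ins for the branch-hint macros in
// src/__support/common.h.def; both wrap __builtin_expect.
#define LIBC_LIKELY(x) __builtin_expect(!!(x), 1)
#define LIBC_UNLIKELY(x) __builtin_expect(!!(x), 0)

// Example use in a size-dispatching routine: hint the compiler about
// which size classes dominate so it lays out the hot path first.
int size_class(size_t count) {
  if (LIBC_UNLIKELY(count == 0))
    return 0; // empty copies are rare
  if (LIBC_LIKELY(count <= 16))
    return 1; // small copies dominate typical workloads
  return 2;
}
```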

41–42

Pairs of CopyBlock used to produce an overlapping copy can use CopyBlockOverlap from src/string/memory_utils/memcpy_utils.h instead.
Simply replace these two lines with CopyBlockOverlap<32>(dst, src, count)

52

ditto CopyBlockOverlap

55

ditto CopyBlockOverlap

68

ditto CopyBlockOverlap

72–104

I think the code here could be replaced with a call to CopyAlignedBlocks<64> from src/string/memory_utils/memcpy_utils.h, which essentially aligns the destination pointer, copies blocks of 64 bytes, and handles the trailing bytes with an overlapping copy.

If it's important that the alignment is 16B instead of 64B I can change the implementation but since the code is heavily tested I'd be in favor of reusing it as much as possible.

Note: I believe that the compiler will generate the same code whether using two CopyBlock<32> or a single CopyBlock<64>.

83

const

95

const

Hi Guillaume,

Thanks for the review! I left some further questions/comments as replies. I agree with all other changes I didn't reply to. I'll update the patch once I've figured out the pending queries.

libc/src/string/CMakeLists.txt
218

I'm confused by this one. If I don't do this, then a call to __llvm_libc::memcpy in the benchmark will lead to the default src/string/memcpy.cpp implementation rather than the aarch64 one.

libc/src/string/aarch64/memcpy.cpp
72–104

I am looking at this one; the reason I wrote out this code was that we want to align for source, not destination. I will have a look at the implementation of CopyAlignedBlocks; maybe the answer is to template it further to allow a switch between src and dst alignment?

I am also looking at a potential codegen difference between 2x CopyBlock<32> and CopyBlock<64>; I will get back to you next week on this.

95

Can't make this const as it is updated in the loop below.

gchatelet added inline comments.Dec 7 2020, 6:47 AM
libc/src/string/CMakeLists.txt
218

Oh, my bad, you're right: this code has changed (it used to work differently).

gchatelet added inline comments.Dec 17 2020, 6:02 AM
libc/src/string/aarch64/memcpy.cpp
72–104

I am looking at this one, the reason I wrote out this code was because we want to align for source, not destination. I will have a look at the implementation of CopyAlignedBlocks, maybe the answer is to further template this to allow a switch between src and dst alignment?

Rereading this comment, it seems there's a mistake in our code: it should be source-aligned, as you suggest, and not destination-aligned.
I've sent D93457 to fix it.

avieira added inline comments.Dec 17 2020, 7:29 AM
libc/src/string/aarch64/memcpy.cpp
72–104

I see; I was under the impression that some targets might prefer destination alignment, but I have no problem with changing it to source alignment.

I have been playing around with replacing the 2x CopyBlock<32>'s in this part with CopyBlock<64>. I found that earlier compilers generated better code for 2x32 vs 1x64 but it seems a later clang generates the same for both.

I'll benchmark the new CopyAlignedBlocks on Neoverse-N1 and come back to you. A first difference that pops out is that I was using 64-byte copies in the loop and after the loop, but aligning to 16 bytes, as we found that to be sufficient; this means we only need to do one 16-byte unaligned load. CopyAlignedBlocks, by contrast, requires the alignment and kBlockSize to be the same. Do you think there might be room to change this?

Another potential point of improvement is to allow for better interleaving of loads and stores, AOR (in assembly) does the loads outside the loop, then stores, increments and loads inside the loop, with a post-loop store. I'll consider the changes above and continue to benchmark to see if there is much difference between the two.
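The AOR loop shape described above (loads hoisted out of the loop, stores of the previous block interleaved with loads of the next, plus a post-loop store) can be sketched in C++ roughly like this. The real AOR code is hand-written assembly; this sketch assumes the count is a nonzero multiple of the block size.

```cpp
#include <cstddef>
#include <cstring>

// Software-pipelined copy sketch: load block i+1 before storing block
// i, so loads and stores interleave across iterations.
// Precondition: count is a nonzero multiple of N.
constexpr size_t kN = 16;

void copy_pipelined(char *dst, const char *src, size_t count) {
  char cur[kN];
  std::memcpy(cur, src, kN); // pre-loop load of the first block
  size_t offset = 0;
  while (offset + kN < count) {
    char next[kN];
    std::memcpy(next, src + offset + kN, kN); // load the next block
    std::memcpy(dst + offset, cur, kN);       // store the previous block
    std::memcpy(cur, next, kN);
    offset += kN;
  }
  std::memcpy(dst + offset, cur, kN); // post-loop store of the last block
}
```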

gchatelet added inline comments.Dec 17 2020, 8:42 AM
libc/src/string/aarch64/memcpy.cpp
72–104

I see, I was under the impression some targets might prefer destination aligning, but I have no problems with changing it to source aligning.

That might happen and we can add a template parameter to the function if it comes up.

I have been playing around with replacing the 2x CopyBlock<32>'s in this part with CopyBlock<64>. I found that earlier compilers generated better code for 2x32 vs 1x64 but it seems a later clang generates the same for both.

👍

I'll benchmark the new CopyAlignedBlocks on Neoverse-N1 and come back to you. A first difference that pops out is that I was using 64-byte copies in the loop and after loop, but aligning to 16-bytes as we found that to be sufficient. This means we only need to do one 16-byte unaligned load. Whereas CopyAlignedBlocks requires the alignment and kBlockSize to be the same. Do you think there might be room to change this?

Definitely, nothing is carved in stone, we can adapt the framework as needed.

Another potential point of improvement is to allow for better interleaving of loads and stores, AOR (in assembly) does the loads outside the loop, then stores, increments and loads inside the loop, with a post-loop store. I'll consider the changes above and continue to benchmark to see if there is much difference between the two.

Interesting! I wonder if this translates to gains on x86 as well.
I'll make tests as well on my side.

BTW the benchmarking framework has been updated in D93210. Let me know if you need help to use it.

avieira updated this revision to Diff 317812.Jan 20 2021, 2:14 AM

Hi,

So here is an updated version of the optimized memcpy routine for AArch64. This one basically uses the same structure as the default memcpy, but picks a different block size and alignment for copies > 128 bytes.
I also disabled tail merging, as I found it was leading to worse code. This new memcpy seems to show improvements across the board for both sweep and distribution benchmarks.

I am continuing to investigate a better organization of the copies smaller than 128 bytes, as I had before, using the new benchmarks. Using the same code I had before, I see an improvement in Uniform1024 (a new uniform distribution I added for sizes 0-1024). I also see an improvement in Memcpy Distributions A, M, Q and U, but a regression in B, L, S and W. For distribution D the optimized version beats the older version but shows a regression compared to the version in this patch.

I'll spend a few extra cycles trying to see if I can find a sweet spot, but I might leave it like this.

Is this OK for main?

Also I have two patches downstream for:

  1. Uniform1024 distribution, a uniform distribution for sizes 0-1024
  2. Options to define a Sweep 'min size' and 'step'.

Let me know if you are interested in either of these.

Kind regards,
Andre

gchatelet accepted this revision.Fri, Jan 29, 8:17 AM

Hey Andre, my apologies for the lag. Answers are inline.

So here is an updated version for an optimized memcpy routine for AArch64. This one basically uses the same as the default memcpy, but picks a different block size and alignment for copies > 128.

👍

I also disable tail merging as I found it was leading to worse code. This new memcpy seems to show improvements across the board for both sweep and distribution benchmarks.

Interesting! I never played with it but it might be interesting for all implementations as well. Thx for the feedback, really appreciate it.

I am continuing to investigate a better organization of the copies smaller than 128 bytes, as I had before, using the new benchmarks. Using the same code I had before, I see an improvement in Uniform1024 (a new uniform distribution I added for sizes 0-1024). I also see an improvement in Memcpy Distributions A, M, Q and U, but a regression in B, L, S and W. For distribution D the optimized version beats the older version but shows a regression compared to the version in this patch.

Yeah, the lettered distributions are likely to swing, since they refer to individual applications that prefer certain sizes. I would like to have an even bigger bucket (1K to 4K?) but then some processors may run out of L1, and hitting L2 would bias the benchmark...
Which processors did you benchmark this on? How likely is it to be representative of all AArch64 CPUs? Can you leave this information in the implementation file for the record?

I'll spend a few extra cycles trying to see if I can find a sweet spot, but I might leave it like this.

Fine by me.

Is this OK for main?

Definitely looks good to me!

Also I have two patches downstream for:

  1. Uniform1024 distribution, a uniform distribution for sizes 0-1024
  2. Options to define a Sweep 'min size' and 'step'.

Let me know if you are interested in either of these.

Yes why not.
For 2) I believe you had to store more information in the JSON file and adapt the Python script as well, right?

libc/src/string/aarch64/memcpy.cpp
34

Can you add a quick note on which processor(s) the implementation has been tuned for and how representative it is of other AArch64 CPUs? (If you happen to know.)

56

Can you format the file? There's a space missing right after the comma in the template instantiation.

This revision is now accepted and ready to land.Fri, Jan 29, 8:17 AM
avieira closed this revision.Thu, Feb 4, 4:17 AM
avieira marked an inline comment as done.

I've committed the patch in https://reviews.llvm.org/rG369f7de3135a517a69c45084d4b175f7b0d5e6f5, but it seems that when copying the revision link I may have hit Ctrl + X in Vim and linked it to D92235 :(