
[libc] Optimized version of memmove
Needs Review · Public

Authored by gchatelet on Nov 26 2021, 6:09 AM.

Details

Summary

This implementation keeps the data entirely in registers for sizes up to 128 B.
For larger sizes, depending on whether dst is less than (resp. greater than) src, we move the data forward (resp. backward) in chunks of 32 B.
We first align one of the pointers to improve performance on large move sizes.
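
For readers unfamiliar with the strategy, here is a minimal sketch of the dispatch described above. The helper names are made up, memcpy through a small temporary stands in for register loads and stores, and the alignment step is omitted; the actual patch is built on the libc elements framework rather than raw loops like these.

```cpp
#include <cstddef>
#include <cstring>

// Small sizes: load everything into a temporary (i.e. registers / stack)
// before storing anything, so overlap between dst and src does not matter.
static void move_small(char *dst, const char *src, std::size_t count) {
  char tmp[128];
  std::memcpy(tmp, src, count);
  std::memcpy(dst, tmp, count);
}

// Forward copy in 32 B chunks, each chunk staged through a temporary so the
// load completes before the store (safe when dst < src).
static void move_forward(char *dst, const char *src, std::size_t count) {
  std::size_t offset = 0;
  for (; offset + 32 <= count; offset += 32) {
    char chunk[32];
    std::memcpy(chunk, src + offset, 32);
    std::memcpy(dst + offset, chunk, 32);
  }
  move_small(dst + offset, src + offset, count - offset); // tail < 32 B
}

// Backward copy in 32 B chunks, starting from the end (safe when dst > src).
static void move_backward(char *dst, const char *src, std::size_t count) {
  std::size_t offset = count;
  for (; offset >= 32; offset -= 32) {
    char chunk[32];
    std::memcpy(chunk, src + offset - 32, 32);
    std::memcpy(dst + offset - 32, chunk, 32);
  }
  move_small(dst, src, offset); // head < 32 B
}

void *memmove_sketch(void *dst_v, const void *src_v, std::size_t count) {
  char *dst = static_cast<char *>(dst_v);
  const char *src = static_cast<const char *>(src_v);
  if (count <= 128)
    move_small(dst, src, count);
  else if (dst < src)
    move_forward(dst, src, count);
  else if (dst > src)
    move_backward(dst, src, count);
  return dst_v;
}
```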

Diff Detail

Event Timeline

gchatelet created this revision. Nov 26 2021, 6:09 AM
gchatelet requested review of this revision. Nov 26 2021, 6:09 AM
sivachandra added inline comments. Dec 1 2021, 11:08 AM
libc/test/src/string/memmove_test.cpp
15

Including cstdio and functional from tests is not allowed.

23

For things like this, you should use a matcher. Look at this for example: https://github.com/llvm/llvm-project/blob/main/libc/utils/UnitTest/FPMatcher.h#L26.

You can probably implement a matcher which can be used like this: EXPECT_MEM_EQ(mem1, mem2, size)
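
For illustration, a self-contained sketch of the suggested EXPECT_MEM_EQ shape. This is not the actual libc UnitTest API (whose Matcher base class, reporting, and macros differ); it only shows the idea of wrapping a memcmp in a matcher.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative stand-in for a memory-equality matcher; the real libc
// unit-test framework has its own Matcher base class and error reporting.
struct MemoryMatcher {
  const void *expected;
  std::size_t size;

  bool match(const void *actual) const {
    return std::memcmp(actual, expected, size) == 0;
  }
};

// Hypothetical macro shape mirroring EXPECT_MEM_EQ(mem1, mem2, size); a real
// framework would record a test failure instead of asserting.
#define EXPECT_MEM_EQ(mem1, mem2, size)                                        \
  assert(MemoryMatcher{(mem1), (size)}.match(mem2))

// Usage in a test body:
//   EXPECT_MEM_EQ(expected_buffer, dst_buffer, count);
```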

Hi @gchatelet,

I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it was stacking the loads and stores in opposite orders, like:
Load (src)
Load (src + 8)
Load (src + 16)
Load (src + 32)
Store (dst + 32)
Store (dst + 16)
...

I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores, from 0 to ElementCount, solved it and gave me a speed-up on larger memmoves.
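
A rough sketch of the change being described, with hypothetical Block/Repeated types that only mimic the shape of the elements framework (the real Move signatures differ); the point is that all loads are issued in ascending order before any store.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for an "element": a fixed-size block staged through
// a temporary (i.e. registers) so that overlapping dst/src is safe.
template <std::size_t Size> struct Block {
  char bytes[Size];
  static Block Load(const char *src) {
    Block b;
    std::memcpy(b.bytes, src, Size);
    return b;
  }
  static void Store(const Block &b, char *dst) {
    std::memcpy(dst, b.bytes, Size);
  }
};

// Sketch of the reworked Repeated<Element, ElementCount>::Move: a loop of
// loads followed by a loop of stores, instead of stacking Load/Store pairs
// in opposite orders.
template <typename Element, std::size_t ElementCount> struct Repeated {
  static void Move(char *dst, const char *src) {
    Element values[ElementCount];
    for (std::size_t i = 0; i < ElementCount; ++i)
      values[i] = Element::Load(src + i * sizeof(Element));
    for (std::size_t i = 0; i < ElementCount; ++i)
      Element::Store(values[i], dst + i * sizeof(Element));
  }
};

// e.g. a 64-byte move built from eight 8-byte blocks:
//   Repeated<Block<8>, 8>::Move(dst, src);
```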

One of the things I am looking at now is weaving the loads and stores, as we've seen improvements on some of our cores from interleaved series of LDP-STPs (64-byte loads and stores). Obviously that means that for backward copies I need to do the end ones first, so I'd probably only do it for the loop and use aarch64-only elements: one with a forward-weaving _64::Move and one backward.
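
For illustration, a rough sketch of what forward weaving could look like at the C++ level; the sub-block size and names are made up, and whether this actually lowers to interleaved LDP/STP pairs depends on the compiler.

```cpp
#include <cstddef>
#include <cstring>

// Forward-weaving 64 B move: each 16 B sub-block is loaded and immediately
// stored, instead of loading all 64 B before the first store. This order is
// only safe for forward copies (dst < src); a backward variant would walk
// the sub-blocks from offset 48 down to 0.
static void move_64_forward_weaved(char *dst, const char *src) {
  for (std::size_t offset = 0; offset < 64; offset += 16) {
    char tmp[16];
    std::memcpy(tmp, src + offset, 16); // intended to become a paired load
    std::memcpy(dst + offset, tmp, 16); // immediately followed by its store
  }
}
```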

gchatelet updated this revision to Diff 398576. Mon, Jan 10, 4:43 AM
  • rebase and conform to new style

> Hi @gchatelet,
>
> I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it was stacking the loads and stores in opposite orders, like:
> Load (src)
> Load (src + 8)
> Load (src + 16)
> Load (src + 32)
> Store (dst + 32)
> Store (dst + 16)

Do you have an idea of why this is yielding suboptimal results?
In the code I generated for x86-64, using this pyramid-shaped offset pattern reduced the number of instructions (the compiler could outline the last store across different functions).
I'm not sure this translated into better performance though, only a slightly smaller function size.

> I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores, from 0 to ElementCount, solved it and gave me a speed-up on larger memmoves.

Could you share the resulting asm?

libc/test/src/string/memmove_test.cpp
23

I've been lazy :D Thx for pointing this out.

gchatelet updated this revision to Diff 398870. Tue, Jan 11, 1:36 AM
  • Add MemoryMatcher util and simplify tests
gchatelet updated this revision to Diff 398872. Tue, Jan 11, 1:37 AM
  • Fix typo