This implementation relies on storing data in registers for sizes up to 128B.
For larger sizes, depending on whether dst is less than (resp. greater than) src, we move the data forward (resp. backward) in chunks of 32B.
We first make sure one of the pointers is aligned to improve performance on large move sizes.
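As a rough sketch of that dispatch (illustrative names only; the real code goes through the elements framework, and the pointer-alignment step is omitted here for brevity):

#include <cstddef>
#include <cstring>

// One 32-byte block held in a temporary; the real code keeps it in registers.
struct Block32 {
  char bytes[32];
  static Block32 load(const char *src) {
    Block32 b;
    std::memcpy(b.bytes, src, sizeof(b.bytes));
    return b;
  }
  static void store(char *dst, const Block32 &b) {
    std::memcpy(dst, b.bytes, sizeof(b.bytes));
  }
};

static void memmove_sketch(char *dst, const char *src, size_t count) {
  if (count <= 128) {
    // Small sizes: load everything first, then store, so overlap cannot
    // corrupt the copy.
    char tmp[128];
    std::memcpy(tmp, src, count);
    std::memcpy(dst, tmp, count);
    return;
  }
  if (dst < src) {
    // Forward copy: keep the (possibly overlapping) tail in registers, loop
    // over 32B chunks from the start, then store the tail.
    const Block32 tail = Block32::load(src + count - 32);
    size_t offset = 0;
    do {
      Block32::store(dst + offset, Block32::load(src + offset));
      offset += 32;
    } while (offset < count - 32);
    Block32::store(dst + count - 32, tail);
  } else {
    // Backward copy: keep the head in registers, loop over 32B chunks from
    // the end, then store the head.
    const Block32 head = Block32::load(src);
    size_t offset = count;
    do {
      offset -= 32;
      Block32::store(dst + offset, Block32::load(src + offset));
    } while (offset > 32);
    Block32::store(dst, head);
  }
}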
Details
- Reviewers: sivachandra, avieira
- Commits: rG83f9b13d8cc2: [libc] Optimized version of memmove
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
libc/test/src/string/memmove_test.cpp
- Line 15: Including cstdio and functional from tests is not allowed.
- Line 23: For things like this, you should use a matcher. Look at this for example: https://github.com/llvm/llvm-project/blob/main/libc/utils/UnitTest/FPMatcher.h#L26. You can probably implement a matcher which can be used like this: EXPECT_MEM_EQ(mem1, mem2, size)
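At its core, such a matcher just needs to compare the two regions and remember where they first differ so the failure can be explained. A rough, framework-agnostic sketch (the actual UnitTest Matcher interface is not reproduced here):

#include <cstddef>

// Hypothetical memory-equality check; the real matcher would plug into the
// libc UnitTest Matcher machinery and report through its error stream.
struct MemEqMatcher {
  const char *expected;
  size_t size;
  size_t mismatch_index = 0; // first differing byte, valid when match() fails

  bool match(const char *actual) {
    for (size_t i = 0; i < size; ++i) {
      if (actual[i] != expected[i]) {
        mismatch_index = i; // remember where the buffers diverge
        return false;
      }
    }
    return true;
  }
};

// A hypothetical EXPECT_MEM_EQ(mem1, mem2, size) would wrap this comparison.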
Hi @gchatelet,
I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it stacked the loads and stores in opposite order, like:
Load (src)
Load (src + 8)
Load (src + 16)
Load (src + 32)
Store (dst + 32)
Store (dst + 16)
...
I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores from 0 to ElementCount solved it and gave me a speed up on larger memmoves.
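Roughly, the two-loop version looks like this (a sketch with made-up names, not the exact elements.h interface):

#include <cstddef>
#include <cstring>

// Illustrative 32-byte element; the real code uses the elements framework.
struct Element32 {
  static constexpr size_t SIZE = 32;
  struct Value { char bytes[SIZE]; };
  static Value load(const char *src) {
    Value v;
    std::memcpy(v.bytes, src, SIZE);
    return v;
  }
  static void store(char *dst, const Value &v) {
    std::memcpy(dst, v.bytes, SIZE);
  }
};

template <typename Element, size_t ElementCount> struct RepeatedSketch {
  static void move(char *dst, const char *src) {
    typename Element::Value values[ElementCount];
    // First loop: all the loads, in increasing address order.
    for (size_t i = 0; i < ElementCount; ++i)
      values[i] = Element::load(src + i * Element::SIZE);
    // Second loop: all the stores, also in increasing address order, which
    // lets the compiler emit the stores front to back.
    for (size_t i = 0; i < ElementCount; ++i)
      Element::store(dst + i * Element::SIZE, values[i]);
  }
};

// e.g. a 64-byte move would be RepeatedSketch<Element32, 2>::move(dst, src);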
One of the things I am looking at now is weaving the loads and stores, as we've seen improvements on some of our cores using interleaved series of LDP-STPs (64-byte loads and stores). Obviously that means that for backward copies I need to do the end ones first, so I'd probably only do it for the loop and use aarch64-only elements: one with a forwards-weaving _64::Move and one backwards.
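Something like this, as a sketch (made-up names; the actual ldp/stp pairing is left to the compiler):

#include <cstring>

// One 32-byte copy, i.e. one ldp followed by one stp of two q registers.
static inline void copy32(char *dst, const char *src) {
  char tmp[32];
  std::memcpy(tmp, src, sizeof(tmp));
  std::memcpy(dst, tmp, sizeof(tmp));
}

// Forward-weaving 64-byte move: low half first, for use when dst < src.
static inline void move64_forward(char *dst, const char *src) {
  copy32(dst, src);           // bytes [0, 32)
  copy32(dst + 32, src + 32); // bytes [32, 64)
}

// Backward-weaving 64-byte move: high half first, for use when dst > src.
static inline void move64_backward(char *dst, const char *src) {
  copy32(dst + 32, src + 32); // bytes [32, 64)
  copy32(dst, src);           // bytes [0, 32)
}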
Do you have an idea of why the original stacking order was yielding suboptimal results?
In the code I generated for x86-64, using this pyramid-shaped offset pattern reduced the number of instructions (the compiler could outline the last store across different functions).
I'm not sure this translated into better performance though, only a slightly smaller function size.
> I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores from 0 to ElementCount solved it and gave me a speed up on larger memmoves.
Could you share the resulting asm?
libc/test/src/string/memmove_test.cpp
- Line 23: I've been lazy :D Thx for pointing this out.
- rebase
- squash
- tune the implementation a bit further (fewer cases, smaller alignment, and a larger loop size)
@avieira Shall I transform
static void move(char *dst, const char *src) {
  const auto value = Element::load(src);
  Repeated<Element, ElementCount - 1>::move(dst + Element::SIZE, src + Element::SIZE);
  Element::store(dst, value);
}
into
static void move(char *dst, const char *src) {
  const auto value = load(src);
  store(dst, value);
}
before submitting?
As discussed offline, this seems to help on aarch64 but doesn't seem to affect x86.
Sorry I hadn't seen this earlier; the notification must have fallen through the cracks. I'll share here the same thing I shared with you. I won't share the full memmove function, as that is a lot of code, but the difference in codegen between before and after your change is that, in the forward and backward loops, the stores go from:
40fb14: ad011da6 stp q6, q7, [x13, #32]
40fb18: ad0015a4 stp q4, q5, [x13]
to:
40fb14: ad0015a4 stp q4, q5, [x13]
40fb18: ad011da6 stp q6, q7, [x13, #32]
And the latter is preferred on AArch64.