This is an archive of the discontinued LLVM Phabricator instance.

[libc] Optimized version of memmove
ClosedPublic

Authored by gchatelet on Nov 26 2021, 6:09 AM.

Details

Summary

This implementation relies on storing data in registers for sizes up to 128B.
Then, depending on whether dst is less than (resp. greater than) src, we move data forward (resp. backward) in chunks of 32B.
We first make sure one of the pointers is aligned to increase performance on large move sizes.
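
For illustration, the strategy can be sketched as follows. This is not the code under review: the helper names and the plain byte buffers standing in for registers are assumptions of this sketch, and the alignment step mentioned above is omitted for brevity.

#include <stddef.h> // size_t
#include <string.h> // memcpy

// Copies 32 bytes through a temporary, i.e. "load then store".
// In the real implementation this is a register-sized element type.
static void copy32(char *dst, const char *src) {
  char tmp[32];
  memcpy(tmp, src, 32);
  memcpy(dst, tmp, 32);
}

static void memmove_sketch(char *dst, const char *src, size_t count) {
  if (count <= 128) {
    // Small sizes: load the whole range, then store it; overlap is handled
    // for free because nothing is written before everything is read.
    char tmp[128];
    memcpy(tmp, src, count);
    memcpy(dst, tmp, count);
  } else if (dst < src) {
    // Forward move in 32B chunks. The trailing chunk is loaded up front so
    // the loop cannot clobber its source bytes on closely overlapping ranges.
    char tail[32];
    memcpy(tail, src + count - 32, 32);
    for (size_t offset = 0; offset < count - 32; offset += 32)
      copy32(dst + offset, src + offset);
    memcpy(dst + count - 32, tail, 32);
  } else {
    // Backward move in 32B chunks, mirroring the forward case: the leading
    // chunk is loaded up front and stored last.
    char head[32];
    memcpy(head, src, 32);
    size_t offset = count - 32;
    for (;;) {
      copy32(dst + offset, src + offset);
      if (offset <= 32)
        break;
      offset -= 32;
    }
    memcpy(dst, head, 32);
  }
}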

Diff Detail

Event Timeline

gchatelet created this revision. Nov 26 2021, 6:09 AM
gchatelet requested review of this revision. Nov 26 2021, 6:09 AM
sivachandra added inline comments. Dec 1 2021, 11:08 AM
libc/test/src/string/memmove_test.cpp
15

Including cstdio and functional from tests is not allowed.

23

For things like this, you should use a matcher. Look at this for example: https://github.com/llvm/llvm-project/blob/main/libc/utils/UnitTest/FPMatcher.h#L26.

You can probably implement a matcher which can be used like this: EXPECT_MEM_EQ(mem1, mem2, size)
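
For reference, a matcher along those lines could look roughly like this. This is only a standalone sketch: the class and macro names are hypothetical, and assert stands in for the libc unit test framework plumbing that FPMatcher uses.

#include <assert.h> // stands in for the test framework's failure reporting
#include <stddef.h> // size_t
#include <string.h> // memcmp

// Hypothetical matcher comparing two memory regions of the same size; in the
// real framework this would derive from the test Matcher base class and
// explain the mismatch, as FPMatcher does for floating point values.
struct MemoryMatcher {
  const void *expected;
  size_t size;
  bool match(const void *actual) const {
    return memcmp(actual, expected, size) == 0;
  }
};

// Hypothetical macro mirroring the suggested EXPECT_MEM_EQ(mem1, mem2, size).
#define EXPECT_MEM_EQ(expected, actual, size)                                  \
  assert((MemoryMatcher{(expected), (size)}.match((actual))))

In the test this would then read e.g. EXPECT_MEM_EQ(expected, buffer, count), with the matcher wired into the libc unit test framework the way FPMatcher is.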

Hi @gchatelet,

I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it stacked the loads and stores in opposite order, like:
Load (src)
Load (src + 8)
Load (src + 16)
Load (src + 24)
Store (dst + 24)
Store (dst + 16)
...

I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores from 0 to ElementCount solved it and gave me a speed up on larger memmoves.
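
Something along these lines, mirroring the Repeated<Element, ElementCount> shape from this patch (a sketch only; the decltype-based temporary array is just one way to spell the "registers"):

#include <stddef.h> // size_t

// Sketch of the loop-based Repeated move described above; Element is any of
// the fixed-size element types from this patch exposing SIZE, load and store.
template <typename Element, size_t ElementCount> struct RepeatedSketch {
  static void move(char *dst, const char *src) {
    using Value = decltype(Element::load(src));
    Value values[ElementCount];
    for (size_t i = 0; i < ElementCount; ++i) // all loads first...
      values[i] = Element::load(src + i * Element::SIZE);
    for (size_t i = 0; i < ElementCount; ++i) // ...then all stores, ascending
      Element::store(dst + i * Element::SIZE, values[i]);
  }
};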

One of the things I am looking at now is weaving the loads and stores, as we've seen some improvements on some of our cores using interleaved series of LDP-STPs (64-byte loads and stores). Obviously that means that for backwards copies I need to do the end ones first, so I'd probably only do it for the loop and use aarch64-only elements: one with a forwards-weaving _64::Move and one backwards.
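
(For illustration, a forward-weaving 64B move could be sketched as below; the _32 element with load/store is assumed from this patch's elements framework and the struct name is hypothetical. The interleaving is only safe when dst is below src, which is why the backward direction needs a mirrored variant.)

// Hypothetical forward-weaving 64B move: loads and stores are interleaved so
// the backend can emit LDP/STP pairs back to back. This is only safe for
// forward moves (dst < src): storing the low half can then never clobber the
// source bytes of the high half.
struct _64_ForwardWeave {
  static void move(char *dst, const char *src) {
    const auto lo = _32::load(src);
    _32::store(dst, lo);
    const auto hi = _32::load(src + 32);
    _32::store(dst + 32, hi);
  }
};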

gchatelet updated this revision to Diff 398576. Jan 10 2022, 4:43 AM
  • rebase and conform to new style

Hi @gchatelet,

I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it stacked the loads and stores in opposite order, like:
Load (src)
Load (src + 8)
Load (src + 16)
Load (src + 24)
Store (dst + 24)
Store (dst + 16)
...

Do you have an idea of why this is yielding suboptimal results?
In the code I generated for x86-64, using this pyramid-shaped offset pattern reduced the number of instructions (the compiler could outline the last store across different functions).
I'm not sure this translated into better performance though, only a slightly smaller function size.

I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores from 0 to ElementCount solved it and gave me a speed up on larger memmoves.

Could you share the resulting asm?

libc/test/src/string/memmove_test.cpp
23

I've been lazy :D Thx for pointing this out.

gchatelet updated this revision to Diff 398870. Jan 11 2022, 1:36 AM
  • Add MemoryMatcher util and simplify tests
gchatelet updated this revision to Diff 398872. Jan 11 2022, 1:37 AM
  • Fix typo
gchatelet updated this revision to Diff 405235. Feb 2 2022, 5:50 AM
gchatelet marked 2 inline comments as done.
  • rebase
gchatelet updated this revision to Diff 405285. Feb 2 2022, 8:00 AM
  • rebase
  • squash
  • tune the implementation a bit further (less cases, smaller alignment and larger loop size)

@avieira Shall I transform

static void move(char *dst, const char *src) {
  const auto value = Element::load(src);
  Repeated<Element, ElementCount - 1>::move(dst + Element::SIZE, src + Element::SIZE);
  Element::store(dst, value);
}

into

static void move(char *dst, const char *src) {
  const auto value = load(src);
  store(dst, value);
}

before submitting?
As discussed offline, this seems to help on aarch64 but doesn't seem to affect x86.

@avieira Shall I transform

static void move(char *dst, const char *src) {
  const auto value = Element::load(src);
  Repeated<Element, ElementCount - 1>::move(dst + Element::SIZE, src + Element::SIZE);
  Element::store(dst, value);
}

into

static void move(char *dst, const char *src) {
  const auto value = load(src);
  store(dst, value);
}

before submitting?
As discussed offline, this seems to help on aarch64 but doesn't seem to affect x86.

Yes please!

Hi @gchatelet,

I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it stacked the loads and stores in opposite order, like:
Load (src)
Load (src + 8)
Load (src + 16)
Load (src + 24)
Store (dst + 24)
Store (dst + 16)
...

Do you have an idea of why this is yielding suboptimal results?
In the code I generated for x86-64, using this pyramid-shaped offset pattern reduced the number of instructions (the compiler could outline the last store across different functions).
I'm not sure this translated into better performance though, only a slightly smaller function size.

I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores from 0 to ElementCount solved it and gave me a speed up on larger memmoves.

Could you share the resulting asm?

Sorry I hadn't seen this earlier; the notification must have fallen through the cracks. I'll share here the same thing I shared with you. I won't share the full memmove function as that is a lot of code, but basically the difference in codegen between before and after your change is that in the forward and backward loops the stores go from:
40fb14: ad011da6 stp q6, q7, [x13, #32]
40fb18: ad0015a4 stp q4, q5, [x13]
to:
40fb14: ad0015a4 stp q4, q5, [x13]
40fb18: ad011da6 stp q6, q7, [x13, #32]

And the latter is preferred on AArch64.

gchatelet updated this revision to Diff 406753. Feb 8 2022, 2:50 AM
  • Non recursive Repeated::move implementation
This revision was not accepted when it landed; it landed in state Needs Review. Feb 8 2022, 3:55 AM
This revision was automatically updated to reflect the committed changes.