This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add optimized memset for AArch64
ClosedPublic

Authored by avieira on Aug 10 2021, 10:56 AM.

Details

Summary

Hi,

This is an optimized version of memset for AArch64, improving on the general implementation.

I do believe there is still room for improvement on the generated code though, but I suggest I look at those as follow ups.

Things I'd look to look at are:

  • a different Bump for single operands like SplatSet as if you only need to align a single pointer I believe it is simpler to just mask away the bottom bits;
  • introducing a DoLoop where we use a do-while-loop rather than a for-loop to use in situations where we know to have at least a single iteration, ideally the compiler would figure this one out through some valuerange analysis, but unfortunately it isn't currently, so maybe we can help it out;
  • changing Chained to do the Tail last, thus creating chains of stores & loads that are contiguous as I suspect that could potentially help with fetching behaviour in loops
  • trying to get the non dc zva loop in memset to use stores with post-increment as that could help reduce the loop to two stores, a cmp and a branch. This will probably require compiler changes though.

Diff Detail

Event Timeline

avieira created this revision.Aug 10 2021, 10:56 AM
avieira requested review of this revision.Aug 10 2021, 10:56 AM
Matt added a subscriber: Matt.Aug 10 2021, 1:24 PM

@gchatelet is on vacation and will be back early September. I would prefer that he reviews this patch. Is it OK to wait?

Yeah no problem!

Thx for the patch Andre!

Quick questions:

  • Are dc zva and mrs broadly available on aarch64 or should they be guarded by some #define?
  • Could we use compiler builtins instead?

PS: I'll be available this week (then in vacation again).

libc/src/string/aarch64/memset.cpp
18–29

Hmm the unaligned Tail<_64> is preventing the reuse of the Loop logic...
I'll update the code to separate the looping element from the tail element.

libc/src/string/memory_utils/elements_aarch64.h
22–42

if __ARM_NEON is undefined, the aarch64_memset namespace will not exist which will make the code fail to compile in libc/src/string/aarch64/memset.cpp.

38–39

maybe

using _32 = Repeated<_16, 2>;
using _64 = Repeated<_16, 4>;
gchatelet added inline comments.Aug 17 2021, 1:56 AM
libc/src/string/aarch64/memset.cpp
18–29

I've submitted rG8e4efad9917ce0b7d1751c34a8d6907e610050e6.

You can now write:

struct ZVA {
  static constexpr size_t kSize = 64;
  static void SplatSet(char *dst, const unsigned char value, size_t size) {
      assert(value == 0);
      asm("dc zva, %[dst]" : : [dst] "r"(dst) : "memory");
  }
};

And use SplatSet<Align<_64, Arg::_1>::Then<Loop<ZVA, _64>>>(dst, 0, count);

avieira updated this revision to Diff 373229.Sep 17 2021, 8:23 AM
avieira marked 4 inline comments as done.

Are dc zva and mrs broadly available on aarch64 or should they be guarded by some #define?
They are available for any AArch64 ISA, so no define required since this is for AArch64 only.

Could we use compiler builtins instead?
There are currently no compiler builtins for these.

gchatelet accepted this revision.Sep 21 2021, 7:29 AM
This revision is now accepted and ready to land.Sep 21 2021, 7:29 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptSep 23 2021, 1:22 AM