This is an archive of the discontinued LLVM Phabricator instance.

[libc] Align functions to 64B to lower benchmarking variance
Needs Review · Public

Authored by gchatelet on May 4 2022, 1:30 PM.

Details

Reviewers
sivachandra

Diff Detail

Event Timeline

gchatelet created this revision. May 4 2022, 1:30 PM
Herald added projects: Restricted Project, Restricted Project. May 4 2022, 1:30 PM
gchatelet requested review of this revision. May 4 2022, 1:30 PM

@sivachandra at first I thought I should use LLVM_LIBC_FUNCTION_ATTR but it does not work. Should it be moved to the definition instead of the declaration?
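
For illustration, here is a minimal sketch of the kind of attribute in question, assuming a GCC/Clang toolchain; the LIBC_ALIGN_64 macro and my_memcpy function below are hypothetical stand-ins, not the names used by the patch:

// Hypothetical sketch only: GCC and Clang accept the `aligned` attribute on
// functions, raising the code alignment of the definition it is attached to.
#define LIBC_ALIGN_64 __attribute__((aligned(64)))

// Placeholder standing in for a libc entry point.
LIBC_ALIGN_64 void *my_memcpy(void *dst, const void *src, unsigned long count) {
  auto *d = static_cast<char *>(dst);
  const auto *s = static_cast<const char *>(src);
  for (unsigned long i = 0; i < count; ++i)
    d[i] = s[i];
  return dst;
}

A global alternative, if attributes on declarations turn out not to propagate, would be a compiler flag such as -falign-functions=64.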

I have a few questions about this patch:

  1. Do you want all libc functions to be aligned to 64 bytes, or all public functions to be aligned to 64 bytes? Or, do you just want the memory functions to be aligned to 64 bytes?
  2. What effect should I see with and without this patch? For example, with memcpy, I notice that it gets an address of 0 with or without this patch.

I have a few questions about this patch:

  1. Do you want all libc functions to be aligned to 64 bytes, or all public functions to be aligned to 64 bytes? Or, do you just want the memory functions to be aligned to 64 bytes?

This was my original answer:

We've measured a significant (+30%) swing in performance in microbenchmarks for several x86 microarchitectures (Intel Haswell, Intel Skylake, AMD Rome) and for a variety of memory functions: read-only functions (e.g. memcmp, bcmp), write-only functions (e.g. memset, bzero), and read/write functions (e.g. memmove, memcpy).
So I'd be inclined to think that at least all functions touching memory are subject to this swing on x86.
Now, I don't see a specific link between the alignment of the code and the performance of read/write operations, so I'd be tempted to conclude that this behaviour generalizes to all functions.
Do you want me to gather evidence of this behaviour for other functions before moving forward with this patch?
Also, maybe we can lower the alignment requirement to 32B; the default is 16B.

But taking a step back here, it seems that the swing happens mostly for distributions that exercise large sizes (namely uniform 384 to 4096).
This corresponds to code running in a loop and using vector instructions.
On x86, these instructions take many bytes to encode, and the CPU's frontend can only decode up to 16B per cycle (32B for Rome, AFAIR).
Usually the decoded instructions are cached to prevent tight loops from being frontend bound, but we know that under certain circumstances the cache is evicted, forcing decoding to happen again.
We also know that the caching is based on instruction addresses, so aligning the function may just, by chance, help with this.
I'd need to do more tests to check this assumption though.

(source https://stackoverflow.com/a/26003659)


  2. What effect should I see with and without this patch? For example, with memcpy, I notice that it gets an address of 0 with or without this patch.

You cannot see the effect of this patch by looking at the asm in the generated .o or .a; the final addresses are only visible once linked into the final binary.
You can witness the requirement with objdump though. The -h option dumps the section headers, where you can see the required alignment of 2**6 (64); without this patch the alignment is 2**4 (16).

 % objdump -h ~/llvm-project_rel_compiled-with-clang/projects/libc/src/string/CMakeFiles/libc.src.string.memcpy_opt_host.__internal__.dir/memcpy.cpp.o       

/redacted/llvm-project_rel_compiled-with-clang/projects/libc/src/string/CMakeFiles/libc.src.string.memcpy_opt_host.__internal__.dir/memcpy.cpp.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000000  0000000000000000  0000000000000000  00000040  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text._ZN11__llvm_libc6memcpyEPvPKvm 00000183  0000000000000000  0000000000000000  00000040  2**6
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 .rodata._ZN11__llvm_libc6memcpyEPvPKvm 00000014  0000000000000000  0000000000000000  000001c4  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  3 .comment      00000067  0000000000000000  0000000000000000  000001d8  2**0
                  CONTENTS, READONLY
  4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  0000023f  2**0
                  CONTENTS, READONLY
  5 .llvm_addrsig 00000000  0000000000000000  0000000000000000  00000348  2**0
                  CONTENTS, READONLY, EXCLUDE
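
To observe the effect at run time in a fully linked binary (where the symbol no longer sits at address 0), one hypothetical check, sketched here against the system memcpy rather than the patched libc, is to print the function's address and its offset within a 64B boundary:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // Hypothetical check: a 64B-aligned function should land at an address
  // that is a multiple of 64 in the final binary. In the unlinked .o the
  // symbol's VMA is still 0, which is why objdump -h inspects the section
  // alignment instead.
  auto addr = reinterpret_cast<std::uintptr_t>(&std::memcpy);
  printf("memcpy @ 0x%llx, addr %% 64 = %llu\n",
         (unsigned long long)addr, (unsigned long long)(addr % 64));
  return 0;
}

What this prints also depends on how the symbol is resolved (a PLT stub or an IFUNC target may be what the address actually points at), so it is only a sanity check, not a substitute for the objdump output above.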