Details
- Reviewers
sivachandra
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
@sivachandra at first I thought I should use LLVM_LIBC_FUNCTION_ATTR but it does not work. Should it be moved to the definition instead of the declaration?
I have a few questions about this patch:
- Do you want all libc functions to be aligned to 64 bytes, or all public functions to be aligned to 64 bytes? Or, do you just want the memory functions to be aligned to 64 bytes?
- What effect should I see with and without this patch? For example, with memcpy, I notice that it gets an address of 0 with or without this patch.
This was my original answer:
We've measured a significant (+30%) swing in performance in microbenchmarks for several x86 microarchitectures (Intel Haswell, Intel Skylake, AMD Rome) and for a variety of memory functions: read only functions (e.g. memcmp, bcmp), write only functions (e.g. memset, bzero) and read/write functions (e.g. memmove, memcpy).
So I'd be inclined to think that at least all functions touching memory are subject to this swing for x86.
Now, I don't see a specific link between alignment of the code and performance of read/write operations so I'd be tempted to conclude that this behaviour is generalizable to all functions.
Do you want me to gather evidence of this behaviour for other functions before moving forward with this patch?
Also, maybe we can lower the alignment requirement to 32B; the default is 16B.
But taking a step back here, it seems that the swing happens mostly for distributions that exercise large sizes (namely uniform 384 to 4096).
This corresponds to code running in a loop and using vector instructions.
On x86 these instructions take many bytes to encode and the CPU's frontend can only decode up to 16B per cycle (32B for Rome AFAIR).
Usually the decoded instructions are cached to prevent tight loops from being frontend bound, but we know that under certain circumstances the cache is evicted, which forces the instructions to be decoded again.
We also know that the caching is based on instruction addresses so aligning the function may just - by chance - help with this.
I'd need to do more tests to check this assumption though.
(source https://stackoverflow.com/a/26003659)
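To make the hypothesis above a bit more concrete, here is a minimal sketch of the arithmetic involved. It only counts how many aligned fetch/decode windows a block of code touches depending on where it starts. The helper name `windows_spanned` and the window sizes passed to it are assumptions for illustration; this does not model any real microarchitecture and does not replace the measurements mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of `window`-byte aligned blocks touched by a
// code region of `size` bytes that starts at address `start`.
// `window` stands for whatever granularity the frontend works at
// (e.g. 16 or 32 bytes); the exact value is an assumption here.
std::uint64_t windows_spanned(std::uint64_t start, std::uint64_t size,
                              std::uint64_t window) {
  assert(size > 0 && window > 0);
  const std::uint64_t first = start / window;             // block holding the first byte
  const std::uint64_t last = (start + size - 1) / window; // block holding the last byte
  return last - first + 1;
}
```

For example, a 30-byte loop body touches two 16-byte windows when it starts at offset 0 (`windows_spanned(0, 30, 16)`), but three when it starts at offset 8 (`windows_spanned(8, 30, 16)`). Aligning the function entry only fixes the starting offset of the whole function, so whether a given inner loop benefits would still depend on the code laid out before it.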
- What effect should I see with and without this patch? For example, with memcpy, I notice that it gets an address of 0 with or without this patch.
You cannot see the effect of this patch by looking at the asm in the generated .o or .a; the effect is only visible once the function is linked into the final binary.
You can witness it by using objdump though. The -h option dumps the section headers, where you can see the required alignment 2**6 (64); without this patch the alignment is 2**4 (16).
% objdump -h ~/llvm-project_rel_compiled-with-clang/projects/libc/src/string/CMakeFiles/libc.src.string.memcpy_opt_host.__internal__.dir/memcpy.cpp.o

/redacted/llvm-project_rel_compiled-with-clang/projects/libc/src/string/CMakeFiles/libc.src.string.memcpy_opt_host.__internal__.dir/memcpy.cpp.o:     file format elf64-x86-64

Sections:
Idx Name                                   Size      VMA               LMA               File off  Algn
  0 .text                                  00000000  0000000000000000  0000000000000000  00000040  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text._ZN11__llvm_libc6memcpyEPvPKvm   00000183  0000000000000000  0000000000000000  00000040  2**6
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 .rodata._ZN11__llvm_libc6memcpyEPvPKvm 00000014  0000000000000000  0000000000000000  000001c4  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  3 .comment                               00000067  0000000000000000  0000000000000000  000001d8  2**0
                  CONTENTS, READONLY
  4 .note.GNU-stack                        00000000  0000000000000000  0000000000000000  0000023f  2**0
                  CONTENTS, READONLY
  5 .llvm_addrsig                          00000000  0000000000000000  0000000000000000  00000348  2**0
                  CONTENTS, READONLY, EXCLUDE
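As a complement to the objdump output, the alignment can also be checked from inside a linked program. The following is a minimal sketch, assuming a GCC or Clang toolchain that supports `__attribute__((aligned(N)))` on functions. `copy_bytes` and `is_entry_aligned` are hypothetical names used for illustration, not part of LLVM libc, and the attribute is placed directly on the definition rather than through any libc macro.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a libc memory function (not the real memcpy).
// The attribute requests that the function entry be 64-byte aligned.
__attribute__((aligned(64)))
void *copy_bytes(void *dst, const void *src, std::size_t n) {
  auto *d = static_cast<unsigned char *>(dst);
  const auto *s = static_cast<const unsigned char *>(src);
  for (std::size_t i = 0; i < n; ++i)
    d[i] = s[i];
  return dst;
}

// Returns true if the address of `fn` is a multiple of `alignment`.
// Converting a function pointer to an integer is only conditionally
// supported by the standard, but GCC and Clang accept it.
template <typename Fn>
bool is_entry_aligned(Fn *fn, std::uintptr_t alignment) {
  assert(alignment != 0);
  return reinterpret_cast<std::uintptr_t>(fn) % alignment == 0;
}
```

Calling `is_entry_aligned(&copy_bytes, 64)` in the final binary should report whether the linker honored the 2**6 section alignment shown above.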