This patch adds the huge pages support (-hugify) for PIE/no-PIE
binaries. Also returned functionality to support the kernels < 5.10
where there is a problem in a dynamic loader with the alignment of
pages addresses.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for working on improving hugify!
bolt/include/bolt/Passes/Hugify.h | ||
---|---|---|
4–7 | This is the correct license | |
bolt/lib/Passes/Hugify.cpp | ||
3–7 | Outdated license | |
56 | update name here | |
bolt/lib/RuntimeLibs/HugifyRuntimeLibrary.cpp | ||
64–87 | I think we can delete all of this, right? emitBinary() for HugyfiRuntimeLibrary doesn't need to do anything if you are already creating a new function in a new pass that will be called from the runtime. And then go to HugifyRuntimeLibrary.h and delete SecNames.push_back(".bolt.hugify.entries"); | |
bolt/runtime/common.h | ||
416 | capitalize Buf to follow LLVM coding style | |
524–525 | capitalize Option, Arg2, etc. to follow LLVM coding style | |
bolt/runtime/hugify.cpp | ||
1–9 | This is the wrong license. LLVM license has updated to Apache v2.0 with LLVM Exceptions (the one that was previously in the header). Could you revert back this change? | |
22 | Can you remove "_ptr" from the name, since it is not a function pointer anymore? I suggest naming "__bolt_hugify_start_program" (but it's just a suggestion) | |
29–75 | Move to common.h, close to "hexToLong" capitalize variables to follow llvm style (see hexToLong example). | |
48 | getKernelVersion(Val) to follow LLVM coding style (I realize this file doesn't follow LLVM style, but it is small enough that we can just update it) | |
76–77 | hasPagecacheTHPSupport | |
77–79 | Buf | |
81 | FD | |
90–97 | struct KernelVersionTy { uint32_t major; uint32_t minor; uint32_t release; }; KernelVersionTy KernelVersion; | |
109 | hugifyForOldKernel(From, To .. |
Hi @rafauler, is it possible to get user-func-reorder.c.tmp & user-func-reorder.c.tmp.exe for analyze?
Those should be available in your build folder, here:
$ ninja check-bolt
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp.exe
I have 2 different environments, first one is ubuntu with 5.10 kernel and the second one is EulerOS with 4.19 kernel, the user-func-reorder test is successfully passed on the both platform.
@rafaelauler Hi Rafael, can you look at again, code style issues are fixed, all tests are successfully passed.
Thanks!
bolt/lib/Rewrite/RewriteInstance.cpp | ||
---|---|---|
494–496 ↗ | (On Diff #447289) | Why is that needed? |
3626–3628 ↗ | (On Diff #447289) | Same |
bolt/runtime/common.h | ||
82 | Why is that needed? |
bolt/lib/Passes/Hugify.cpp | ||
---|---|---|
62 | add a newline at the end of file |
@rafauler Hi Rafael, let me know if you need more details
bolt/lib/Rewrite/RewriteInstance.cpp | ||
---|---|---|
494–496 ↗ | (On Diff #447289) | It's needed due to HUGEPAGE allocation policy and also due to the bug for old kernels where dynamic loader doesn't take into account p_align field. |
bolt/runtime/common.h | ||
82 | good question :-) the user-func-reoder test fails and it was hard to reproduce the cause locally with this attribute we have the following assembly for memcpy: .Loop: ... movzbl (%rsi,%rdi,1),%ecx mov %cl,(%rax,%rdi,1) add $0x1,%rdi cmp %rdi,%r9 jne a004a0 <_fini+0x2c4> ... mov %r14,%rdi mov %r15,%rsi mov %rbx,%rdx callq .Loop copying is performed by byte with verification without this attribute I see the following: .Loop: ... movzbl 0x0(%r13,%rax,1),%edx mov %dl,(%rbx,%rax,1) movzbl 0x1(%r13,%rax,1),%edx mov %dl,0x1(%rbx,%rax,1) movzbl 0x2(%r13,%rax,1),%edx mov %dl,0x2(%rbx,%rax,1) movzbl 0x3(%r13,%rax,1),%edx mov %dl,0x3(%rbx,%rax,1) movzbl 0x4(%r13,%rax,1),%edx mov %dl,0x4(%rbx,%rax,1) movzbl 0x5(%r13,%rax,1),%edx mov %dl,0x5(%rbx,%rax,1) movzbl 0x6(%r13,%rax,1),%edx mov %dl,0x6(%rbx,%rax,1) movzbl 0x7(%r13,%rax,1),%edx mov %dl,0x7(%rbx,%rax,1) add $0x8,%rax cmp %rax,%rcx jne a007f0 <_fini+0x614> copying is performed with unrolling and test fails due to overlapping dst and src addresses for size which is not aligned to 8 bytes |
bolt/lib/Rewrite/RewriteInstance.cpp | ||
---|---|---|
494–496 ↗ | (On Diff #447289) | From the left side, this is already aligned via BinaryContext::PageAlign. This is not just setting p_align, but actually setting the start address to be aligned at 2MB boundary. So this line here is inserting an extra empty 2MB page, but I'm not sure I get the reason why. |
3626–3628 ↗ | (On Diff #447289) | For the right side, this alignment is accomplished by lines RewriteInstance.cpp:3635 (using this diff's lines), where we pad the end of the code section until it is aligned at 2MB. I understand that other code sections might be allocated to the huge page if we don't have these lines added by this diff, but I'm not sure why is that a problem. If you have space left in a huge page, why wouldn't you put code there? |
bolt/runtime/common.h | ||
82 | Ok, I debugged user-func-reoder and noticed that this patch is doing something else. Instead of copying the entire contents of the huge page, it is copying only hot functions. I go back to my original point, why do we need to avoid overlapping segments? If you roll back to previous behavior (copying all contents of the page, including cold text segments), you won't need to insert one extra alignment at the end of each code section. |
Hi Rafael let me try to explain it from a loader point of view on the different kernels
I have 2 kernels: 5.10 and 4.18
"From the left side, this is already aligned via BinaryContext::PageAlign. This is not just setting p_align, but actually setting the start address to be aligned at 2MB boundary. So this line here is inserting an extra empty 2MB page, but I'm not sure I get the reason why."
Here the debug log from runnnig application with --hugify
kernel 5.10
./redis-server.gold.pie.bolt.test2
[hugify] hot start: 563c2fe00000
[hugify] hot end: 563c2fe0df97
[hugify] aligned huge page from: 563c2fe00000
[hugify] aligned huge page to: 563c30000000
kernel 4.18
./redis-server.gold.pie.bolt.test2
[hugify] hot start: 5616f85c3000
[hugify] hot end: 5616f85d0f97
[hugify] aligned huge page from: 5616f8400000
[hugify] aligned huge page to: 5616f8600000
[hugify] workaround with memory alignment for kernel < 5.10
[hugify] allocated temporary address: 7f127eefb000
[hugify] allocated aligned size: 200000
[hugify] allocated size: df97
As fas as you can see the hot start addresses are different.
for 5.10 the one is aligned to 2MB and we can just call madvise directly.
But for 4.18 we see that address is 4 KB aligned, OS does not give us the huge page due to incorrect address.
In order to fix it I put one extra page from left side (between 2 load segments)
it allows me to remap text section to 2MB page.
"For the right side, this alignment is accomplished by lines RewriteInstance.cpp:3635 ..."
Yes, you are right, but it is padding after all execution sections.
I am not sure exactly but probably it's not possibly to have 2 RE regions with different page sizes (2MB & 4KB)
inside of one huge page, but without padding from right side I have SEGV_MAPPER error (it can be OS specific)
[25] .bss NOBITS 0000000000004038 003038 000008 00 WA 0 0 1 [26] .text PROGBITS 0000000000600000 600000 000085 00 AX 0 0 2097152 [27] .text.injected PROGBITS 00000000006000c0 6000c0 000005 00 AX 0 0 64 [28] .text.cold PROGBITS 0000000000600100 600100 0001a9 00 AX 0 0 64 [29] .eh_frame PROGBITS 0000000000800000 800000 000260 00 A 0 0 8 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x600000} --- +++ killed by SIGSEGV +++ Segmentation fault
"bolt/runtime/common.h
Ok, I debugged user-func-reoder and noticed that this patch is doing something else. Instead of copying the entire contents of the huge page, ..."
I think it redundant functionality to copy cold text and get the huge page for it.
Thanks for explaining, Alexey.
Here's what I understood: from the left side, if we take the last allocated address according to PT_LOAD segments, and then try to put a new segment right after it, aligned to 2MB, old loader somehow ignores p_align and loads that into a 4KB address. But if we give 2MB more space, it does align. This seems quite arbitrary, do you confirm this is what you're seeing and that this is the right fix? If yes, this is odd enough that I think it makes sense for us to put this under a special option -hugify=oldkernel.
From the right side, we create multiple ELF sections in the same segment and that crashes the loader. The loader would like to see a single segment aligned at 2MB. Is that it? This also would be better under a special -hugify=oldkernel option.
And then restore the behavior in the hugify library to copy all code instead of just hot code, otherwise we're going to crash if -hugify is used instead of -hugify=oldkernel.
When you guard the code under -hugify=oldkernel, please add a comment linking to this diff for the discussion, so somebody reading the code has the background to understand why this was added in the first place. Weird behaviors such as this need to be thoroughly documented, so it is clear we're compensating for a bug in the system.
Yes, you are right, it's the bug inside kernel loader. Maybe we should exclude the support for old kernel, until someone ask about it?
From the right side, we create multiple ELF sections in the same segment and that crashes the loader. The loader would like to see a single segment aligned at 2MB. Is that it? This also would be better under a special -hugify=oldkernel option.
And then restore the behavior in the hugify library to copy all code instead of just hot code, otherwise we're going to crash if -hugify is used instead of -hugify=oldkernel.
When you guard the code under -hugify=oldkernel, please add a comment linking to this diff for the discussion, so somebody reading the code has the background to understand why this was added in the first place. Weird behaviors such as this need to be thoroughly documented, so it is clear we're compensating for a bug in the system.
Rafael are you sure that we should copy all section from RE segment?
Below the screenshot of .text and .text.cold sections for one of the binaries which I have
Sorry, I wasn't clear. Not all sections, but restore the previous behavior (the behavior that we have right now in trunk), which fills the remaining of the huge page with the start of cold. If we don't do this, we crash. If you remove the right side padding, you will notice the crash.
I'm fine with either (keeping the old kernel support under a flag, or removing it). Btw thanks for your work, I appreciate it.
@rafauler hello, sorry about long wait with patch updating, based on our previous discussion I changed -hugify option which one now takes 2 parameters, 5.10 or 4.18.
For old kernel there is extra padding from left and right sides.
Can you look it once again?
Thanks in advance :-)
Can you check the failed tests?
Failed Tests (1):
BOLT :: runtime/X86/user-func-reorder.c
For some reason I'm getting "Context not available." when reviewing this diff, which makes it hard to review. Can you re-submit it with full context?
bolt/runtime/hugify.cpp | ||
---|---|---|
119–121 | I think we can just use the old behavior here (work with two pointers From, To - they're both aligned to the page, we don't ever need to know the original hot sizes in this function. This makes this function more confusing to read. | |
128 | Mistake is probably here, copying less bytes than are actually needed, program will crash. | |
bolt/test/runtime/X86/user-func-reorder.c | ||
33 ↗ | (On Diff #467656) | You don't need 4.18 option here. The reason this test is failing is because in 5.10, you're not copying all contents of the page. You can't just copy hot contents. You need to restore the old behavior that will copy the entire page, and not just the first bytes that correspond to hot code. In other words, as it is, the 5.10 behavior is broken. I left comments in the function where I suspect the problem is from the last time I debugged this. |
Also, can we change the flag to be a hidden option? Instead of hugify=oldkernel, having the old -hugify and requiring the user to use an extra hidden flag --extra-hugify-padding
Sure, we can, it will be much better for usage, thank for review. I need few time to check this changes on local environment and I upload fixes soon.
Hello Rafael, I am little bit confused with clang and pie binaries,
I use gcc mostly, my local env:
gcc (GCC) 7.3.0
linux 4.18.0 x86_64
cat test.c
#include <stdio.h>
#include <dlfcn.h>
#include <unistd.h>
#include <stdint.h>
#include "foo.h"
int main(void) {
void (*foo_func)(uint64_t); void *fd = dlopen("./libfoo.so", RTLD_LAZY); if (!fd) { printf("libfoo.so load failed\n"); return -1; } foo_func = dlsym(fd, "foo"); if (!foo_func) { printf("function loading is failed\n"); return -1; } for(size_t i = 0; i < 3; ++i) { foo_func(i); } dlclose(fd); return 0;
}
cat foo.c
cat foo.c
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
attribute((constructor)) void foo_init(void) {
printf("foo_init ctr\n");
}
attribute((destructor)) void foo_fini(void) {
printf("foo_init destr\n");
}
void foo(uint64_t num) {
printf("iter %lld\n", num);
}
cat foo.h
#ifndef FOO_H
#define FOO_H
void foo(uint64_t);
#endif //FOO_H
$CC foo.c -O0 -g3 -fPIC -Wl,--emit-relocs -shared -o libfoo.so
$CC test.c -O0 -g3 -fPIE -I. -Wl,-pie -Wl,--emit-relocs -ldl -Wl,-pie -o test_pie
llvm-bolt ./test_pie -instrument -o ./test_pie.inst -instrumentation-file=./test_pie.fdata -instrumentation-no-counters-clear -instrumentation-sleep-time=1
./test_pie.inst
llvm-bolt -hugify ./test_pie -o ./test_pie.hugify -data=./test_pie.fdata
file ./test_pie.hugify
./test_pie.hugify: ELF 64-bit LSB shared object ....
./test_pie.hugify
[hugify] hot start: 55a316b65000
[hugify] hot end: 55a316b65218
[hugify] aligned huge page from: 55a316a00000
[hugify] aligned huge page to: 55a316c00000
[hugify] workaround with memory alignment for kernel < 5.10
[hugify] allocated temporary space: 7f8321362000
foo_init ctr
iter 0
iter 1
iter 2
foo_init destr
As far as you can see the hot start address (0x55a316b65000) is 4KB aligned,
adding extra padding from left and right sides allows remap it correctly in runtime to
0x55a316a00000 address.
That why I need 4 arguments for hugifyForOldKernel(HotStart, HotEnd, AlignedFrom, AlignedTo)
HotStart, HotEnd are used to get real .text section size and copy it to temp memory, 0x218 bytes
AlignedFrom, AlignedTo are used to get the size from new area with 2MB address alignment,
it's needed because OS doesn't take huge page without it
I am just trying to reproduce the same for clang but can't get the output like this
[hugify] hot start: 55a316b65000
[hugify] hot end: 55a316b65218
I get the SEG_MAPPER error for src address during memcpy to temp area.
Can you give me piece of advice how do you get pie binaries using clang?
I would be appreciated for any help, thanks in advance.
I tried to repro your test case and I can't actually get it to work, then I realized that the problem is the -fPIE flag to build hugify.cpp. This is also a good reason why we probably should put a special 'pie' test case here (a new line in user-func-reorder.c that also builds it with -pie). If we had that, we would more easily detect this problem earlier in the review.
Here's the problem. If you are going to build hugify.cpp with -fPIE, you have to use the same trick that @yota9 used here https://github.com/llvm/llvm-project/commit/af58da4ef3fbd766b2c44cfdbdb859a21022d10a (reviewed here https://github.com/facebookincubator/BOLT/pull/192).
When building a PIE object file, the compiler will address data objects through a GOT table, and our rudimentary linker that brings in hugify.o into the final binary is unable to properly handle GOT table creation. So, at least in my environment, when I try to run BOLT with hugify option using as input a PIE binary, I end up with hugify trying to copy the wrong start pointers because we failed to create runtime relocations to fix the addresses in the GOT table.
Hopefully by using the "#pragma GCC visibility push(hidden)" trick, you should easily fix this.
By the way, now that I used the pragma fix, I was able to repro the alignment problem you mention and we are investigating. We believe it is a consequence of using ASLR in the kernel (that the 2MB page aligned is ignored and the kernel uses a 4KB one).
Ok, because of the weird ASLR thing on older kernels, now it is clear that the extra padding is going to be necessary for every PIE binary. So instead of having --hugify-extra-padding, we can kill this flag and always insert the extra padding whenever a PIE binary is being processed with -hugify.
To detect whether the input is PIE or not, use !HasFixedLoadAddress:
https://github.com/llvm/llvm-project/blob/main/bolt/include/bolt/Core/BinaryContext.h#L606
So just replace opts::HugifyExtraPadding with !BC.hasFixedLoadAddress.
So now it's clear the case for hugifyForOldKernel(HotStart, HotEnd, AlignedFrom, AlignedTo) with 4 args. But.. it still should copy the entire page contents instead of just hot code. When you ask the kernel to unmap all these addresses, we need first to copy whatever was there before, even if it was all zeroes (because in the no-PIC case, it won't be just zeroes).
bolt/lib/Rewrite/RewriteInstance.cpp | ||
---|---|---|
3690–3691 ↗ | (On Diff #468828) | Another thing is that we should right pad here just for hot section, not for every code section |
Rafael thanks so much for the tips and the review, checked locally and now everything works as it should be
Rafael I will create new separate test for hugify PIE/non-PIE binaries and upload it tomorrow
Thanks for working on this! Let's sync one last time our understanding of the implementation of hugifyForOldKernel. Sorry if being repetitive, but it is important now to be on the same page regarding what is happening during runtime in both PIC and no-PIC cases. See if you agree with me with respect to the AlignedFrom/AlignedTo/AlignedSize usages in the suggestions, and please point me any issues in my understanding.
If we look at hugifyForOldKernel(), the code suggested here currently copies only a part of the page that is determined by From, To. We now know that "From", because of ASLR ignoring our alignment requirements, may not be aligned. Now suppose it lands in the middle of the page and that "To" (the end of hot code section) lands in the middle of the next page.
2MB huge page virtual memory map:
page1 - 0x400000:
hot start: 0x500000
page2 - 0x600000:
hot end: 0x700000
page 3 - 0x800000
In this case, according to lines 146-149, you will align "hot start" and "hot end" to 0x400000 and 0x800000, respectively, and ask the kernel to unmap these pages. So you will be unmapping 4MB of code. However, the code will be memcpy-ing 2MB of code from 0x500000 to 0x700000, and then copying it back after the kernel successfully mmaps the requested region into two huge pages.
Now, because you inserted extra padding in RewriteInstance:cpp:103 and 308, the fact that you are leaving 1MB before hot start not copied, and 1MB after hot end as well, is not really a problem.
However, in the non-PIC code, we are not inserting any extra padding. After hot end, at address 0x700000, we will have a large amount of code (coming from cold code of hot functions, those that were split). We will also have a bunch of extra code including the hugify runtime library itself, in some cases.
If we memcpy from 0x400000 to 0x800000 instead of the original 500000 to 700000, we will be erring on the safe side by always copying any memory contents that are being essentially erased after you ask the kernel to unmap them. That's why when using this mmap calls, we typically copy all page contents instead of just a subset of the (hot) bytes. It's also safe to reference these memory addresses (from 700000 to 800000) without the risk of segfaulting because BOLT will always pad the last code section in no-PIE -- the padding won't be correct for PIE because ASLR loader will misalign the start, but luckily we are inserting one extra page at the end in these cases, so the addresses from 700000 to 800000 will be filled with zeroes and won't segfault.
What you did in the last iteration was to expand hot_end towards one extra 4KB page, but that is not enough as in line 115 we are asking the kernel to unmap whole 2MB regions of text.
Does that sound reasonable? anything I'm missing?
bolt/lib/Rewrite/RewriteInstance.cpp | ||
---|---|---|
102 ↗ | (On Diff #471146) | Also add && opts::Hugify |
306 ↗ | (On Diff #471146) | Also add && opts::Hugify |
bolt/runtime/hugify.cpp | ||
98 | here we can pass AlignedFrom and Alignedto only | |
115 | Here use AlignedSize | |
128 | Here use AlignedFrom, AlignedSize | |
137 | ...because here you are using Aligned Size | |
163 | I'll suggest doing something else in line 97. |
Yes, you are right, thank you for clarifying, part of the problem was with copying to the time area, changed the size at 97 line and removed the redundant function argument. Added simple test for both types of binaries.
This is failing on our builders with the following error:
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:3: error: unknown type name 'uint8_t' uint8_t *HotStart = (uint8_t *)&__hot_start; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:24: error: use of undeclared identifier 'uint8_t' uint8_t *HotStart = (uint8_t *)&__hot_start; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:33: error: expected expression uint8_t *HotStart = (uint8_t *)&__hot_start; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:35: error: use of undeclared identifier '__hot_start' uint8_t *HotStart = (uint8_t *)&__hot_start; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:3: error: unknown type name 'uint8_t' uint8_t *HotEnd = (uint8_t *)&__hot_end; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:22: error: use of undeclared identifier 'uint8_t' uint8_t *HotEnd = (uint8_t *)&__hot_end; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:31: error: expected expression uint8_t *HotEnd = (uint8_t *)&__hot_end; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:33: error: use of undeclared identifier '__hot_end' uint8_t *HotEnd = (uint8_t *)&__hot_end; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:145:9: error: unknown type name 'size_t' const size_t HugePageBytes = 2L * 1024 * 1024; ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:3: error: unknown type name 'uint8_t' uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1)); ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:32: error: use of undeclared identifier 'intptr_t' uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1)); ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:147:3: error: unknown type name 'uint8_t' uint8_t *To = HotEnd + (HugePageBytes - 1); ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:148:10: error: use of undeclared identifier 'intptr_t' To -= (intptr_t)To & (HugePageBytes - 1); ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:150:47: error: use of undeclared identifier 'uint64_t' DEBUG(reportNumber("[hugify] hot start: ", (uint64_t)HotStart, 16);) ^ /opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:172:24: error: expected string literal in 'asm' __asm__ __volatile__(SAVE_ALL "call __bolt_hugify_self_impl\n" RESTORE_ALL ^ 15 errors generated.
Looks like hugify.cpp is missing cstdint include? Would it be possible to take a look?
I can't repro. It's probably these wonderful/unreadable ifdefs we have here, because we do include cstdint in common.h. What is the system your builder is running?
bolt/runtime/CMakeLists.txt | ||
---|---|---|
28 | As for hugify it is OK to use fPIE the rest of the libs must use fPIC flag. I suggest to use fPIC here, there should be no real difference |
This is the correct license