This is an archive of the discontinued LLVM Phabricator instance.

[BOLT][HUGIFY] adds huge pages support of PIE/no-PIE binaries
ClosedPublic

Authored by yavtuk on Jul 5 2022, 12:57 AM.

Details

Summary

This patch adds the huge pages support (-hugify) for PIE/no-PIE
binaries. Also returned functionality to support the kernels < 5.10
where there is a problem in a dynamic loader with the alignment of
pages addresses.

Diff Detail

Event Timeline

yavtuk created this revision.Jul 5 2022, 12:57 AM
Herald added a reviewer: maksfb. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
yavtuk requested review of this revision.Jul 5 2022, 12:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2022, 12:57 AM
yavtuk updated this revision to Diff 442794.Jul 6 2022, 11:54 PM

upload small fixes based on https://reviews.llvm.org/D129168 commit

yavtuk updated this revision to Diff 442809.Jul 7 2022, 12:55 AM

Full diff has uploaded

yavtuk updated this revision to Diff 442876.Jul 7 2022, 5:58 AM
yavtuk updated this revision to Diff 442897.Jul 7 2022, 6:52 AM

Thanks for working on improving hugify!

bolt/include/bolt/Passes/Hugify.h
4–7

This is the correct license

bolt/lib/Passes/Hugify.cpp
3–7

Outdated license

56

update name here

bolt/lib/RuntimeLibs/HugifyRuntimeLibrary.cpp
62–63

I think we can delete all of this, right? emitBinary() for HugyfiRuntimeLibrary doesn't need to do anything if you are already creating a new function in a new pass that will be called from the runtime.

And then go to HugifyRuntimeLibrary.h and delete

SecNames.push_back(".bolt.hugify.entries");
bolt/runtime/common.h
437

capitalize Buf to follow LLVM coding style

545–546

capitalize Option, Arg2, etc. to follow LLVM coding style

bolt/runtime/hugify.cpp
1–6

This is the wrong license. LLVM license has updated to Apache v2.0 with LLVM Exceptions (the one that was previously in the header). Could you revert back this change?

29

Can you remove "_ptr" from the name, since it is not a function pointer anymore? I suggest naming "__bolt_hugify_start_program" (but it's just a suggestion)

36–62

Move to common.h, close to "hexToLong"

capitalize variables to follow llvm style (see hexToLong example).

55

getKernelVersion(Val)

to follow LLVM coding style

(I realize this file doesn't follow LLVM style, but it is small enough that we can just update it)

62–63

hasPagecacheTHPSupport

63–65

Buf

67

FD

93–94

struct KernelVersionTy {

  uint32_t major;
  uint32_t minor;
  uint32_t release;
};

KernelVersionTy KernelVersion;

107

hugifyForOldKernel(From, To ..

Hello @rafaelauler, thanks for the comments, I add the fixes soon

yavtuk updated this revision to Diff 445735.Jul 19 2022, 1:57 AM
yavtuk updated this revision to Diff 445737.Jul 19 2022, 2:03 AM
yavtuk updated this revision to Diff 445742.Jul 19 2022, 2:07 AM
yavtuk updated this revision to Diff 445778.Jul 19 2022, 4:55 AM
yavtuk marked 15 inline comments as done.
yavtuk updated this revision to Diff 445784.Jul 19 2022, 5:04 AM
yavtuk updated this revision to Diff 446089.Jul 20 2022, 2:33 AM
This comment was removed by yavtuk.
bolt/lib/Passes/Hugify.cpp
3–7

updated

bolt/lib/RuntimeLibs/HugifyRuntimeLibrary.cpp
62–63

yes, you are right, thanks

Hi @rafauler, is it possible to get user-func-reorder.c.tmp & user-func-reorder.c.tmp.exe for analyze?

Those should be available in your build folder, here:

$ ninja check-bolt
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp.exe

yavtuk updated this revision to Diff 446362.Jul 20 2022, 11:33 PM
yavtuk updated this revision to Diff 446363.Jul 20 2022, 11:42 PM

Those should be available in your build folder, here:

$ ninja check-bolt
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp.exe

I have 2 different environments, first one is ubuntu with 5.10 kernel and the second one is EulerOS with 4.19 kernel, the user-func-reorder test is successfully passed on the both platform.

yavtuk updated this revision to Diff 446381.Jul 21 2022, 1:23 AM

removed debugging information from the user-func-order test

finally I've reproduced the problem

yavtuk updated this revision to Diff 446427.Jul 21 2022, 4:24 AM

removed -no-pie for user-func-order test, this test uses --hugify option

yavtuk updated this revision to Diff 446436.Jul 21 2022, 4:46 AM
yavtuk updated this revision to Diff 447273.Jul 25 2022, 4:28 AM

added checking of address overlapping for memcpy implementation

yavtuk updated this revision to Diff 447280.Jul 25 2022, 4:57 AM

git-clang-format updates

yavtuk updated this revision to Diff 447282.Jul 25 2022, 5:12 AM
yavtuk updated this revision to Diff 447289.Jul 25 2022, 5:41 AM

added noinline attribute for memcpy

@rafaelauler Hi Rafael, can you look at again, code style issues are fixed, all tests are successfully passed.

Thanks!

bolt/lib/Rewrite/RewriteInstance.cpp
494–496

Why is that needed?

3626–3628

Same

bolt/runtime/common.h
82

Why is that needed?

rafauler added inline comments.Jul 27 2022, 4:28 PM
bolt/lib/Passes/Hugify.cpp
51

add a newline at the end of file

yavtuk marked an inline comment as done.Aug 8 2022, 12:14 AM

@rafauler Hi Rafael, let me know if you need more details

bolt/lib/Rewrite/RewriteInstance.cpp
494–496

It's needed due to HUGEPAGE allocation policy and also due to the bug for old kernels where dynamic loader doesn't take into account p_align field.
Dynamic loader allocates and maps the segments sequentially with 4KB addresses alignment. If we want to get HUGEPAGE from OS we have to have the address for page with 2MB alignment. For that, I add padding from left and right sides in order to exclude overlapping between segments.

bolt/runtime/common.h
82

good question :-) the user-func-reoder test fails and it was hard to reproduce the cause locally
since it's related to compiler

with this attribute we have the following assembly for memcpy:

.Loop:

...
movzbl (%rsi,%rdi,1),%ecx
mov    %cl,(%rax,%rdi,1)
add    $0x1,%rdi
cmp    %rdi,%r9
jne    a004a0 <_fini+0x2c4>
...


mov    %r14,%rdi
mov    %r15,%rsi
mov    %rbx,%rdx
callq  .Loop

copying is performed by byte with verification

without this attribute I see the following:

.Loop:

... 
movzbl 0x0(%r13,%rax,1),%edx
mov    %dl,(%rbx,%rax,1)
movzbl 0x1(%r13,%rax,1),%edx
mov    %dl,0x1(%rbx,%rax,1)
movzbl 0x2(%r13,%rax,1),%edx
mov    %dl,0x2(%rbx,%rax,1)
movzbl 0x3(%r13,%rax,1),%edx
mov    %dl,0x3(%rbx,%rax,1)
movzbl 0x4(%r13,%rax,1),%edx
mov    %dl,0x4(%rbx,%rax,1)
movzbl 0x5(%r13,%rax,1),%edx
mov    %dl,0x5(%rbx,%rax,1)
movzbl 0x6(%r13,%rax,1),%edx
mov    %dl,0x6(%rbx,%rax,1)
movzbl 0x7(%r13,%rax,1),%edx
mov    %dl,0x7(%rbx,%rax,1)
add    $0x8,%rax
cmp    %rax,%rcx
jne    a007f0 <_fini+0x614>

copying is performed with unrolling and test fails due to overlapping dst and src addresses for size which is not aligned to 8 bytes

rafauler added inline comments.Aug 8 2022, 5:09 PM
bolt/lib/Rewrite/RewriteInstance.cpp
494–496

From the left side, this is already aligned via BinaryContext::PageAlign. This is not just setting p_align, but actually setting the start address to be aligned at 2MB boundary. So this line here is inserting an extra empty 2MB page, but I'm not sure I get the reason why.

3626–3628

For the right side, this alignment is accomplished by lines RewriteInstance.cpp:3635 (using this diff's lines), where we pad the end of the code section until it is aligned at 2MB. I understand that other code sections might be allocated to the huge page if we don't have these lines added by this diff, but I'm not sure why is that a problem. If you have space left in a huge page, why wouldn't you put code there?

bolt/runtime/common.h
82

Ok, I debugged user-func-reoder and noticed that this patch is doing something else. Instead of copying the entire contents of the huge page, it is copying only hot functions. I go back to my original point, why do we need to avoid overlapping segments? If you roll back to previous behavior (copying all contents of the page, including cold text segments), you won't need to insert one extra alignment at the end of each code section.

Hi Rafael let me try to explain it from a loader point of view on the different kernels
I have 2 kernels: 5.10 and 4.18

"From the left side, this is already aligned via BinaryContext::PageAlign. This is not just setting p_align, but actually setting the start address to be aligned at 2MB boundary. So this line here is inserting an extra empty 2MB page, but I'm not sure I get the reason why."

Here the debug log from runnnig application with --hugify
kernel 5.10
./redis-server.gold.pie.bolt.test2
[hugify] hot start: 563c2fe00000
[hugify] hot end: 563c2fe0df97
[hugify] aligned huge page from: 563c2fe00000
[hugify] aligned huge page to: 563c30000000

kernel 4.18
./redis-server.gold.pie.bolt.test2
[hugify] hot start: 5616f85c3000
[hugify] hot end: 5616f85d0f97
[hugify] aligned huge page from: 5616f8400000
[hugify] aligned huge page to: 5616f8600000
[hugify] workaround with memory alignment for kernel < 5.10
[hugify] allocated temporary address: 7f127eefb000
[hugify] allocated aligned size: 200000
[hugify] allocated size: df97

As fas as you can see the hot start addresses are different.
for 5.10 the one is aligned to 2MB and we can just call madvise directly.
But for 4.18 we see that address is 4 KB aligned, OS does not give us the huge page due to incorrect address.
In order to fix it I put one extra page from left side (between 2 load segments)
it allows me to remap text section to 2MB page.

"For the right side, this alignment is accomplished by lines RewriteInstance.cpp:3635 ..."

Yes, you are right, but it is padding after all execution sections.
I am not sure exactly but probably it's not possibly to have 2 RE regions with different page sizes (2MB & 4KB)
inside of one huge page, but without padding from right side I have SEGV_MAPPER error (it can be OS specific)

[25] .bss              NOBITS          0000000000004038 003038 000008 00  WA  0   0  1
[26] .text             PROGBITS        0000000000600000 600000 000085 00  AX  0   0 2097152
[27] .text.injected    PROGBITS        00000000006000c0 6000c0 000005 00  AX  0   0 64
[28] .text.cold        PROGBITS        0000000000600100 600100 0001a9 00  AX  0   0 64
[29] .eh_frame         PROGBITS        0000000000800000 800000 000260 00   A  0   0  8

  --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x600000} ---
  +++ killed by SIGSEGV +++
  Segmentation fault

"bolt/runtime/common.h
Ok, I debugged user-func-reoder and noticed that this patch is doing something else. Instead of copying the entire contents of the huge page, ..."

I think it redundant functionality to copy cold text and get the huge page for it.

Thanks for explaining, Alexey.

Here's what I understood: from the left side, if we take the last allocated address according to PT_LOAD segments, and then try to put a new segment right after it, aligned to 2MB, old loader somehow ignores p_align and loads that into a 4KB address. But if we give 2MB more space, it does align. This seems quite arbitrary, do you confirm this is what you're seeing and that this is the right fix? If yes, this is odd enough that I think it makes sense for us to put this under a special option -hugify=oldkernel.

From the right side, we create multiple ELF sections in the same segment and that crashes the loader. The loader would like to see a single segment aligned at 2MB. Is that it? This also would be better under a special -hugify=oldkernel option.

And then restore the behavior in the hugify library to copy all code instead of just hot code, otherwise we're going to crash if -hugify is used instead of -hugify=oldkernel.

When you guard the code under -hugify=oldkernel, please add a comment linking to this diff for the discussion, so somebody reading the code has the background to understand why this was added in the first place. Weird behaviors such as this need to be thoroughly documented, so it is clear we're compensating for a bug in the system.

Thanks for explaining, Alexey.

Here's what I understood: from the left side, if we take the last allocated address according to PT_LOAD segments, and then try to put a new segment right after it, aligned to 2MB, old loader somehow ignores p_align and loads that into a 4KB address. But if we give 2MB more space, it does align. This seems quite arbitrary, do you confirm this is what you're seeing and that this is the right fix? If yes, this is odd enough that I think it makes sense for us to put this under a special option -hugify=oldkernel.

Yes, you are right, it's the bug inside kernel loader. Maybe we should exclude the support for old kernel, until someone ask about it?

From the right side, we create multiple ELF sections in the same segment and that crashes the loader. The loader would like to see a single segment aligned at 2MB. Is that it? This also would be better under a special -hugify=oldkernel option.

And then restore the behavior in the hugify library to copy all code instead of just hot code, otherwise we're going to crash if -hugify is used instead of -hugify=oldkernel.

When you guard the code under -hugify=oldkernel, please add a comment linking to this diff for the discussion, so somebody reading the code has the background to understand why this was added in the first place. Weird behaviors such as this need to be thoroughly documented, so it is clear we're compensating for a bug in the system.

Rafael are you sure that we should copy all section from RE segment?
Below the screenshot of .text and .text.cold sections for one of the binaries which I have

Thanks for explaining, Alexey.

Here's what I understood: from the left side, if we take the last allocated address according to PT_LOAD segments, and then try to put a new segment right after it, aligned to 2MB, old loader somehow ignores p_align and loads that into a 4KB address. But if we give 2MB more space, it does align. This seems quite arbitrary, do you confirm this is what you're seeing and that this is the right fix? If yes, this is odd enough that I think it makes sense for us to put this under a special option -hugify=oldkernel.

Yes, you are right, it's the bug inside kernel loader. Maybe we should exclude the support for old kernel, until someone ask about it?

From the right side, we create multiple ELF sections in the same segment and that crashes the loader. The loader would like to see a single segment aligned at 2MB. Is that it? This also would be better under a special -hugify=oldkernel option.

And then restore the behavior in the hugify library to copy all code instead of just hot code, otherwise we're going to crash if -hugify is used instead of -hugify=oldkernel.

When you guard the code under -hugify=oldkernel, please add a comment linking to this diff for the discussion, so somebody reading the code has the background to understand why this was added in the first place. Weird behaviors such as this need to be thoroughly documented, so it is clear we're compensating for a bug in the system.

Rafael are you sure that we should copy all section from RE segment?
Below the screenshot of .text and .text.cold sections for one of the binaries which I have

Sorry, I wasn't clear. Not all sections, but restore the previous behavior (the behavior that we have right now in trunk), which fills the remaining of the huge page with the start of cold. If we don't do this, we crash. If you remove the right side padding, you will notice the crash.

I'm fine with either (keeping the old kernel support under a flag, or removing it). Btw thanks for your work, I appreciate it.

yavtuk updated this revision to Diff 467373.Oct 13 2022, 12:02 AM
yavtuk updated this revision to Diff 467378.Oct 13 2022, 12:33 AM
yavtuk updated this revision to Diff 467380.Oct 13 2022, 12:43 AM
yavtuk updated this revision to Diff 467382.Oct 13 2022, 12:50 AM

@rafauler hello, sorry about long wait with patch updating, based on our previous discussion I changed -hugify option which one now takes 2 parameters, 5.10 or 4.18.
For old kernel there is extra padding from left and right sides.
Can you look it once again?
Thanks in advance :-)

yavtuk updated this revision to Diff 467401.Oct 13 2022, 1:45 AM

Can you check the failed tests?

Failed Tests (1):

BOLT :: runtime/X86/user-func-reorder.c

For some reason I'm getting "Context not available." when reviewing this diff, which makes it hard to review. Can you re-submit it with full context?

yavtuk updated this revision to Diff 467656.Oct 13 2022, 6:43 PM

For some reason I'm getting "Context not available." when reviewing this diff, which makes it hard to review. Can you re-submit it with full context?

Done

rafauler added inline comments.Oct 14 2022, 4:12 PM
bolt/runtime/hugify.cpp
102–105

I think we can just use the old behavior here (work with two pointers From, To - they're both aligned to the page, we don't ever need to know the original hot sizes in this function. This makes this function more confusing to read.

108

Mistake is probably here, copying less bytes than are actually needed, program will crash.

bolt/test/runtime/X86/user-func-reorder.c
33 ↗(On Diff #467656)

You don't need 4.18 option here. The reason this test is failing is because in 5.10, you're not copying all contents of the page. You can't just copy hot contents. You need to restore the old behavior that will copy the entire page, and not just the first bytes that correspond to hot code.

In other words, as it is, the 5.10 behavior is broken. I left comments in the function where I suspect the problem is from the last time I debugged this.

Also, can we change the flag to be a hidden option? Instead of hugify=oldkernel, having the old -hugify and requiring the user to use an extra hidden flag --extra-hugify-padding

Also, can we change the flag to be a hidden option? Instead of hugify=oldkernel, having the old -hugify and requiring the user to use an extra hidden flag --extra-hugify-padding

Sure, we can, it will be much better for usage, thank for review. I need few time to check this changes on local environment and I upload fixes soon.

yavtuk updated this revision to Diff 468828.Oct 19 2022, 2:37 AM
yavtuk added a comment.EditedOct 19 2022, 2:41 AM

Hello Rafael, I am little bit confused with clang and pie binaries,
I use gcc mostly, my local env:
gcc (GCC) 7.3.0
linux 4.18.0 x86_64

cat test.c

#include <stdio.h>
#include <dlfcn.h>
#include <unistd.h>
#include <stdint.h>
#include "foo.h"

int main(void) {

void (*foo_func)(uint64_t);

void *fd = dlopen("./libfoo.so", RTLD_LAZY);
if (!fd) {
    printf("libfoo.so load failed\n");
    return -1;
}
foo_func = dlsym(fd, "foo");
if (!foo_func) {
    printf("function loading is failed\n");
    return -1;
}

for(size_t i = 0; i < 3; ++i) {
    foo_func(i);
}

dlclose(fd);
return 0;

}


cat foo.c

cat foo.c
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

attribute((constructor)) void foo_init(void) {
printf("foo_init ctr\n");
}

attribute((destructor)) void foo_fini(void) {
printf("foo_init destr\n");
}

void foo(uint64_t num) {
printf("iter %lld\n", num);
}


cat foo.h

#ifndef FOO_H
#define FOO_H

void foo(uint64_t);

#endif //FOO_H


$CC foo.c -O0 -g3 -fPIC -Wl,--emit-relocs -shared -o libfoo.so
$CC test.c -O0 -g3 -fPIE -I. -Wl,-pie -Wl,--emit-relocs -ldl -Wl,-pie -o test_pie

llvm-bolt ./test_pie -instrument -o ./test_pie.inst -instrumentation-file=./test_pie.fdata -instrumentation-no-counters-clear -instrumentation-sleep-time=1

./test_pie.inst

llvm-bolt -hugify ./test_pie -o ./test_pie.hugify -data=./test_pie.fdata

file ./test_pie.hugify
./test_pie.hugify: ELF 64-bit LSB shared object ....

./test_pie.hugify
[hugify] hot start: 55a316b65000
[hugify] hot end: 55a316b65218
[hugify] aligned huge page from: 55a316a00000
[hugify] aligned huge page to: 55a316c00000
[hugify] workaround with memory alignment for kernel < 5.10
[hugify] allocated temporary space: 7f8321362000
foo_init ctr
iter 0
iter 1
iter 2
foo_init destr

As far as you can see the hot start address (0x55a316b65000) is 4KB aligned,
adding extra padding from left and right sides allows remap it correctly in runtime to
0x55a316a00000 address.

That why I need 4 arguments for hugifyForOldKernel(HotStart, HotEnd, AlignedFrom, AlignedTo)

HotStart, HotEnd are used to get real .text section size and copy it to temp memory, 0x218 bytes

AlignedFrom, AlignedTo are used to get the size from new area with 2MB address alignment,
it's needed because OS doesn't take huge page without it

I am just trying to reproduce the same for clang but can't get the output like this
[hugify] hot start: 55a316b65000
[hugify] hot end: 55a316b65218

I get the SEG_MAPPER error for src address during memcpy to temp area.

Can you give me piece of advice how do you get pie binaries using clang?
I would be appreciated for any help, thanks in advance.

I tried to repro your test case and I can't actually get it to work, then I realized that the problem is the -fPIE flag to build hugify.cpp. This is also a good reason why we probably should put a special 'pie' test case here (a new line in user-func-reorder.c that also builds it with -pie). If we had that, we would more easily detect this problem earlier in the review.

Here's the problem. If you are going to build hugify.cpp with -fPIE, you have to use the same trick that @yota9 used here https://github.com/llvm/llvm-project/commit/af58da4ef3fbd766b2c44cfdbdb859a21022d10a (reviewed here https://github.com/facebookincubator/BOLT/pull/192).

When building a PIE object file, the compiler will address data objects through a GOT table, and our rudimentary linker that brings in hugify.o into the final binary is unable to properly handle GOT table creation. So, at least in my environment, when I try to run BOLT with hugify option using as input a PIE binary, I end up with hugify trying to copy the wrong start pointers because we failed to create runtime relocations to fix the addresses in the GOT table.

Hopefully by using the "#pragma GCC visibility push(hidden)" trick, you should easily fix this.

By the way, now that I used the pragma fix, I was able to repro the alignment problem you mention and we are investigating. We believe it is a consequence of using ASLR in the kernel (that the 2MB page aligned is ignored and the kernel uses a 4KB one).

Ok, because of the weird ASLR thing on older kernels, now it is clear that the extra padding is going to be necessary for every PIE binary. So instead of having --hugify-extra-padding, we can kill this flag and always insert the extra padding whenever a PIE binary is being processed with -hugify.

To detect whether the input is PIE or not, use !HasFixedLoadAddress:
https://github.com/llvm/llvm-project/blob/main/bolt/include/bolt/Core/BinaryContext.h#L606

So just replace opts::HugifyExtraPadding with !BC.hasFixedLoadAddress.

So now it's clear the case for hugifyForOldKernel(HotStart, HotEnd, AlignedFrom, AlignedTo) with 4 args. But.. it still should copy the entire page contents instead of just hot code. When you ask the kernel to unmap all these addresses, we need first to copy whatever was there before, even if it was all zeroes (because in the no-PIC case, it won't be just zeroes).

rafauler added inline comments.Oct 25 2022, 6:47 PM
bolt/lib/Rewrite/RewriteInstance.cpp
3690–3691

Another thing is that we should right pad here just for hot section, not for every code section

yavtuk updated this revision to Diff 471090.Oct 27 2022, 2:40 AM

Rafael thanks so much for the tips and the review, checked locally and now everything works as it should be

yavtuk updated this revision to Diff 471123.Oct 27 2022, 4:45 AM
yavtuk updated this revision to Diff 471139.Oct 27 2022, 5:39 AM
yavtuk updated this revision to Diff 471146.Oct 27 2022, 6:09 AM

Rafael I will create new separate test for hugify PIE/non-PIE binaries and upload it tomorrow

Thanks for working on this! Let's sync one last time our understanding of the implementation of hugifyForOldKernel. Sorry if being repetitive, but it is important now to be on the same page regarding what is happening during runtime in both PIC and no-PIC cases. See if you agree with me with respect to the AlignedFrom/AlignedTo/AlignedSize usages in the suggestions, and please point me any issues in my understanding.

If we look at hugifyForOldKernel(), the code suggested here currently copies only a part of the page that is determined by From, To. We now know that "From", because of ASLR ignoring our alignment requirements, may not be aligned. Now suppose it lands in the middle of the page and that "To" (the end of hot code section) lands in the middle of the next page.

2MB huge page virtual memory map:

page1 - 0x400000:
hot start: 0x500000
page2 - 0x600000:
hot end: 0x700000
page 3 - 0x800000

In this case, according to lines 146-149, you will align "hot start" and "hot end" to 0x400000 and 0x800000, respectively, and ask the kernel to unmap these pages. So you will be unmapping 4MB of code. However, the code will be memcpy-ing 2MB of code from 0x500000 to 0x700000, and then copying it back after the kernel successfully mmaps the requested region into two huge pages.

Now, because you inserted extra padding in RewriteInstance:cpp:103 and 308, the fact that you are leaving 1MB before hot start not copied, and 1MB after hot end as well, is not really a problem.

However, in the non-PIC code, we are not inserting any extra padding. After hot end, at address 0x700000, we will have a large amount of code (coming from cold code of hot functions, those that were split). We will also have a bunch of extra code including the hugify runtime library itself, in some cases.

If we memcpy from 0x400000 to 0x800000 instead of the original 500000 to 700000, we will be erring on the safe side by always copying any memory contents that are being essentially erased after you ask the kernel to unmap them. That's why when using this mmap calls, we typically copy all page contents instead of just a subset of the (hot) bytes. It's also safe to reference these memory addresses (from 700000 to 800000) without the risk of segfaulting because BOLT will always pad the last code section in no-PIE -- the padding won't be correct for PIE because ASLR loader will misalign the start, but luckily we are inserting one extra page at the end in these cases, so the addresses from 700000 to 800000 will be filled with zeroes and won't segfault.

What you did in the last iteration was to expand hot_end towards one extra 4KB page, but that is not enough as in line 115 we are asking the kernel to unmap whole 2MB regions of text.

Does that sound reasonable? anything I'm missing?

bolt/lib/Rewrite/RewriteInstance.cpp
102

Also add && opts::Hugify

306

Also add && opts::Hugify

bolt/runtime/hugify.cpp
95–96

here we can pass AlignedFrom and Alignedto only

99

Here use AlignedSize

108

Here use AlignedFrom, AlignedSize

116–122

...because here you are using Aligned Size

145

I'll suggest doing something else in line 97.

yavtuk updated this revision to Diff 471459.Oct 28 2022, 2:46 AM
yavtuk updated this revision to Diff 471489.Oct 28 2022, 4:11 AM
yavtuk updated this revision to Diff 471499.Oct 28 2022, 5:07 AM
yavtuk updated this revision to Diff 471503.Oct 28 2022, 5:17 AM

Thanks for working on this! Let's sync one last time our understanding of the implementation of hugifyForOldKernel. Sorry if being repetitive, but it is important now to be on the same page regarding what is happening during runtime in both PIC and no-PIC cases. See if you agree with me with respect to the AlignedFrom/AlignedTo/AlignedSize usages in the suggestions, and please point me any issues in my understanding.

If we look at hugifyForOldKernel(), the code suggested here currently copies only a part of the page that is determined by From, To. We now know that "From", because of ASLR ignoring our alignment requirements, may not be aligned. Now suppose it lands in the middle of the page and that "To" (the end of hot code section) lands in the middle of the next page.

2MB huge page virtual memory map:

page1 - 0x400000:
hot start: 0x500000
page2 - 0x600000:
hot end: 0x700000
page 3 - 0x800000

In this case, according to lines 146-149, you will align "hot start" and "hot end" to 0x400000 and 0x800000, respectively, and ask the kernel to unmap these pages. So you will be unmapping 4MB of code. However, the code will be memcpy-ing 2MB of code from 0x500000 to 0x700000, and then copying it back after the kernel successfully mmaps the requested region into two huge pages.

Now, because you inserted extra padding in RewriteInstance:cpp:103 and 308, the fact that you are leaving 1MB before hot start not copied, and 1MB after hot end as well, is not really a problem.

However, in the non-PIC code, we are not inserting any extra padding. After hot end, at address 0x700000, we will have a large amount of code (coming from cold code of hot functions, those that were split). We will also have a bunch of extra code including the hugify runtime library itself, in some cases.

If we memcpy from 0x400000 to 0x800000 instead of the original 500000 to 700000, we will be erring on the safe side by always copying any memory contents that are being essentially erased after you ask the kernel to unmap them. That's why when using this mmap calls, we typically copy all page contents instead of just a subset of the (hot) bytes. It's also safe to reference these memory addresses (from 700000 to 800000) without the risk of segfaulting because BOLT will always pad the last code section in no-PIE -- the padding won't be correct for PIE because ASLR loader will misalign the start, but luckily we are inserting one extra page at the end in these cases, so the addresses from 700000 to 800000 will be filled with zeroes and won't segfault.

What you did in the last iteration was to expand hot_end towards one extra 4KB page, but that is not enough as in line 115 we are asking the kernel to unmap whole 2MB regions of text.

Does that sound reasonable? anything I'm missing?

Yes, you are right, thank you for clarifying, part of the problem was with copying to the time area, changed the size at 97 line and removed the redundant function argument. Added simple test for both types of binaries.

Thanks! I think we can simplify this a bit further. Let me know what you think.

bolt/runtime/hugify.cpp
95–96

Remove From

97

Remove line, use AlignedSize only

158

Here pass only From,To

yavtuk updated this revision to Diff 471697.Oct 28 2022, 6:29 PM
yavtuk updated this revision to Diff 471700.Oct 28 2022, 6:49 PM

clang-format

Thanks! I think we can simplify this a bit further. Let me know what you think.

Yes, it's the good suggestion. Also I removed "Aligned" from names.

This revision is now accepted and ready to land.Oct 31 2022, 3:19 PM
This revision was automatically updated to reflect the committed changes.
phosek added a subscriber: phosek.Nov 4 2022, 12:17 PM

This is failing on our builders with the following error:

/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:3: error: unknown type name 'uint8_t'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:24: error: use of undeclared identifier 'uint8_t'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                       ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:33: error: expected expression
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                                ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:35: error: use of undeclared identifier '__hot_start'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                                  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:3: error: unknown type name 'uint8_t'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:22: error: use of undeclared identifier 'uint8_t'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                     ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:31: error: expected expression
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                              ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:33: error: use of undeclared identifier '__hot_end'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                                ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:145:9: error: unknown type name 'size_t'
  const size_t HugePageBytes = 2L * 1024 * 1024;
        ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:3: error: unknown type name 'uint8_t'
  uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1));
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:32: error: use of undeclared identifier 'intptr_t'
  uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1));
                               ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:147:3: error: unknown type name 'uint8_t'
  uint8_t *To = HotEnd + (HugePageBytes - 1);
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:148:10: error: use of undeclared identifier 'intptr_t'
  To -= (intptr_t)To & (HugePageBytes - 1);
         ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:150:47: error: use of undeclared identifier 'uint64_t'
  DEBUG(reportNumber("[hugify] hot start: ", (uint64_t)HotStart, 16);)
                                              ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:172:24: error: expected string literal in 'asm'
  __asm__ __volatile__(SAVE_ALL "call __bolt_hugify_self_impl\n" RESTORE_ALL
                       ^
15 errors generated.

Looks like hugify.cpp is missing cstdint include? Would it be possible to take a look?

This is failing on our builders with the following error:

/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:3: error: unknown type name 'uint8_t'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:24: error: use of undeclared identifier 'uint8_t'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                       ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:33: error: expected expression
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                                ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:35: error: use of undeclared identifier '__hot_start'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                                  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:3: error: unknown type name 'uint8_t'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:22: error: use of undeclared identifier 'uint8_t'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                     ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:31: error: expected expression
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                              ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:33: error: use of undeclared identifier '__hot_end'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                                ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:145:9: error: unknown type name 'size_t'
  const size_t HugePageBytes = 2L * 1024 * 1024;
        ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:3: error: unknown type name 'uint8_t'
  uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1));
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:32: error: use of undeclared identifier 'intptr_t'
  uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1));
                               ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:147:3: error: unknown type name 'uint8_t'
  uint8_t *To = HotEnd + (HugePageBytes - 1);
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:148:10: error: use of undeclared identifier 'intptr_t'
  To -= (intptr_t)To & (HugePageBytes - 1);
         ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:150:47: error: use of undeclared identifier 'uint64_t'
  DEBUG(reportNumber("[hugify] hot start: ", (uint64_t)HotStart, 16);)
                                              ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:172:24: error: expected string literal in 'asm'
  __asm__ __volatile__(SAVE_ALL "call __bolt_hugify_self_impl\n" RESTORE_ALL
                       ^
15 errors generated.

Looks like hugify.cpp is missing cstdint include? Would it be possible to take a look?

I can't repro. It's probably these wonderful/unreadable ifdefs we have here, because we do include cstdint in common.h. What is the system your builder is running?

Reproduced. Setting APPLE gives me the warnings your builders are seeing.

@phosek landed a fix to trunk. Let me know if it doesn't fix your builders.

yota9 added inline comments.Nov 4 2022, 1:22 PM
bolt/runtime/CMakeLists.txt
31

As for hugify it is OK to use fPIE the rest of the libs must use fPIC flag. I suggest to use fPIC here, there should be no real difference

yavtuk added a comment.Nov 7 2022, 4:24 AM

@phosek landed a fix to trunk. Let me know if it doesn't fix your builders.

@rafauler Thank you so much for the quick fix