This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

-
bolt/
-
include/bolt/
-
bolt/
-
Passes/
1/1
Hugify.h
-
RuntimeLibs/
-
HugifyRuntimeLibrary.h
-
Utils/
-
CommandLineOpts.h
-
lib/
-
Passes/
-
CMakeLists.txt
3/4
Hugify.cpp
-
Rewrite/
-
BinaryPassManager.cpp
2/6
RewriteInstance.cpp
-
RuntimeLibs/
2/2
HugifyRuntimeLibrary.cpp
-
runtime/
1
CMakeLists.txt
3/5
common.h
9/19
hugify.cpp

Differential D129107

[BOLT][HUGIFY] adds huge pages support of PIE/no-PIE binaries
ClosedPublic

Authored by yavtuk on Jul 5 2022, 12:57 AM.

Details

Reviewers

Amir
rafaelauler
rafauler
maksfb

Commits

rG1fb186198af5: adds huge pages support of PIE/no-PIE binaries

Summary

This patch adds huge pages support (-hugify) for PIE/no-PIE
binaries. It also restores support for kernels < 5.10,
where the dynamic loader has a problem with the alignment of
page addresses.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

yavtuk created this revision.Jul 5 2022, 12:57 AM

Herald added a reviewer: rafauler. · View Herald TranscriptJul 5 2022, 12:57 AM

Herald added a reviewer: maksfb. · View Herald Transcript

Herald added a project: Restricted Project. · View Herald Transcript

Herald added subscribers: ayermolo, mgorny. · View Herald Transcript

yavtuk requested review of this revision.Jul 5 2022, 12:57 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2022, 12:57 AM

Herald added subscribers: llvm-commits, yota9. · View Herald Transcript

Harbormaster completed remote builds in B173630: Diff 442193.Jul 5 2022, 1:03 AM

upload small fixes based on https://reviews.llvm.org/D129168 commit

Harbormaster completed remote builds in B174067: Diff 442794.Jul 6 2022, 11:55 PM

yavtuk added a parent revision: D129168: [BOLT] Add runtime functions required by freestanding environment.Jul 6 2022, 11:57 PM

Full diff has been uploaded

Harbormaster completed remote builds in B174077: Diff 442809.Jul 7 2022, 12:56 AM

yavtuk updated this revision to Diff 442876.Jul 7 2022, 5:58 AM

Harbormaster completed remote builds in B174130: Diff 442876.Jul 7 2022, 6:05 AM

yavtuk updated this revision to Diff 442897.Jul 7 2022, 6:52 AM

Harbormaster completed remote builds in B174148: Diff 442897.Jul 7 2022, 6:58 AM

yavtuk edited parent revisions, added: D129321: [BOLT][Runtime] Fix memset definition; removed: D129168: [BOLT] Add runtime functions required by freestanding environment.Jul 8 2022, 3:36 AM

Thanks for working on improving hugify!

bolt/include/bolt/Passes/Hugify.h
4–7	This is the correct license
bolt/lib/Passes/Hugify.cpp
3–7	Outdated license
56	update name here
bolt/lib/RuntimeLibs/HugifyRuntimeLibrary.cpp
62–63	I think we can delete all of this, right? emitBinary() for HugifyRuntimeLibrary doesn't need to do anything if you are already creating a new function in a new pass that will be called from the runtime. And then go to HugifyRuntimeLibrary.h and delete SecNames.push_back(".bolt.hugify.entries");
bolt/runtime/common.h
437	capitalize Buf to follow LLVM coding style
545–546	capitalize Option, Arg2, etc. to follow LLVM coding style
bolt/runtime/hugify.cpp
1–6	This is the wrong license. The LLVM license has been updated to Apache v2.0 with LLVM Exceptions (the one that was previously in the header). Could you revert this change?
19	Can you remove "_ptr" from the name, since it is not a function pointer anymore? I suggest naming "__bolt_hugify_start_program" (but it's just a suggestion)
26–52	Move to common.h, close to "hexToLong", and capitalize variables to follow LLVM style (see the hexToLong example).
45	getKernelVersion(Val) to follow LLVM coding style (I realize this file doesn't follow LLVM style, but it is small enough that we can just update it)
52–53	hasPagecacheTHPSupport
53–57	Buf
59	FD
69–74	struct KernelVersionTy { uint32_t major; uint32_t minor; uint32_t release; }; KernelVersionTy KernelVersion;
86	hugifyForOldKernel(From, To ..
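The kernel-version suggestions above (a KernelVersionTy struct plus a getKernelVersion helper) could look roughly like the sketch below. It is only an illustration: parseKernelVersion and isAtLeast are hypothetical names, the field names are capitalized here as an assumption in line with the LLVM-style requests, and the real runtime is freestanding, so it would not use the hosted headers shown here.

```cpp
#include <cstdint>

// Assumed layout, modeled on the struct suggested in the inline comment.
struct KernelVersionTy {
  std::uint32_t Major;
  std::uint32_t Minor;
  std::uint32_t Release;
};

// Hypothetical helper: parses the leading "major.minor.release" part of a
// kernel release string such as "4.18.0-147.el8". Parsing stops at the first
// character that is neither a digit nor a dot.
KernelVersionTy parseKernelVersion(const char *Str) {
  std::uint32_t Fields[3] = {0, 0, 0};
  int Idx = 0;
  for (const char *P = Str; *P && Idx < 3; ++P) {
    if (*P >= '0' && *P <= '9')
      Fields[Idx] = Fields[Idx] * 10 + static_cast<std::uint32_t>(*P - '0');
    else if (*P == '.')
      ++Idx;
    else
      break; // distro suffix, e.g. "-generic"
  }
  return {Fields[0], Fields[1], Fields[2]};
}

// Hypothetical helper: true if V is at least Major.Minor.
bool isAtLeast(const KernelVersionTy &V, std::uint32_t Major,
               std::uint32_t Minor) {
  return V.Major > Major || (V.Major == Major && V.Minor >= Minor);
}
```

With this shape, the "kernel < 5.10" workaround discussed in this review would be selected by a check such as !isAtLeast(Version, 5, 10).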

Hello @rafaelauler, thanks for the comments. I'll add the fixes soon.

yavtuk updated this revision to Diff 445735.Jul 19 2022, 1:57 AM

yavtuk updated this revision to Diff 445737.Jul 19 2022, 2:03 AM

yavtuk updated this revision to Diff 445742.Jul 19 2022, 2:07 AM

Harbormaster completed remote builds in B176199: Diff 445742.Jul 19 2022, 2:26 AM

yavtuk updated this revision to Diff 445778.Jul 19 2022, 4:55 AM

yavtuk marked 15 inline comments as done.

Harbormaster completed remote builds in B176226: Diff 445778.Jul 19 2022, 5:02 AM

yavtuk updated this revision to Diff 445784.Jul 19 2022, 5:04 AM

Harbormaster completed remote builds in B176231: Diff 445784.Jul 19 2022, 5:11 AM

yavtuk updated this revision to Diff 446089.Jul 20 2022, 2:33 AM

Harbormaster completed remote builds in B176458: Diff 446089.Jul 20 2022, 2:45 AM

yavtuk added a comment.Jul 20 2022, 5:57 AM

This comment was removed by yavtuk.

bolt/lib/Passes/Hugify.cpp
3–7	updated
bolt/lib/RuntimeLibs/HugifyRuntimeLibrary.cpp
62–63	yes, you are right, thanks

Hi @rafauler, is it possible to get user-func-reorder.c.tmp & user-func-reorder.c.tmp.exe for analysis?

Those should be available in your build folder, here:

$ ninja check-bolt
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp.exe

yavtuk updated this revision to Diff 446362.Jul 20 2022, 11:33 PM

Herald added a subscriber: ormris. · View Herald TranscriptJul 20 2022, 11:33 PM

Harbormaster completed remote builds in B176663: Diff 446362.Jul 20 2022, 11:40 PM

yavtuk updated this revision to Diff 446363.Jul 20 2022, 11:42 PM

Harbormaster completed remote builds in B176664: Diff 446363.Jul 20 2022, 11:48 PM

In D129107#3667166, @rafauler wrote:

Those should be available in your build folder, here:

$ ninja check-bolt
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp
$ file tools/bolt/test/runtime/X86/Output/user-func-reorder.c.tmp.exe

I have two different environments: the first is Ubuntu with a 5.10 kernel and the second is EulerOS with a 4.19 kernel. The user-func-reorder test passes on both platforms.

removed debugging information from the user-func-reorder test

Harbormaster completed remote builds in B176676: Diff 446381.Jul 21 2022, 1:31 AM

finally I've reproduced the problem

removed -no-pie for the user-func-reorder test; this test uses the --hugify option

Harbormaster completed remote builds in B176713: Diff 446427.Jul 21 2022, 4:31 AM

yavtuk updated this revision to Diff 446436.Jul 21 2022, 4:46 AM

Harbormaster completed remote builds in B176719: Diff 446436.Jul 21 2022, 4:53 AM

added checking of address overlapping for memcpy implementation

Harbormaster completed remote builds in B177332: Diff 447273.Jul 25 2022, 4:36 AM

git-clang-format updates

Harbormaster completed remote builds in B177338: Diff 447280.Jul 25 2022, 5:03 AM

yavtuk updated this revision to Diff 447282.Jul 25 2022, 5:12 AM

Harbormaster completed remote builds in B177339: Diff 447282.Jul 25 2022, 5:19 AM

yavtuk updated this revision to Diff 447289.Jul 25 2022, 5:41 AM

added noinline attribute for memcpy

Harbormaster completed remote builds in B177344: Diff 447289.Jul 25 2022, 5:48 AM

@rafaelauler Hi Rafael, can you take another look? The code style issues are fixed and all tests pass.

Thanks!

bolt/lib/Rewrite/RewriteInstance.cpp
494–496	Why is that needed?
3626–3628	Same
bolt/runtime/common.h
82	Why is that needed?

rafauler added inline comments.Jul 27 2022, 4:28 PM

bolt/lib/Passes/Hugify.cpp
52	add a newline at the end of file

@rafauler Hi Rafael, let me know if you need more details

bolt/lib/Rewrite/RewriteInstance.cpp
494–496	It's needed because of the HUGEPAGE allocation policy and because of a bug on old kernels, where the dynamic loader doesn't take the p_align field into account. The dynamic loader allocates and maps the segments sequentially with 4KB address alignment. To get a HUGEPAGE from the OS, the page address must be 2MB aligned. For that, I add padding on the left and right sides to prevent overlap between segments.
bolt/runtime/common.h
82	Good question :-) The user-func-reorder test fails, and it was hard to reproduce the cause locally since it's related to the compiler.

With this attribute we have the following assembly for memcpy:

.Loop:
...
movzbl (%rsi,%rdi,1),%ecx
mov %cl,(%rax,%rdi,1)
add $0x1,%rdi
cmp %rdi,%r9
jne a004a0 <_fini+0x2c4>
...
mov %r14,%rdi
mov %r15,%rsi
mov %rbx,%rdx
callq .Loop

Copying is performed byte by byte with verification.

Without this attribute I see the following:

.Loop:
...
movzbl 0x0(%r13,%rax,1),%edx
mov %dl,(%rbx,%rax,1)
movzbl 0x1(%r13,%rax,1),%edx
mov %dl,0x1(%rbx,%rax,1)
movzbl 0x2(%r13,%rax,1),%edx
mov %dl,0x2(%rbx,%rax,1)
movzbl 0x3(%r13,%rax,1),%edx
mov %dl,0x3(%rbx,%rax,1)
movzbl 0x4(%r13,%rax,1),%edx
mov %dl,0x4(%rbx,%rax,1)
movzbl 0x5(%r13,%rax,1),%edx
mov %dl,0x5(%rbx,%rax,1)
movzbl 0x6(%r13,%rax,1),%edx
mov %dl,0x6(%rbx,%rax,1)
movzbl 0x7(%r13,%rax,1),%edx
mov %dl,0x7(%rbx,%rax,1)
add $0x8,%rax
cmp %rax,%rcx
jne a007f0 <_fini+0x614>

Copying is performed with unrolling, and the test fails due to overlapping dst and src addresses for a size which is not aligned to 8 bytes.

rafauler added inline comments.Aug 8 2022, 5:09 PM

bolt/lib/Rewrite/RewriteInstance.cpp
494–496	From the left side, this is already aligned via BinaryContext::PageAlign. This is not just setting p_align, but actually setting the start address to be aligned at 2MB boundary. So this line here is inserting an extra empty 2MB page, but I'm not sure I get the reason why.
3626–3628	For the right side, this alignment is accomplished by lines RewriteInstance.cpp:3635 (using this diff's lines), where we pad the end of the code section until it is aligned at 2MB. I understand that other code sections might be allocated to the huge page if we don't have these lines added by this diff, but I'm not sure why is that a problem. If you have space left in a huge page, why wouldn't you put code there?
bolt/runtime/common.h
82	Ok, I debugged user-func-reorder and noticed that this patch is doing something else. Instead of copying the entire contents of the huge page, it is copying only hot functions. I go back to my original point, why do we need to avoid overlapping segments? If you roll back to previous behavior (copying all contents of the page, including cold text segments), you won't need to insert one extra alignment at the end of each code section.

Hi Rafael, let me try to explain it from the loader's point of view on different kernels.
I have 2 kernels: 5.10 and 4.18

"From the left side, this is already aligned via BinaryContext::PageAlign. This is not just setting p_align, but actually setting the start address to be aligned at 2MB boundary. So this line here is inserting an extra empty 2MB page, but I'm not sure I get the reason why."

Here is the debug log from running the application with --hugify
kernel 5.10
./redis-server.gold.pie.bolt.test2
[hugify] hot start: 563c2fe00000
[hugify] hot end: 563c2fe0df97
[hugify] aligned huge page from: 563c2fe00000
[hugify] aligned huge page to: 563c30000000

kernel 4.18
./redis-server.gold.pie.bolt.test2
[hugify] hot start: 5616f85c3000
[hugify] hot end: 5616f85d0f97
[hugify] aligned huge page from: 5616f8400000
[hugify] aligned huge page to: 5616f8600000
[hugify] workaround with memory alignment for kernel < 5.10
[hugify] allocated temporary address: 7f127eefb000
[hugify] allocated aligned size: 200000
[hugify] allocated size: df97

As you can see, the hot start addresses are different.
For 5.10 it is aligned to 2MB and we can just call madvise directly.
But for 4.18 the address is only 4KB aligned, so the OS does not give us a huge page because of the misaligned address.
To fix it, I put one extra page on the left side (between 2 load segments);
this allows me to remap the text section to a 2MB page.
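The address arithmetic behind the two logs can be checked with a short sketch. The helper names (alignDown, alignUp, isHugePageAligned) are illustrative and not taken from the patch; the constants are the values printed in the logs above.

```cpp
#include <cstdint>

constexpr std::uint64_t HugePageBytes = 0x200000; // 2MB

// Round Addr down / up to a multiple of Align (Align must be a power of two).
constexpr std::uint64_t alignDown(std::uint64_t Addr, std::uint64_t Align) {
  return Addr & ~(Align - 1);
}
constexpr std::uint64_t alignUp(std::uint64_t Addr, std::uint64_t Align) {
  return (Addr + Align - 1) & ~(Align - 1);
}
constexpr bool isHugePageAligned(std::uint64_t Addr) {
  return (Addr & (HugePageBytes - 1)) == 0;
}

// Kernel 4.18 log: hot start is only 4KB aligned.
static_assert(!isHugePageAligned(0x5616f85c3000), "hot start is misaligned");
static_assert(alignDown(0x5616f85c3000, HugePageBytes) == 0x5616f8400000,
              "aligned huge page from");
static_assert(alignUp(0x5616f85d0f97, HugePageBytes) == 0x5616f8600000,
              "aligned huge page to");
static_assert(0x5616f8600000 - 0x5616f8400000 == 0x200000,
              "allocated aligned size");
static_assert(0x5616f85d0f97 - 0x5616f85c3000 == 0xdf97, "allocated size");

// Kernel 5.10 log: hot start already sits on a 2MB boundary.
static_assert(isHugePageAligned(0x563c2fe00000), "hot start is aligned");
static_assert(alignUp(0x563c2fe0df97, HugePageBytes) == 0x563c30000000,
              "aligned huge page to");
```

In the 4.18 case the aligned range starts 0x1c3000 bytes below the hot start, so it covers addresses that precede the hot section; the extra page on the left is what keeps that range from reaching into the previous load segment.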

"For the right side, this alignment is accomplished by lines RewriteInstance.cpp:3635 ..."

Yes, you are right, but that is padding after all executable sections.
I am not sure exactly, but it is probably not possible to have 2 RE regions with different page sizes (2MB & 4KB)
inside one huge page; without padding on the right side I get a SEGV_MAPERR error (it can be OS specific).

[25] .bss              NOBITS          0000000000004038 003038 000008 00  WA  0   0  1
[26] .text             PROGBITS        0000000000600000 600000 000085 00  AX  0   0 2097152
[27] .text.injected    PROGBITS        00000000006000c0 6000c0 000005 00  AX  0   0 64
[28] .text.cold        PROGBITS        0000000000600100 600100 0001a9 00  AX  0   0 64
[29] .eh_frame         PROGBITS        0000000000800000 800000 000260 00   A  0   0  8

  --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x600000} ---
  +++ killed by SIGSEGV +++
  Segmentation fault

"bolt/runtime/common.h
Ok, I debugged user-func-reorder and noticed that this patch is doing something else. Instead of copying the entire contents of the huge page, ..."

I think copying cold text and getting a huge page for it is redundant functionality.

Thanks for explaining, Alexey.

Here's what I understood: from the left side, if we take the last allocated address according to PT_LOAD segments, and then try to put a new segment right after it, aligned to 2MB, old loader somehow ignores p_align and loads that into a 4KB address. But if we give 2MB more space, it does align. This seems quite arbitrary, do you confirm this is what you're seeing and that this is the right fix? If yes, this is odd enough that I think it makes sense for us to put this under a special option -hugify=oldkernel.

From the right side, we create multiple ELF sections in the same segment and that crashes the loader. The loader would like to see a single segment aligned at 2MB. Is that it? This also would be better under a special -hugify=oldkernel option.

And then restore the behavior in the hugify library to copy all code instead of just hot code, otherwise we're going to crash if -hugify is used instead of -hugify=oldkernel.

When you guard the code under -hugify=oldkernel, please add a comment linking to this diff for the discussion, so somebody reading the code has the background to understand why this was added in the first place. Weird behaviors such as this need to be thoroughly documented, so it is clear we're compensating for a bug in the system.

In D129107#3730258, @rafauler wrote:

Thanks for explaining, Alexey.

Here's what I understood: from the left side, if we take the last allocated address according to PT_LOAD segments, and then try to put a new segment right after it, aligned to 2MB, old loader somehow ignores p_align and loads that into a 4KB address. But if we give 2MB more space, it does align. This seems quite arbitrary, do you confirm this is what you're seeing and that this is the right fix? If yes, this is odd enough that I think it makes sense for us to put this under a special option -hugify=oldkernel.

Yes, you are right, it's a bug inside the kernel loader. Maybe we should drop the support for old kernels until someone asks for it?

From the right side, we create multiple ELF sections in the same segment and that crashes the loader. The loader would like to see a single segment aligned at 2MB. Is that it? This also would be better under a special -hugify=oldkernel option.

And then restore the behavior in the hugify library to copy all code instead of just hot code, otherwise we're going to crash if -hugify is used instead of -hugify=oldkernel.

When you guard the code under -hugify=oldkernel, please add a comment linking to this diff for the discussion, so somebody reading the code has the background to understand why this was added in the first place. Weird behaviors such as this need to be thoroughly documented, so it is clear we're compensating for a bug in the system.

Rafael, are you sure that we should copy all sections from the RE segment?
Below is a screenshot of the .text and .text.cold sections for one of the binaries I have

Sorry, I wasn't clear. Not all sections, but restore the previous behavior (the behavior that we have right now in trunk), which fills the remaining of the huge page with the start of cold. If we don't do this, we crash. If you remove the right side padding, you will notice the crash.

I'm fine with either (keeping the old kernel support under a flag, or removing it). Btw thanks for your work, I appreciate it.

yavtuk updated this revision to Diff 467373.Oct 13 2022, 12:02 AM

Herald added a subscriber: treapster. · View Herald TranscriptOct 13 2022, 12:02 AM

Harbormaster completed remote builds in B191896: Diff 467373.Oct 13 2022, 12:02 AM

yavtuk updated this revision to Diff 467378.Oct 13 2022, 12:33 AM

Harbormaster completed remote builds in B191900: Diff 467378.Oct 13 2022, 12:33 AM

yavtuk updated this revision to Diff 467380.Oct 13 2022, 12:43 AM

Harbormaster completed remote builds in B191902: Diff 467380.Oct 13 2022, 12:44 AM

yavtuk updated this revision to Diff 467382.Oct 13 2022, 12:50 AM

Harbormaster completed remote builds in B191905: Diff 467382.Oct 13 2022, 12:50 AM

@rafauler hello, sorry for the long wait on the patch update. Based on our previous discussion I changed the -hugify option, which now takes one of 2 values, 5.10 or 4.18.
For the old kernel there is extra padding on the left and right sides.
Can you look at it once again?
Thanks in advance :-)

yavtuk removed a parent revision: D129321: [BOLT][Runtime] Fix memset definition.Oct 13 2022, 1:16 AM

yavtuk updated this revision to Diff 467401.Oct 13 2022, 1:45 AM

Harbormaster completed remote builds in B191918: Diff 467401.Oct 13 2022, 1:52 AM

Can you check the failed tests?

Failed Tests (1):

BOLT :: runtime/X86/user-func-reorder.c

For some reason I'm getting "Context not available." when reviewing this diff, which makes it hard to review. Can you re-submit it with full context?

yavtuk updated this revision to Diff 467656.Oct 13 2022, 6:43 PM

Harbormaster completed remote builds in B192097: Diff 467656.Oct 13 2022, 6:50 PM

In D129107#3857321, @rafauler wrote:

For some reason I'm getting "Context not available." when reviewing this diff, which makes it hard to review. Can you re-submit it with full context?

Done

rafauler added inline comments.Oct 14 2022, 4:12 PM

bolt/runtime/hugify.cpp
88–102	I think we can just use the old behavior here (work with two pointers From, To). They're both aligned to the page, so we don't ever need to know the original hot sizes in this function. Passing them makes this function more confusing to read.
110	The mistake is probably here: copying fewer bytes than are actually needed, so the program will crash.
bolt/test/runtime/X86/user-func-reorder.c
33 ↗	(On Diff #467656)	You don't need 4.18 option here. The reason this test is failing is because in 5.10, you're not copying all contents of the page. You can't just copy hot contents. You need to restore the old behavior that will copy the entire page, and not just the first bytes that correspond to hot code. In other words, as it is, the 5.10 behavior is broken. I left comments in the function where I suspect the problem is from the last time I debugged this.

Also, can we change the flag to be a hidden option? Instead of hugify=oldkernel, having the old -hugify and requiring the user to use an extra hidden flag --extra-hugify-padding

In D129107#3860016, @rafauler wrote:

Also, can we change the flag to be a hidden option? Instead of hugify=oldkernel, having the old -hugify and requiring the user to use an extra hidden flag --extra-hugify-padding

Sure, we can, it will be much better for usage. Thanks for the review. I need a little time to check these changes in my local environment and I'll upload the fixes soon.

yavtuk updated this revision to Diff 468828.Oct 19 2022, 2:37 AM

Hello Rafael, I am a little bit confused with clang and PIE binaries.
I mostly use gcc; my local env:
gcc (GCC) 7.3.0
linux 4.18.0 x86_64

cat test.c

#include <stdio.h>
#include <dlfcn.h>
#include <unistd.h>
#include <stdint.h>
#include "foo.h"

int main(void) {
    /* Resolved at runtime from libfoo.so. */
    void (*foo_func)(uint64_t);

    void *fd = dlopen("./libfoo.so", RTLD_LAZY);
    if (!fd) {
        printf("libfoo.so load failed\n");
        return -1;
    }
    foo_func = dlsym(fd, "foo");
    if (!foo_func) {
        printf("function loading is failed\n");
        return -1;
    }

    for (size_t i = 0; i < 3; ++i) {
        foo_func(i);
    }

    dlclose(fd);
    return 0;
}

cat foo.c

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

__attribute__((constructor)) void foo_init(void) {
    printf("foo_init ctr\n");
}

__attribute__((destructor)) void foo_fini(void) {
    printf("foo_init destr\n");
}

void foo(uint64_t num) {
    /* Cast so the format specifier matches on every platform. */
    printf("iter %llu\n", (unsigned long long)num);
}

cat foo.h

#ifndef FOO_H
#define FOO_H

#include <stdint.h>

void foo(uint64_t);

#endif //FOO_H

$CC foo.c -O0 -g3 -fPIC -Wl,--emit-relocs -shared -o libfoo.so
$CC test.c -O0 -g3 -fPIE -I. -Wl,-pie -Wl,--emit-relocs -ldl -Wl,-pie -o test_pie

llvm-bolt ./test_pie -instrument -o ./test_pie.inst -instrumentation-file=./test_pie.fdata -instrumentation-no-counters-clear -instrumentation-sleep-time=1

./test_pie.inst

llvm-bolt -hugify ./test_pie -o ./test_pie.hugify -data=./test_pie.fdata

file ./test_pie.hugify
./test_pie.hugify: ELF 64-bit LSB shared object ....

./test_pie.hugify
[hugify] hot start: 55a316b65000
[hugify] hot end: 55a316b65218
[hugify] aligned huge page from: 55a316a00000
[hugify] aligned huge page to: 55a316c00000
[hugify] workaround with memory alignment for kernel < 5.10
[hugify] allocated temporary space: 7f8321362000
foo_init ctr
iter 0
iter 1
iter 2
foo_init destr

As you can see, the hot start address (0x55a316b65000) is 4KB aligned;
adding extra padding on the left and right sides allows it to be remapped correctly at runtime to
the 0x55a316a00000 address.

That's why I need 4 arguments for hugifyForOldKernel(HotStart, HotEnd, AlignedFrom, AlignedTo).

HotStart and HotEnd are used to get the real .text section size and copy it to temp memory, 0x218 bytes.

AlignedFrom and AlignedTo are used to get the size of the new area with 2MB address alignment;
it's needed because the OS doesn't provide a huge page without it.

I am trying to reproduce the same with clang but can't get output like this:
[hugify] hot start: 55a316b65000
[hugify] hot end: 55a316b65218

I get a SEGV_MAPERR error for the src address during memcpy to the temp area.

Can you give me some advice on how you get PIE binaries using clang?
I would appreciate any help, thanks in advance.

Harbormaster completed remote builds in B192945: Diff 468828.Oct 19 2022, 2:48 AM

I tried to repro your test case and I can't actually get it to work, then I realized that the problem is the -fPIE flag to build hugify.cpp. This is also a good reason why we probably should put a special 'pie' test case here (a new line in user-func-reorder.c that also builds it with -pie). If we had that, we would more easily detect this problem earlier in the review.

Here's the problem. If you are going to build hugify.cpp with -fPIE, you have to use the same trick that @yota9 used here https://github.com/llvm/llvm-project/commit/af58da4ef3fbd766b2c44cfdbdb859a21022d10a (reviewed here https://github.com/facebookincubator/BOLT/pull/192).

When building a PIE object file, the compiler will address data objects through a GOT table, and our rudimentary linker that brings in hugify.o into the final binary is unable to properly handle GOT table creation. So, at least in my environment, when I try to run BOLT with hugify option using as input a PIE binary, I end up with hugify trying to copy the wrong start pointers because we failed to create runtime relocations to fix the addresses in the GOT table.

Hopefully by using the "#pragma GCC visibility push(hidden)" trick, you should easily fix this.
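As a rough illustration of the pragma trick, the sketch below wraps a couple of data symbols in a hidden-visibility region. The symbol names and the demoHotSize helper are hypothetical stand-ins, defined locally so the snippet links on its own; in the real runtime such symbols would be declared extern and supplied by BOLT. The values are borrowed from the .text entry in the section listing earlier in this thread.

```cpp
#include <cstdint>

// Declarations between push and pop get hidden visibility. A -fPIE/-fPIC
// compile can then typically reference them PC-relatively instead of going
// through a GOT entry that would need a runtime relocation.
#pragma GCC visibility push(hidden)

extern "C" {
// Hypothetical stand-ins for the hot-section boundary symbols.
std::uint64_t demo_hot_start = 0x600000;
std::uint64_t demo_hot_end = 0x600085;
}

// Hypothetical helper that reads both symbols.
std::uint64_t demoHotSize() { return demo_hot_end - demo_hot_start; }

#pragma GCC visibility pop
```

The pragma is a GCC extension that Clang also accepts; it only changes symbol visibility, so the code between push and pop otherwise compiles as usual.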

By the way, now that I used the pragma fix, I was able to repro the alignment problem you mention and we are investigating. We believe it is a consequence of using ASLR in the kernel (that the 2MB page aligned is ignored and the kernel uses a 4KB one).

Ok, because of the weird ASLR thing on older kernels, now it is clear that the extra padding is going to be necessary for every PIE binary. So instead of having --hugify-extra-padding, we can kill this flag and always insert the extra padding whenever a PIE binary is being processed with -hugify.

To detect whether the input is PIE or not, use !HasFixedLoadAddress:
https://github.com/llvm/llvm-project/blob/main/bolt/include/bolt/Core/BinaryContext.h#L606

So just replace opts::HugifyExtraPadding with !BC.HasFixedLoadAddress.

So now it's clear the case for hugifyForOldKernel(HotStart, HotEnd, AlignedFrom, AlignedTo) with 4 args. But.. it still should copy the entire page contents instead of just hot code. When you ask the kernel to unmap all these addresses, we need first to copy whatever was there before, even if it was all zeroes (because in the no-PIC case, it won't be just zeroes).

rafauler added inline comments.Oct 25 2022, 6:47 PM

bolt/lib/Rewrite/RewriteInstance.cpp
3690–3691	Another thing is that we should right pad here just for hot section, not for every code section

yavtuk updated this revision to Diff 471090.Oct 27 2022, 2:40 AM

Rafael, thanks so much for the tips and the review. I checked locally and now everything works as it should.

Harbormaster completed remote builds in B194602: Diff 471090.Oct 27 2022, 3:11 AM

yavtuk updated this revision to Diff 471123.Oct 27 2022, 4:45 AM

Harbormaster completed remote builds in B194627: Diff 471123.Oct 27 2022, 5:04 AM

yavtuk updated this revision to Diff 471139.Oct 27 2022, 5:39 AM

Harbormaster completed remote builds in B194636: Diff 471139.Oct 27 2022, 5:46 AM

yavtuk updated this revision to Diff 471146.Oct 27 2022, 6:09 AM

Harbormaster completed remote builds in B194643: Diff 471146.Oct 27 2022, 6:16 AM

Rafael, I will create a new separate test for hugify PIE/non-PIE binaries and upload it tomorrow.

Thanks for working on this! Let's sync one last time our understanding of the implementation of hugifyForOldKernel. Sorry if being repetitive, but it is important now to be on the same page regarding what is happening during runtime in both PIC and no-PIC cases. See if you agree with me with respect to the AlignedFrom/AlignedTo/AlignedSize usages in the suggestions, and please point me any issues in my understanding.

If we look at hugifyForOldKernel(), the code suggested here currently copies only a part of the page that is determined by From, To. We now know that "From", because of ASLR ignoring our alignment requirements, may not be aligned. Now suppose it lands in the middle of the page and that "To" (the end of hot code section) lands in the middle of the next page.

2MB huge page virtual memory map:

page1 - 0x400000:
hot start: 0x500000
page2 - 0x600000:
hot end: 0x700000
page 3 - 0x800000

In this case, according to lines 146-149, you will align "hot start" and "hot end" to 0x400000 and 0x800000, respectively, and ask the kernel to unmap these pages. So you will be unmapping 4MB of code. However, the code will be memcpy-ing 2MB of code from 0x500000 to 0x700000, and then copying it back after the kernel successfully mmaps the requested region into two huge pages.

Now, because you inserted extra padding in RewriteInstance.cpp:103 and 308, the fact that you are leaving 1MB before hot start not copied, and 1MB after hot end as well, is not really a problem.

However, in the non-PIC code, we are not inserting any extra padding. After hot end, at address 0x700000, we will have a large amount of code (coming from cold code of hot functions, those that were split). We will also have a bunch of extra code including the hugify runtime library itself, in some cases.

If we memcpy from 0x400000 to 0x800000 instead of the original 0x500000 to 0x700000, we will be erring on the safe side by always copying any memory contents that are being essentially erased after you ask the kernel to unmap them. That's why when using these mmap calls, we typically copy all page contents instead of just a subset of the (hot) bytes. It's also safe to reference these memory addresses (from 0x700000 to 0x800000) without the risk of segfaulting because BOLT will always pad the last code section in no-PIE -- the padding won't be correct for PIE because ASLR loader will misalign the start, but luckily we are inserting one extra page at the end in these cases, so the addresses from 0x700000 to 0x800000 will be filled with zeroes and won't segfault.

What you did in the last iteration was to expand hot_end towards one extra 4KB page, but that is not enough as in line 115 we are asking the kernel to unmap whole 2MB regions of text.

Does that sound reasonable? anything I'm missing?
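The "copy the whole aligned range, then remap" sequence described above can be sketched on an ordinary data buffer. This is a Linux-only illustration under stated assumptions: remapPreservingContents and demoRoundTrip are hypothetical names, the code uses the hosted mmap/madvise/mprotect calls rather than the runtime's own wrappers, and it must not be pointed at the memory holding the code that is currently executing.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

constexpr std::size_t HugePageBytes = 0x200000; // 2MB

// Hypothetical helper: replaces [AlignedFrom, AlignedFrom + AlignedSize) with
// fresh anonymous memory while preserving every byte of the range.
bool remapPreservingContents(void *AlignedFrom, std::size_t AlignedSize,
                             int Prot) {
  // 1. Save the entire aligned range, not just the hot bytes.
  void *Tmp = mmap(nullptr, AlignedSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (Tmp == MAP_FAILED)
    return false;
  std::memcpy(Tmp, AlignedFrom, AlignedSize);

  // 2. Map new anonymous memory over the same addresses.
  void *Mem = mmap(AlignedFrom, AlignedSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  if (Mem == MAP_FAILED) {
    munmap(Tmp, AlignedSize);
    return false;
  }

  // 3. Ask for huge pages. This is only a hint, so a failure is ignored.
  (void)madvise(AlignedFrom, AlignedSize, MADV_HUGEPAGE);

  // 4. Restore the saved contents and the final protection.
  std::memcpy(AlignedFrom, Tmp, AlignedSize);
  const bool Ok = mprotect(AlignedFrom, AlignedSize, Prot) == 0;
  munmap(Tmp, AlignedSize);
  return Ok;
}

// Demo: remap one 2MB-aligned window of a scratch buffer and verify that the
// contents survive the round trip.
bool demoRoundTrip() {
  const std::size_t Reserve = 3 * HugePageBytes; // room for an aligned window
  void *Raw = mmap(nullptr, Reserve, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (Raw == MAP_FAILED)
    return false;
  const std::uintptr_t Base =
      (reinterpret_cast<std::uintptr_t>(Raw) + HugePageBytes - 1) &
      ~(static_cast<std::uintptr_t>(HugePageBytes) - 1);
  auto *Window = reinterpret_cast<unsigned char *>(Base);
  for (std::size_t I = 0; I < HugePageBytes; ++I)
    Window[I] = static_cast<unsigned char>(I % 251);

  bool Ok = remapPreservingContents(Window, HugePageBytes,
                                    PROT_READ | PROT_WRITE);
  for (std::size_t I = 0; Ok && I < HugePageBytes; ++I)
    Ok = Window[I] == static_cast<unsigned char>(I % 251);
  munmap(Raw, Reserve);
  return Ok;
}
```

Step 1 is the point of the discussion: once step 2 runs, anything in the range that was not saved is gone, which is why saving only the bytes between hot start and hot end is not enough when other code shares the aligned pages.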

bolt/lib/Rewrite/RewriteInstance.cpp
102	Also add && opts::Hugify
306	Also add && opts::Hugify
bolt/runtime/hugify.cpp
75	here we can pass AlignedFrom and AlignedTo only
78	Here use AlignedSize
110	Here use AlignedFrom, AlignedSize
114–120	...because here you are using AlignedSize
145	I'll suggest doing something else in line 97.

yavtuk updated this revision to Diff 471459.Oct 28 2022, 2:46 AM

Harbormaster completed remote builds in B194876: Diff 471459.Oct 28 2022, 3:01 AM

yavtuk updated this revision to Diff 471489.Oct 28 2022, 4:11 AM

Harbormaster completed remote builds in B194894: Diff 471489.Oct 28 2022, 4:19 AM

yavtuk updated this revision to Diff 471499.Oct 28 2022, 5:07 AM

Harbormaster completed remote builds in B194904: Diff 471499.Oct 28 2022, 5:13 AM

yavtuk updated this revision to Diff 471503.Oct 28 2022, 5:17 AM

Harbormaster completed remote builds in B194906: Diff 471503.Oct 28 2022, 5:24 AM

In D129107#3889977, @rafauler wrote:

Thanks for working on this! Let's sync one last time our understanding of the implementation of hugifyForOldKernel. Sorry if being repetitive, but it is important now to be on the same page regarding what is happening during runtime in both PIC and no-PIC cases. See if you agree with me with respect to the AlignedFrom/AlignedTo/AlignedSize usages in the suggestions, and please point me any issues in my understanding.

If we look at hugifyForOldKernel(), the code suggested here currently copies only a part of the page that is determined by From, To. We now know that "From", because of ASLR ignoring our alignment requirements, may not be aligned. Now suppose it lands in the middle of the page and that "To" (the end of hot code section) lands in the middle of the next page.

2MB huge page virtual memory map:

page1 - 0x400000:
hot start: 0x500000
page2 - 0x600000:
hot end: 0x700000
page 3 - 0x800000

In this case, according to lines 146-149, you will align "hot start" and "hot end" to 0x400000 and 0x800000, respectively, and ask the kernel to unmap these pages. So you will be unmapping 4MB of code. However, the code will be memcpy-ing 2MB of code from 0x500000 to 0x700000, and then copying it back after the kernel successfully mmaps the requested region into two huge pages.

Now, because you inserted extra padding in RewriteInstance.cpp:103 and 308, the fact that you are leaving 1MB before hot start not copied, and 1MB after hot end as well, is not really a problem.

However, in the non-PIC code, we are not inserting any extra padding. After hot end, at address 0x700000, we will have a large amount of code (coming from cold code of hot functions, those that were split). We will also have a bunch of extra code including the hugify runtime library itself, in some cases.

If we memcpy from 0x400000 to 0x800000 instead of the original 0x500000 to 0x700000, we err on the safe side by always copying any memory contents that are essentially erased once you ask the kernel to unmap them. That's why, when using these mmap calls, we typically copy all page contents instead of just a subset of the (hot) bytes. It's also safe to reference these memory addresses (from 0x700000 to 0x800000) without the risk of segfaulting, because BOLT will always pad the last code section in no-PIE. The padding won't be correct for PIE because the ASLR loader will misalign the start, but luckily we are inserting one extra page at the end in these cases, so the addresses from 0x700000 to 0x800000 will be filled with zeroes and won't segfault.

What you did in the last iteration was to expand hot_end towards one extra 4KB page, but that is not enough as in line 115 we are asking the kernel to unmap whole 2MB regions of text.

Does that sound reasonable? Is there anything I'm missing?

Yes, you are right, thank you for clarifying. Part of the problem was with copying to the temporary area. I changed the size at line 97 and removed the redundant function argument. I also added a simple test for both types of binaries.
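As a toy illustration of why copying only the hot range loses data, the sketch below models the text segment as a small byte vector in which each byte stands for 1MB of the example layout (index 4 = 0x400000, index 8 = 0x800000), and models the remap as zero-filling the aligned range. This is a simulation under those assumptions, not the runtime code; remapWithCopy and makeToySegment are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Memory = std::vector<std::uint8_t>;

// Toy segment, one byte per 1MB of the example layout:
// index 4 = bytes before hot start, 5..6 = hot code,
// index 7 = cold code that follows hot end in the no-PIE case.
Memory makeToySegment() {
  Memory Mem(8, 0x00);
  Mem[4] = 0x11;
  Mem[5] = 0xAA;
  Mem[6] = 0xAA;
  Mem[7] = 0xCC;
  return Mem;
}

// Save [CopyFrom, CopyTo), zero-fill [MapFrom, MapTo) to model the fresh
// mapping returned by the kernel, then copy the saved bytes back in place.
Memory remapWithCopy(Memory Mem, std::size_t MapFrom, std::size_t MapTo,
                     std::size_t CopyFrom, std::size_t CopyTo) {
  assert(MapFrom <= CopyFrom && CopyFrom <= CopyTo && CopyTo <= MapTo &&
         MapTo <= Mem.size());
  const auto At = [&Mem](std::size_t Index) {
    return Mem.begin() + static_cast<std::ptrdiff_t>(Index);
  };
  const Memory Saved(At(CopyFrom), At(CopyTo));
  std::fill(At(MapFrom), At(MapTo), std::uint8_t{0});
  std::copy(Saved.begin(), Saved.end(), At(CopyFrom));
  return Mem;
}
```

With the aligned range 4..8 remapped, copying only 5..7 leaves index 7 (the stand-in for cold code after hot end) zeroed, while copying the full 4..8 range restores the segment unchanged.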

Thanks! I think we can simplify this a bit further. Let me know what you think.

bolt/runtime/hugify.cpp
75	Remove From
77	Remove line, use AlignedSize only
170	Here pass only From,To

yavtuk updated this revision to Diff 471697.Oct 28 2022, 6:29 PM

Harbormaster completed remote builds in B195046: Diff 471697.Oct 28 2022, 6:43 PM

clang-format

In D129107#3892560, @rafauler wrote:

Thanks! I think we can simplify this a bit further. Let me know what you think.

Yes, it's a good suggestion. I also removed "Aligned" from the names.

Harbormaster completed remote builds in B195048: Diff 471700.Oct 28 2022, 6:56 PM

LGTM

This revision is now accepted and ready to land.Oct 31 2022, 3:19 PM

Closed by commit rG1fb186198af5: adds huge pages support of PIE/no-PIE binaries (authored by yavtuk). Nov 4 2022, 5:41 AM

This revision was automatically updated to reflect the committed changes.

yavtuk added a commit: rG1fb186198af5: adds huge pages support of PIE/no-PIE binaries.

This is failing on our builders with the following error:

/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:3: error: unknown type name 'uint8_t'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:24: error: use of undeclared identifier 'uint8_t'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                       ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:33: error: expected expression
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                                ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:142:35: error: use of undeclared identifier '__hot_start'
  uint8_t *HotStart = (uint8_t *)&__hot_start;
                                  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:3: error: unknown type name 'uint8_t'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:22: error: use of undeclared identifier 'uint8_t'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                     ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:31: error: expected expression
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                              ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:143:33: error: use of undeclared identifier '__hot_end'
  uint8_t *HotEnd = (uint8_t *)&__hot_end;
                                ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:145:9: error: unknown type name 'size_t'
  const size_t HugePageBytes = 2L * 1024 * 1024;
        ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:3: error: unknown type name 'uint8_t'
  uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1));
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:146:32: error: use of undeclared identifier 'intptr_t'
  uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1));
                               ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:147:3: error: unknown type name 'uint8_t'
  uint8_t *To = HotEnd + (HugePageBytes - 1);
  ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:148:10: error: use of undeclared identifier 'intptr_t'
  To -= (intptr_t)To & (HugePageBytes - 1);
         ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:150:47: error: use of undeclared identifier 'uint64_t'
  DEBUG(reportNumber("[hugify] hot start: ", (uint64_t)HotStart, 16);)
                                              ^
/opt/s/w/ir/x/w/llvm-llvm-project/bolt/runtime/hugify.cpp:172:24: error: expected string literal in 'asm'
  __asm__ __volatile__(SAVE_ALL "call __bolt_hugify_self_impl\n" RESTORE_ALL
                       ^
15 errors generated.

Looks like hugify.cpp is missing cstdint include? Would it be possible to take a look?

I can't repro. It's probably these wonderful/unreadable ifdefs we have here, because we do include cstdint in common.h. What is the system your builder is running?

Reproduced. Setting APPLE gives me the warnings your builders are seeing.

rafauler mentioned this in rG687ce3dec132: [BOLT][Hugify] Fix apple builds.Nov 4 2022, 1:10 PM

@phosek landed a fix to trunk. Let me know if it doesn't fix your builders.

yota9 added inline comments.Nov 4 2022, 1:22 PM

bolt/runtime/CMakeLists.txt
28	For hugify it is OK to use -fPIE, but the rest of the libs must use the -fPIC flag. I suggest using -fPIC here; there should be no real difference.

In D129107#3909120, @rafauler wrote:

@phosek landed a fix to trunk. Let me know if it doesn't fix your builders.

@rafauler Thank you so much for the quick fix

Revision Contents

Path                                                  Size
bolt/include/bolt/Passes/Hugify.h                     29 lines
bolt/include/bolt/RuntimeLibs/HugifyRuntimeLibrary.h  6 lines
bolt/include/bolt/Utils/CommandLineOpts.h             1 line
bolt/lib/Passes/CMakeLists.txt                        1 line
bolt/lib/Passes/Hugify.cpp                            51 lines
bolt/lib/Rewrite/BinaryPassManager.cpp                3 lines
bolt/lib/Rewrite/RewriteInstance.cpp                  8 lines
bolt/lib/RuntimeLibs/HugifyRuntimeLibrary.cpp         29 lines
bolt/runtime/CMakeLists.txt                           5 lines
bolt/runtime/common.h                                 78 lines
bolt/runtime/hugify.cpp                               177 lines

Diff 447273

bolt/include/bolt/Passes/Hugify.h

This file was added.

				//===- bolt/Passes/Hugify.h -------------------------------------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				rafauler: This is the correct license

				#ifndef BOLT_PASSES_HUGIFY_H
				#define BOLT_PASSES_HUGIFY_H

				#include "bolt/Passes/BinaryPasses.h"

				namespace llvm {
				namespace bolt {

				class HugePage : public BinaryFunctionPass {
				public:
				HugePage(const cl::opt<bool> &PrintPass) : BinaryFunctionPass(PrintPass) {}

				void runOnFunctions(BinaryContext &BC) override;

				const char *getName() const override { return "HugePage"; }
				};

				} // namespace bolt
				} // namespace llvm

				#endif

bolt/include/bolt/RuntimeLibs/HugifyRuntimeLibrary.h

	Show All 16 Lines

	namespace llvm {			namespace llvm {
	namespace bolt {			namespace bolt {

	class HugifyRuntimeLibrary : public RuntimeLibrary {			class HugifyRuntimeLibrary : public RuntimeLibrary {
	public:			public:
	/// Add custom section names generated by the runtime libraries to \p			/// Add custom section names generated by the runtime libraries to \p
	/// SecNames.			/// SecNames.
	void addRuntimeLibSections(std::vector<std::string> &SecNames) const final {			void addRuntimeLibSections(std::vector<std::string> &SecNames) const final {}
	SecNames.push_back(".bolt.hugify.entries");
	}

	void adjustCommandLineOptions(const BinaryContext &BC) const final;			void adjustCommandLineOptions(const BinaryContext &BC) const final;

	void emitBinary(BinaryContext &BC, MCStreamer &Streamer) final;			void emitBinary(BinaryContext &BC, MCStreamer &Streamer) final {}

	void link(BinaryContext &BC, StringRef ToolPath, RuntimeDyld &RTDyld,			void link(BinaryContext &BC, StringRef ToolPath, RuntimeDyld &RTDyld,
	std::function<void(RuntimeDyld &)> OnLoad) final;			std::function<void(RuntimeDyld &)> OnLoad) final;
	};			};

	} // namespace bolt			} // namespace bolt
	} // namespace llvm			} // namespace llvm

	#endif			#endif

bolt/include/bolt/Utils/CommandLineOpts.h

	Show All 38 Lines
	extern llvm::cl::opt<bool> RemoveSymtab;			extern llvm::cl::opt<bool> RemoveSymtab;
	extern llvm::cl::opt<unsigned> ExecutionCountThreshold;			extern llvm::cl::opt<unsigned> ExecutionCountThreshold;
	extern llvm::cl::opt<unsigned> HeatmapBlock;			extern llvm::cl::opt<unsigned> HeatmapBlock;
	extern llvm::cl::opt<unsigned long long> HeatmapMaxAddress;			extern llvm::cl::opt<unsigned long long> HeatmapMaxAddress;
	extern llvm::cl::opt<unsigned long long> HeatmapMinAddress;			extern llvm::cl::opt<unsigned long long> HeatmapMinAddress;
	extern llvm::cl::opt<bool> HotData;			extern llvm::cl::opt<bool> HotData;
	extern llvm::cl::opt<bool> HotFunctionsAtEnd;			extern llvm::cl::opt<bool> HotFunctionsAtEnd;
	extern llvm::cl::opt<bool> HotText;			extern llvm::cl::opt<bool> HotText;
				extern llvm::cl::opt<bool> Hugify;
	extern llvm::cl::opt<bool> Instrument;			extern llvm::cl::opt<bool> Instrument;
	extern llvm::cl::opt<std::string> OutputFilename;			extern llvm::cl::opt<std::string> OutputFilename;
	extern llvm::cl::opt<std::string> PerfData;			extern llvm::cl::opt<std::string> PerfData;
	extern llvm::cl::opt<bool> PrintCacheMetrics;			extern llvm::cl::opt<bool> PrintCacheMetrics;
	extern llvm::cl::opt<bool> PrintSections;			extern llvm::cl::opt<bool> PrintSections;
	extern llvm::cl::opt<bool> SplitEH;			extern llvm::cl::opt<bool> SplitEH;
	extern llvm::cl::opt<bool> StrictMode;			extern llvm::cl::opt<bool> StrictMode;
	extern llvm::cl::opt<bool> TimeOpts;			extern llvm::cl::opt<bool> TimeOpts;
	Show All 28 Lines

bolt/lib/Passes/CMakeLists.txt

Show All 10 Lines	add_llvm_library(LLVMBOLTPasses
CallGraphWalker.cpp		CallGraphWalker.cpp
DataflowAnalysis.cpp		DataflowAnalysis.cpp
DataflowInfoManager.cpp		DataflowInfoManager.cpp
ExtTSPReorderAlgorithm.cpp		ExtTSPReorderAlgorithm.cpp
FrameAnalysis.cpp		FrameAnalysis.cpp
FrameOptimizer.cpp		FrameOptimizer.cpp
HFSort.cpp		HFSort.cpp
HFSortPlus.cpp		HFSortPlus.cpp
		Hugify.cpp
IdenticalCodeFolding.cpp		IdenticalCodeFolding.cpp
IndirectCallPromotion.cpp		IndirectCallPromotion.cpp
Inliner.cpp		Inliner.cpp
Instrumentation.cpp		Instrumentation.cpp
JTFootprintReduction.cpp		JTFootprintReduction.cpp
LongJmp.cpp		LongJmp.cpp
LoopInversionPass.cpp		LoopInversionPass.cpp
LivenessAnalysis.cpp		LivenessAnalysis.cpp
Show All 36 Lines

bolt/lib/Passes/Hugify.cpp

This file was added.

				//===--- bolt/Passes/Hugify.cpp -------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				rafauler: Outdated license
				yavtuk: updated

				#include "bolt/Passes/Hugify.h"
				#include "llvm/Support/CommandLine.h"

				#define DEBUG_TYPE "bolt-hugify"

				using namespace llvm;

				namespace llvm {
				namespace bolt {

				void HugePage::runOnFunctions(BinaryContext &BC) {
				auto *RtLibrary = BC.getRuntimeLibrary();
				if (!RtLibrary || !BC.isELF() || !BC.StartFunctionAddress) {
				return;
				}

				auto createSimpleFunction =
				[&](std::string Title, std::vector<MCInst> Instrs) -> BinaryFunction * {
				BinaryFunction *Func = BC.createInjectedBinaryFunction(Title);

				std::vector<std::unique_ptr<BinaryBasicBlock>> BBs;
				BBs.emplace_back(Func->createBasicBlock(nullptr));
				BBs.back()->addInstructions(Instrs.begin(), Instrs.end());
				BBs.back()->setCFIState(0);
				BBs.back()->setOffset(BinaryBasicBlock::INVALID_OFFSET);

				Func->insertBasicBlocks(nullptr, std::move(BBs),
				/*UpdateLayout=*/true,
				/*UpdateCFIState=*/false);
				Func->updateState(BinaryFunction::State::CFG_Finalized);
				return Func;
				};

				const BinaryFunction *const Start =
				BC.getBinaryFunctionAtAddress(*BC.StartFunctionAddress);
				assert(Start && "Entry point function not found");
				const MCSymbol *StartSym = Start->getSymbol();
				createSimpleFunction("__bolt_hugify_start_program",
				BC.MIB->createSymbolTrampoline(StartSym, BC.Ctx.get()));

				}
				} // namespace bolt
				} // namespace llvm
				No newline at end of file
				rafauler: update name here
				rafauler: add a newline at the end of file

bolt/lib/Rewrite/BinaryPassManager.cpp

	//===- bolt/Rewrite/BinaryPassManager.cpp - Binary-level pass manager -----===//			//===- bolt/Rewrite/BinaryPassManager.cpp - Binary-level pass manager -----===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "bolt/Rewrite/BinaryPassManager.h"			#include "bolt/Rewrite/BinaryPassManager.h"
	#include "bolt/Passes/ADRRelaxationPass.h"			#include "bolt/Passes/ADRRelaxationPass.h"
	#include "bolt/Passes/Aligner.h"			#include "bolt/Passes/Aligner.h"
	#include "bolt/Passes/AllocCombiner.h"			#include "bolt/Passes/AllocCombiner.h"
	#include "bolt/Passes/AsmDump.h"			#include "bolt/Passes/AsmDump.h"
	#include "bolt/Passes/CMOVConversion.h"			#include "bolt/Passes/CMOVConversion.h"
	#include "bolt/Passes/FrameOptimizer.h"			#include "bolt/Passes/FrameOptimizer.h"
				#include "bolt/Passes/Hugify.h"
	#include "bolt/Passes/IdenticalCodeFolding.h"			#include "bolt/Passes/IdenticalCodeFolding.h"
	#include "bolt/Passes/IndirectCallPromotion.h"			#include "bolt/Passes/IndirectCallPromotion.h"
	#include "bolt/Passes/Inliner.h"			#include "bolt/Passes/Inliner.h"
	#include "bolt/Passes/Instrumentation.h"			#include "bolt/Passes/Instrumentation.h"
	#include "bolt/Passes/JTFootprintReduction.h"			#include "bolt/Passes/JTFootprintReduction.h"
	#include "bolt/Passes/LongJmp.h"			#include "bolt/Passes/LongJmp.h"
	#include "bolt/Passes/LoopInversionPass.h"			#include "bolt/Passes/LoopInversionPass.h"
	#include "bolt/Passes/PLTCall.h"			#include "bolt/Passes/PLTCall.h"
	▲ Show 20 Lines • Show All 182 Lines • ▼ Show 20 Lines
	opts::AsmDump.getNumOccurrences());			opts::AsmDump.getNumOccurrences());

	if (BC.isAArch64())			if (BC.isAArch64())
	Manager.registerPass(			Manager.registerPass(
	std::make_unique<VeneerElimination>(PrintVeneerElimination));			std::make_unique<VeneerElimination>(PrintVeneerElimination));

	if (opts::Instrument)			if (opts::Instrument)
	Manager.registerPass(std::make_unique<Instrumentation>(NeverPrint));			Manager.registerPass(std::make_unique<Instrumentation>(NeverPrint));
				else if (opts::Hugify)
				Manager.registerPass(std::make_unique<HugePage>(NeverPrint));

	// Here we manage dependencies/order manually, since passes are run in the			// Here we manage dependencies/order manually, since passes are run in the
	// order they're registered.			// order they're registered.

	// Run this pass first to use stats for the original functions.			// Run this pass first to use stats for the original functions.
	Manager.registerPass(std::make_unique<PrintProgramStats>(NeverPrint));			Manager.registerPass(std::make_unique<PrintProgramStats>(NeverPrint));

	if (opts::PrintProfileStats)			if (opts::PrintProfileStats)
	▲ Show 20 Lines • Show All 91 Lines • Show Last 20 Lines

bolt/lib/Rewrite/RewriteInstance.cpp

	Show First 20 Lines • Show All 91 Lines • ▼ Show 20 Lines
	outs() << "BOLT-INFO: first alloc address is 0x"			outs() << "BOLT-INFO: first alloc address is 0x"
	<< Twine::utohexstr(BC->FirstAllocAddress) << '\n';			<< Twine::utohexstr(BC->FirstAllocAddress) << '\n';

	FirstNonAllocatableOffset = NextAvailableOffset;			FirstNonAllocatableOffset = NextAvailableOffset;

	NextAvailableAddress = alignTo(NextAvailableAddress, BC->PageAlign);			NextAvailableAddress = alignTo(NextAvailableAddress, BC->PageAlign);
	NextAvailableOffset = alignTo(NextAvailableOffset, BC->PageAlign);			NextAvailableOffset = alignTo(NextAvailableOffset, BC->PageAlign);

				// Hugify: Additional huge page from left side
				if (opts::Hugify)
				NextAvailableAddress += BC->PageAlign;
				rafauler: Why is that needed?
				yavtuk: It's needed due to the HUGEPAGE allocation policy and also due to the bug in old kernels where the dynamic loader doesn't take the p_align field into account. The dynamic loader allocates and maps the segments sequentially with 4KB address alignment. To get a HUGEPAGE from the OS, the page address must be 2MB-aligned. For that, I add padding on the left and right sides to prevent overlap between segments.
				rafauler: From the left side, this is already aligned via BinaryContext::PageAlign. This is not just setting p_align, but actually setting the start address to be aligned at a 2MB boundary. So this line here is inserting an extra empty 2MB page, but I'm not sure I get the reason why.

	if (!opts::UseGnuStack) {			if (!opts::UseGnuStack) {
	// This is where the black magic happens. Creating PHDR table in a segment			// This is where the black magic happens. Creating PHDR table in a segment
	// other than that containing ELF header is tricky. Some loaders and/or			// other than that containing ELF header is tricky. Some loaders and/or
	// parts of loaders will apply e_phoff from ELF header assuming both are in			// parts of loaders will apply e_phoff from ELF header assuming both are in
	// the same segment, while others will do the proper calculation.			// the same segment, while others will do the proper calculation.
	// We create the new PHDR table in such a way that both of the methods			// We create the new PHDR table in such a way that both of the methods
	// of loading and locating the table work. There's a slight file size			// of loading and locating the table work. There's a slight file size
	// overhead because of that.			// overhead because of that.
	▲ Show 20 Lines • Show All 182 Lines • ▼ Show 20 Lines
	uint64_t PaddingSize = 0; // size of padding required at the end			uint64_t PaddingSize = 0; // size of padding required at the end

	// Allocate sections starting at a given Address.			// Allocate sections starting at a given Address.
	auto allocateAt = [&](uint64_t Address) {			auto allocateAt = [&](uint64_t Address) {
	for (BinarySection *Section : CodeSections) {			for (BinarySection *Section : CodeSections) {
	Address = alignTo(Address, Section->getAlignment());			Address = alignTo(Address, Section->getAlignment());
	Section->setOutputAddress(Address);			Section->setOutputAddress(Address);
	Address += Section->getOutputSize();			Address += Section->getOutputSize();

				// Hugify: Additional huge page from right side
				if (opts::Hugify)
				Address = alignTo(Address, Section->getAlignment());
				rafauler: Same
				rafauler: For the right side, this alignment is accomplished by lines RewriteInstance.cpp:3635 (using this diff's lines), where we pad the end of the code section until it is aligned at 2MB. I understand that other code sections might be allocated to the huge page if we don't have these lines added by this diff, but I'm not sure why that is a problem. If you have space left in a huge page, why wouldn't you put code there?
	}			}

	// Make sure we allocate enough space for huge pages.			// Make sure we allocate enough space for huge pages.
	if (opts::HotText) {			if (opts::HotText) {
	uint64_t HotTextEnd =			uint64_t HotTextEnd =
	TextSection->getOutputAddress() + TextSection->getOutputSize();			TextSection->getOutputAddress() + TextSection->getOutputSize();
	HotTextEnd = alignTo(HotTextEnd, BC->PageAlign);			HotTextEnd = alignTo(HotTextEnd, BC->PageAlign);
	if (HotTextEnd > Address) {			if (HotTextEnd > Address) {
	▲ Show 20 Lines • Show All 45 Lines • ▼ Show 20 Lines
	<< " to accommodate hot text\n";			<< " to accommodate hot text\n";

	return;			return;
	}			}

	// Processing in non-relocation mode.			// Processing in non-relocation mode.
	uint64_t NewTextSectionStartAddress = NextAvailableAddress;			uint64_t NewTextSectionStartAddress = NextAvailableAddress;

	for (auto &BFI : BC->getBinaryFunctions()) {			for (auto &BFI : BC->getBinaryFunctions()) {
	BinaryFunction &Function = BFI.second;			BinaryFunction &Function = BFI.second;
				rafauler: Another thing is that we should right-pad here just for the hot section, not for every code section.
	if (!Function.isEmitted())			if (!Function.isEmitted())
	continue;			continue;

	bool TooLarge = false;			bool TooLarge = false;
	ErrorOr<BinarySection &> FuncSection = Function.getCodeSection();			ErrorOr<BinarySection &> FuncSection = Function.getCodeSection();
	assert(FuncSection && "cannot find section for function");			assert(FuncSection && "cannot find section for function");
	FuncSection->setOutputAddress(Function.getAddress());			FuncSection->setOutputAddress(Function.getAddress());
	LLVM_DEBUG(dbgs() << "BOLT: mapping 0x"			LLVM_DEBUG(dbgs() << "BOLT: mapping 0x"
	Show All 28 Lines
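
The left-side padding questioned in the inline comments on this file can be sketched as follows. This assumes, as the reviewer states, that BC->PageAlign is 2MB when hugify is enabled; alignToPow2 is a hypothetical stand-in for llvm::alignTo restricted to power-of-two alignments, and the input address is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::alignTo, valid for power-of-two Align only.
constexpr std::uint64_t alignToPow2(std::uint64_t Value, std::uint64_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

// Mirrors the two steps shown in the diff: align the next available address
// to the page size, then (under hugify) skip one further page.
constexpr std::uint64_t nextAvailableAddress(std::uint64_t Addr,
                                             std::uint64_t PageAlign,
                                             bool Hugify) {
  return alignToPow2(Addr, PageAlign) + (Hugify ? PageAlign : 0);
}

constexpr std::uint64_t HugePage = 0x200000; // 2MB, assumed PageAlign

// Without the extra step the address is already 2MB-aligned...
static_assert(nextAvailableAddress(0x612345, HugePage, false) == 0x800000,
              "alignTo alone yields a huge-page-aligned start");
// ...so the addition only moves it one whole (empty) huge page further.
static_assert(nextAvailableAddress(0x612345, HugePage, true) == 0xA00000,
              "the += inserts one extra 2MB page");
```

Under these assumptions the extra line does not change alignment; it only reserves one additional empty page, which is the behavior the reviewer asks the author to justify.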

bolt/lib/RuntimeLibs/HugifyRuntimeLibrary.cpp

Show First 20 Lines • Show All 53 Lines • ▼ Show 20 Lines	void HugifyRuntimeLibrary::adjustCommandLineOptions(
opts::HotText = true;		opts::HotText = true;
if (!BC.StartFunctionAddress) {		if (!BC.StartFunctionAddress) {
errs() << "BOLT-ERROR: hugify runtime libraries require a known entry "		errs() << "BOLT-ERROR: hugify runtime libraries require a known entry "
"point of "		"point of "
"the input binary\n";		"the input binary\n";
exit(1);		exit(1);
}		}
}		}

void HugifyRuntimeLibrary::emitBinary(BinaryContext &BC, MCStreamer &Streamer) {
const BinaryFunction *StartFunction =
BC.getBinaryFunctionAtAddress(*(BC.StartFunctionAddress));
assert(!StartFunction->isFragment() && "expected main function fragment");
if (!StartFunction) {
errs() << "BOLT-ERROR: failed to locate function at binary start address\n";
exit(1);
}

const auto Flags = BinarySection::getFlags(/*IsReadOnly=*/false,
/*IsText=*/false,
/*IsAllocatable=*/true);
MCSectionELF *Section =
BC.Ctx->getELFSection(".bolt.hugify.entries", ELF::SHT_PROGBITS, Flags);

// __bolt_hugify_init_ptr stores the poiter the hugify library needs to
// jump to after finishing the init code.
MCSymbol *InitPtr = BC.Ctx->getOrCreateSymbol("__bolt_hugify_init_ptr");

Section->setAlignment(llvm::Align(BC.RegularPageSize));
Streamer.switchSection(Section);

Streamer.emitLabel(InitPtr);
Streamer.emitSymbolAttribute(InitPtr, MCSymbolAttr::MCSA_Global);
Streamer.emitValue(
MCSymbolRefExpr::create(StartFunction->getSymbol(), *(BC.Ctx)),
/*Size=*/8);
}

void HugifyRuntimeLibrary::link(BinaryContext &BC, StringRef ToolPath,		void HugifyRuntimeLibrary::link(BinaryContext &BC, StringRef ToolPath,
		rafauler: I think we can delete all of this, right? emitBinary() for HugifyRuntimeLibrary doesn't need to do anything if you are already creating a new function in a new pass that will be called from the runtime. And then go to HugifyRuntimeLibrary.h and delete SecNames.push_back(".bolt.hugify.entries");
		yavtuk: yes, you are right, thanks
RuntimeDyld &RTDyld,		RuntimeDyld &RTDyld,
std::function<void(RuntimeDyld &)> OnLoad) {		std::function<void(RuntimeDyld &)> OnLoad) {
std::string LibPath = getLibPath(ToolPath, opts::RuntimeHugifyLib);		std::string LibPath = getLibPath(ToolPath, opts::RuntimeHugifyLib);
loadLibrary(LibPath, RTDyld);		loadLibrary(LibPath, RTDyld);
OnLoad(RTDyld);		OnLoad(RTDyld);
RTDyld.finalizeWithMemoryManagerLocking();		RTDyld.finalizeWithMemoryManagerLocking();
if (RTDyld.hasError()) {		if (RTDyld.hasError()) {
outs() << "BOLT-ERROR: RTDyld failed: " << RTDyld.getErrorString() << "\n";		outs() << "BOLT-ERROR: RTDyld failed: " << RTDyld.getErrorString() << "\n";
Show All 13 Lines

bolt/runtime/CMakeLists.txt

Show All 18 Lines	add_library(bolt_rt_hugify STATIC
${CMAKE_CURRENT_BINARY_DIR}/config.h		${CMAKE_CURRENT_BINARY_DIR}/config.h
)		)

set(BOLT_RT_FLAGS		set(BOLT_RT_FLAGS
-ffreestanding		-ffreestanding
-fno-exceptions		-fno-exceptions
-fno-rtti		-fno-rtti
-fno-stack-protector		-fno-stack-protector
-mno-sse)		-mno-sse
		-fPIE)
		yota9: For hugify it is OK to use -fPIE, but the rest of the libs must use the -fPIC flag. I suggest using -fPIC here; there should be no real difference.

# Don't let the compiler think it can create calls to standard libs		# Don't let the compiler think it can create calls to standard libs
target_compile_options(bolt_rt_instr PRIVATE ${BOLT_RT_FLAGS} -fPIE)		target_compile_options(bolt_rt_instr PRIVATE ${BOLT_RT_FLAGS})
target_include_directories(bolt_rt_instr PRIVATE ${CMAKE_CURRENT_BINARY_DIR})		target_include_directories(bolt_rt_instr PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
target_compile_options(bolt_rt_hugify PRIVATE ${BOLT_RT_FLAGS})		target_compile_options(bolt_rt_hugify PRIVATE ${BOLT_RT_FLAGS})
target_include_directories(bolt_rt_hugify PRIVATE ${CMAKE_CURRENT_BINARY_DIR})		target_include_directories(bolt_rt_hugify PRIVATE ${CMAKE_CURRENT_BINARY_DIR})

install(TARGETS bolt_rt_instr DESTINATION lib)		install(TARGETS bolt_rt_instr DESTINATION lib)
install(TARGETS bolt_rt_hugify DESTINATION lib)		install(TARGETS bolt_rt_hugify DESTINATION lib)

if (CMAKE_CXX_COMPILER_ID MATCHES ".*Clang.*")		if (CMAKE_CXX_COMPILER_ID MATCHES ".*Clang.*")
Show All 10 Lines

bolt/runtime/common.h

Show First 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	#define RESTORE_ALL \
"pop %%rdx\n" \		"pop %%rdx\n" \
"pop %%rcx\n" \		"pop %%rcx\n" \
"pop %%rbx\n" \		"pop %%rbx\n" \
"pop %%rax\n"		"pop %%rax\n"

// Functions that are required by freestanding environment. Compiler may		// Functions that are required by freestanding environment. Compiler may
// generate calls to these implicitly.		// generate calls to these implicitly.
extern "C" {		extern "C" {
void memcpy(void Dest, const void *Src, size_t Len) {		void memcpy(void Dest, const void *Src, size_t Len) {
		rafauler (Not Done): Why is that needed?
		yavtuk (Author, Done): good question :-) The user-func-reorder test fails, and it was hard to reproduce the cause locally since it is related to the compiler. With this attribute we have the following assembly for memcpy:
		.Loop:
		...
		movzbl (%rsi,%rdi,1),%ecx
		mov %cl,(%rax,%rdi,1)
		add $0x1,%rdi
		cmp %rdi,%r9
		jne a004a0 <_fini+0x2c4>
		...
		mov %r14,%rdi
		mov %r15,%rsi
		mov %rbx,%rdx
		callq .Loop
		Copying is performed byte by byte with verification. Without this attribute I see the following:
		.Loop:
		...
		movzbl 0x0(%r13,%rax,1),%edx
		mov %dl,(%rbx,%rax,1)
		movzbl 0x1(%r13,%rax,1),%edx
		mov %dl,0x1(%rbx,%rax,1)
		movzbl 0x2(%r13,%rax,1),%edx
		mov %dl,0x2(%rbx,%rax,1)
		movzbl 0x3(%r13,%rax,1),%edx
		mov %dl,0x3(%rbx,%rax,1)
		movzbl 0x4(%r13,%rax,1),%edx
		mov %dl,0x4(%rbx,%rax,1)
		movzbl 0x5(%r13,%rax,1),%edx
		mov %dl,0x5(%rbx,%rax,1)
		movzbl 0x6(%r13,%rax,1),%edx
		mov %dl,0x6(%rbx,%rax,1)
		movzbl 0x7(%r13,%rax,1),%edx
		mov %dl,0x7(%rbx,%rax,1)
		add $0x8,%rax
		cmp %rax,%rcx
		jne a007f0 <_fini+0x614>
		Copying is performed with unrolling, and the test fails due to overlapping dst and src addresses for a size which is not aligned to 8 bytes.
		rafauler (Not Done): Ok, I debugged user-func-reorder and noticed that this patch is doing something else. Instead of copying the entire contents of the huge page, it is copying only hot functions. I go back to my original point: why do we need to avoid overlapping segments? If you roll back to the previous behavior (copying all contents of the page, including cold text segments), you won't need to insert one extra alignment at the end of each code section.
uint8_t *d = static_cast<uint8_t *>(Dest);		uint8_t *d = static_cast<uint8_t *>(Dest);
const uint8_t *s = static_cast<const uint8_t *>(Src);		const uint8_t *s = static_cast<const uint8_t *>(Src);
while (Len--)		while (Len-- && d != s)
*d++ = *s++;		*d++ = *s++;
return Dest;		return Dest;
}		}
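
The overlap problem discussed in the thread above can be reproduced outside the runtime. The following is a minimal sketch, not code from the patch: `forwardCopy` and `overlapSafeCopy` are hypothetical names that mirror, respectively, a plain forward byte loop (like the runtime memcpy) and a direction-aware copy (like the runtime memmove).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: plain forward byte copy. When Dst lies inside
// [Src, Src + Len), bytes that were already overwritten get copied again.
inline void forwardCopy(uint8_t *Dst, const uint8_t *Src, size_t Len) {
  assert(Dst != nullptr && Src != nullptr);
  while (Len--)
    *Dst++ = *Src++;
}

// Hypothetical helper: picks the copy direction from the pointer order,
// so overlapping ranges within one buffer are copied intact.
inline void overlapSafeCopy(uint8_t *Dst, const uint8_t *Src, size_t Len) {
  assert(Dst != nullptr && Src != nullptr);
  if (Dst < Src) {
    while (Len--)
      *Dst++ = *Src++;
  } else {
    Dst += Len;
    Src += Len;
    while (Len--)
      *--Dst = *--Src;
  }
}
```

Copying 4 bytes from offset 0 to offset 2 of the buffer "abcdef" gives "ababab" with `forwardCopy` and "ababcd" with `overlapSafeCopy`.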

void *memmove(void *Dest, const void *Src, size_t Len) {		void *memmove(void *Dest, const void *Src, size_t Len) {
uint8_t *d = static_cast<uint8_t *>(Dest);		uint8_t *d = static_cast<uint8_t *>(Dest);
const uint8_t *s = static_cast<const uint8_t *>(Src);		const uint8_t *s = static_cast<const uint8_t *>(Src);
if (d < s) {		if (d < s) {
while (Len--)		while (Len--)
*d++ = *s++;		*d++ = *s++;
} else {		} else {
s += Len - 1;		s += Len - 1;
d += Len - 1;		d += Len - 1;
while (Len--)		while (Len--)
*d-- = *s--;		*d-- = *s--;
}		}

return Dest;		return Dest;
}		}

void *memset(void *Buf, int C, size_t Size) {		void *memset(void *Buf, int C, size_t Size) {
char *S = (char *)Buf;		char *S = (char *)Buf;
for (size_t I = 0; I < Size; ++I)		while (Size--)
*S++ = C;		*S++ = C;
return Buf;		return Buf;
}		}

int memcmp(const void *s1, const void *s2, size_t n) {		int memcmp(const void *s1, const void *s2, size_t n) {
const uint8_t *c1 = static_cast<const uint8_t *>(s1);		const uint8_t *c1 = static_cast<const uint8_t *>(s1);
const uint8_t *c2 = static_cast<const uint8_t *>(s2);		const uint8_t *c2 = static_cast<const uint8_t *>(s2);
for (; n--; c1++, c2++) {		for (; n--; c1++, c2++) {
[... 161 lines not shown ...]

uint32_t strLen(const char *Str) {		uint32_t strLen(const char *Str) {
uint32_t Size = 0;		uint32_t Size = 0;
while (*Str++)		while (*Str++)
++Size;		++Size;
return Size;		return Size;
}		}

		void *strStr(const char *const Haystack, const char *const Needle) {
		int j = 0;

		for (int i = 0; i < strLen(Haystack); i++) {
		if (Haystack[i] == Needle[0]) {
		for (j = 1; j < strLen(Needle); j++) {
		if (Haystack[i + j] != Needle[j])
		break;
		}
		if (j == strLen(Needle))
		return (void *)&Haystack[i];
		}
		}
		return nullptr;
		}
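
The strStr added above calls strLen again on every loop iteration. As a rough illustration only, here is a standalone sketch of the same search that computes both lengths once; `lengthOf` and `findSubstring` are hypothetical names and are not part of the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the runtime's strLen.
inline size_t lengthOf(const char *Str) {
  size_t Size = 0;
  while (Str[Size] != '\0')
    ++Size;
  return Size;
}

// Hypothetical helper: returns a pointer to the first occurrence of Needle
// in Haystack, or nullptr when there is none.
inline const char *findSubstring(const char *Haystack, const char *Needle) {
  assert(Haystack != nullptr && Needle != nullptr);
  const size_t HayLen = lengthOf(Haystack);
  const size_t NeedleLen = lengthOf(Needle);
  // Like the strStr above, an empty needle is treated as "not found".
  if (NeedleLen == 0 || NeedleLen > HayLen)
    return nullptr;
  for (size_t I = 0; I + NeedleLen <= HayLen; ++I) {
    size_t J = 0;
    while (J < NeedleLen && Haystack[I + J] == Needle[J])
      ++J;
    if (J == NeedleLen)
      return Haystack + I;
  }
  return nullptr;
}
```

For the sysfs string "always [madvise] never", searching for "[madvise]" returns a pointer to offset 7, while searching for "[always]" returns nullptr.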

void reportNumber(const char *Msg, uint64_t Num, uint32_t Base) {		void reportNumber(const char *Msg, uint64_t Num, uint32_t Base) {
char Buf[BufSize];		char Buf[BufSize];
char *Ptr = Buf;		char *Ptr = Buf;
Ptr = strCopy(Ptr, Msg, BufSize - 23);		Ptr = strCopy(Ptr, Msg, BufSize - 23);
Ptr = intToStr(Ptr, Num, Base);		Ptr = intToStr(Ptr, Num, Base);
Ptr = strCopy(Ptr, "\n");		Ptr = strCopy(Ptr, "\n");
__write(2, Buf, Ptr - Buf);		__write(2, Buf, Ptr - Buf);
}		}
[... 11 lines not shown ...]	while (*Str != Terminator) {
else if ('A' <= *Str && *Str <= 'F')		else if ('A' <= *Str && *Str <= 'F')
Res += *Str++ - 'A' + 10;		Res += *Str++ - 'A' + 10;
else		else
return 0;		return 0;
}		}
return Res;		return Res;
}		}

		/// Starting from character at \p buf, find the longest consecutive sequence
		/// of digits (0-9) and convert it to uint32_t. The converted value
		/// is put into \p ret. \p end marks the end of the buffer to avoid buffer
		/// overflow. The function \returns whether a valid uint32_t value is found.
		/// \p buf will be updated to the next character right after the digits.
		static bool scanUInt32(const char *&Buf, const char *End, uint32_t &Ret) {
		uint64_t Result = 0;
		const char *OldBuf = Buf;
		while (Buf < End && ((*Buf) >= '0' && (*Buf) <= '9')) {
		Result = Result * 10 + (*Buf) - '0';
		++Buf;
		}
		if (OldBuf != Buf && Result <= 0xFFFFFFFFu) {
		Ret = static_cast<uint32_t>(Result);
		return true;
		}
		return false;
		}

#if !defined(__APPLE__)		#if !defined(__APPLE__)
// We use a stack-allocated buffer for string manipulation in many pieces of		// We use a stack-allocated buffer for string manipulation in many pieces of
// this code, including the code that prints each line of the fdata file. This		// this code, including the code that prints each line of the fdata file. This
// buffer needs to accommodate large function names, but shouldn't be arbitrarily		// buffer needs to accommodate large function names, but shouldn't be arbitrarily
// large (dynamically allocated) for simplicity of our memory space usage.		// large (dynamically allocated) for simplicity of our memory space usage.

// Declare some syscall wrappers we use throughout this code to avoid linking		// Declare some syscall wrappers we use throughout this code to avoid linking
// against system libc.		// against system libc.
[... 61 lines not shown ...]	int __madvise(void *addr, size_t length, int advice) {
__asm__ __volatile__("movq $28, %%rax\n"		__asm__ __volatile__("movq $28, %%rax\n"
"syscall\n"		"syscall\n"
: "=a"(ret)		: "=a"(ret)
: "D"(addr), "S"(length), "d"(advice)		: "D"(addr), "S"(length), "d"(advice)
: "cc", "rcx", "r11", "memory");		: "cc", "rcx", "r11", "memory");
return ret;		return ret;
}		}

		#define _UTSNAME_LENGTH 65

		struct UtsNameTy {
		char sysname[_UTSNAME_LENGTH]; /* Operating system name (e.g., "Linux") */
		char nodename[_UTSNAME_LENGTH]; /* Name within "some implementation-defined
		network" */
		char release[_UTSNAME_LENGTH]; /* Operating system release (e.g., "2.6.28") */
		char version[_UTSNAME_LENGTH]; /* Operating system version */
		char machine[_UTSNAME_LENGTH]; /* Hardware identifier */
		char domainname[_UTSNAME_LENGTH]; /* NIS or YP domain name */
		};

		int __uname(struct UtsNameTy *Buf) {
		rafauler (Done): Capitalize Buf to follow LLVM coding style.
		int Ret;
		__asm__ __volatile__("movq $63, %%rax\n"
		"syscall\n"
		: "=a"(Ret)
		: "D"(Buf)
		: "cc", "rcx", "r11", "memory");
		return Ret;
		}

struct timespec {		struct timespec {
uint64_t tv_sec; /* seconds */		uint64_t tv_sec; /* seconds */
uint64_t tv_nsec; /* nanoseconds */		uint64_t tv_nsec; /* nanoseconds */
};		};

uint64_t __nanosleep(const timespec *req, timespec *rem) {		uint64_t __nanosleep(const timespec *req, timespec *rem) {
uint64_t ret;		uint64_t ret;
__asm__ __volatile__("movq $35, %%rax\n"		__asm__ __volatile__("movq $35, %%rax\n"
[... 79 lines not shown ...]	int __fsync(int fd) {
__asm__ __volatile__("movq $74, %%rax\n"		__asm__ __volatile__("movq $74, %%rax\n"
"syscall\n"		"syscall\n"
: "=a"(ret)		: "=a"(ret)
: "D"(fd)		: "D"(fd)
: "cc", "rcx", "r11", "memory");		: "cc", "rcx", "r11", "memory");
return ret;		return ret;
}		}

		// %rdi %rsi %rdx %r10 %r8
		// sys_prctl int option unsigned unsigned unsigned unsigned
		// long arg2 long arg3 long arg4 long arg5
		int __prctl(int Option, unsigned long Arg2, unsigned long Arg3,
		unsigned long Arg4, unsigned long Arg5) {
		rafauler (Done): Capitalize Option, Arg2, etc. to follow LLVM coding style.
		int Ret;
		register long rdx asm("rdx") = Arg3;
		register long r8 asm("r8") = Arg5;
		register long r10 asm("r10") = Arg4;
		__asm__ __volatile__("movq $157, %%rax\n"
		"syscall\n"
		: "=a"(Ret)
		: "D"(Option), "S"(Arg2), "d"(rdx), "r"(r10), "r"(r8)
		:);
		return Ret;
		}

#endif		#endif

void reportError(const char *Msg, uint64_t Size) {		void reportError(const char *Msg, uint64_t Size) {
__write(2, Msg, Size);		__write(2, Msg, Size);
__exit(1);		__exit(1);
}		}

void assert(bool Assertion, const char *Msg) {		void assert(bool Assertion, const char *Msg) {
[... 62 lines not shown ...]

bolt/runtime/hugify.cpp

	//===- bolt/runtime/hugify.cpp --------------------------------------------===//			//===- bolt/runtime/hugify.cpp -------------------------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
				rafauler (Done): This is the wrong license. The LLVM license has been updated to Apache v2.0 with LLVM Exceptions (the one that was previously in the header). Could you revert this change?
	//===----------------------------------------------------------------------===//			//===---------------------------------------------------------------------===//

	#if defined (__x86_64__)			#if defined (__x86_64__)
	#if !defined(__APPLE__)			#if !defined(__APPLE__)

	#include "common.h"			#include "common.h"
	#include <sys/mman.h>

	// Enables a very verbose logging to stderr useful when debugging			// Enables a very verbose logging to stderr useful when debugging
	//#define ENABLE_DEBUG			// #define ENABLE_DEBUG

	// Function pointers to init routines in the binary, so we can resume			// Function contains trampoline to _start,
	// regular execution of the function that we hooked.			// so we can resume regular execution of the function that we hooked.
	extern void (*__bolt_hugify_init_ptr)();			extern void __bolt_hugify_start_program();
				rafauler (Done): Can you remove "_ptr" from the name, since it is not a function pointer anymore? I suggest naming it "__bolt_hugify_start_program" (but it's just a suggestion).

	// The __hot_start and __hot_end symbols set by Bolt. We use them to figure			// The __hot_start and __hot_end symbols set by Bolt. We use them to figure
	// out the range for marking huge pages.			// out the range for marking huge pages.
	extern uint64_t __hot_start;			extern uint64_t __hot_start;
	extern uint64_t __hot_end;			extern uint64_t __hot_end;

	#ifdef MADV_HUGEPAGE			static void getKernelVersion(uint32_t *Val) {
				// release should be in the format: %d.%d.%d
				// major, minor, release
				struct UtsNameTy UtsName;
				int Ret = __uname(&UtsName);
				const char *Buf = UtsName.release;
				const char *End = Buf + strLen(Buf);
				const char Delims[2][2] = {".", "."};

				for (int i = 0; i < 3; ++i) {
				if (!scanUInt32(Buf, End, Val[i])) {
				return;
				}
				if (i < sizeof(Delims) / sizeof(Delims[0])) {
				const char *Ptr = Delims[i];
				while (*Ptr != '\0') {
				if (*Ptr != *Buf) {
				return;
				}
				++Ptr;
				rafauler (Done): getKernelVersion(Val) to follow LLVM coding style (I realize this file doesn't follow LLVM style, but it is small enough that we can just update it).
				++Buf;
				}
				}
				}
				}
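
getKernelVersion above returns void, so a caller cannot tell whether the release string was parsed completely. As a sketch under stated assumptions (the names `KernelVersion`, `parseDecimal` and `parseKernelRelease` are hypothetical and not from the patch), the same "%d.%d.%d" parsing could report success explicitly:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type; fields default to zero.
struct KernelVersion {
  uint32_t Major = 0;
  uint32_t Minor = 0;
  uint32_t Patch = 0;
};

// Hypothetical helper: consumes a run of decimal digits at Cursor.
// Returns false if there are no digits or the value exceeds uint32_t.
inline bool parseDecimal(const char *&Cursor, uint32_t &Out) {
  const char *Start = Cursor;
  uint64_t Value = 0;
  while (*Cursor >= '0' && *Cursor <= '9') {
    Value = Value * 10 + static_cast<uint64_t>(*Cursor - '0');
    if (Value > 0xFFFFFFFFu)
      return false;
    ++Cursor;
  }
  if (Cursor == Start)
    return false;
  Out = static_cast<uint32_t>(Value);
  return true;
}

// Hypothetical helper: parses "major.minor.patch" at the start of Release.
// Trailing text such as "-generic" is ignored. Out is written on success only.
inline bool parseKernelRelease(const char *Release, KernelVersion &Out) {
  assert(Release != nullptr);
  KernelVersion Parsed;
  const char *Cursor = Release;
  if (!parseDecimal(Cursor, Parsed.Major) || *Cursor != '.')
    return false;
  ++Cursor;
  if (!parseDecimal(Cursor, Parsed.Minor) || *Cursor != '.')
    return false;
  ++Cursor;
  if (!parseDecimal(Cursor, Parsed.Patch))
    return false;
  Out = Parsed;
  return true;
}
```

With this shape, a release string that has fewer than three numeric components is rejected instead of leaving part of the result unset.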

	/// Check whether the kernel supports THP via corresponding sysfs entry.			/// Check whether the kernel supports THP via corresponding sysfs entry.
				rafauler (Done): Move to common.h, close to "hexToLong". Capitalize variables to follow LLVM style (see the hexToLong example).
	static bool has_pagecache_thp_support() {			/// thp works only starting from 5.10
				rafauler (Done): hasPagecacheTHPSupport
	char buf[256] = {0};			static bool hasPagecacheTHPSupport() {
	const char *madviseStr = "always [madvise] never";			char Buf[64];
				const uint64_t MadviseOptions = 2;
				const char *const MadviseOpt[MadviseOptions] = {"[always]", "[madvise]"};
				rafauler (Done): Buf

	int fd = __open("/sys/kernel/mm/transparent_hugepage/enabled",			int FD = __open("/sys/kernel/mm/transparent_hugepage/enabled",
				rafauler (Done): FD
	0 /* O_RDONLY */, 0);			0 /* O_RDONLY */, 0);
	if (fd < 0)			if (FD < 0)
	return false;			return false;

	size_t res = __read(fd, buf, 256);			memset(Buf, 0, sizeof(Buf));
	if (res < 0)			const size_t Res = __read(FD, Buf, sizeof(Buf));
				if (Res < 0)
	return false;			return false;

	int cmp = strnCmp(buf, madviseStr, strLen(madviseStr));			struct KernelVersionTy {
	return cmp == 0;			uint32_t major;
				uint32_t minor;
				uint32_t release;
				};

				rafauler (Done): struct KernelVersionTy { uint32_t major; uint32_t minor; uint32_t release; }; KernelVersionTy KernelVersion;
				KernelVersionTy KernelVersion;
				rafauler (Not Done): Here we can pass AlignedFrom and AlignedTo only.
				rafauler (Not Done): Remove From

				getKernelVersion((uint32_t *)&KernelVersion);
				rafauler (Not Done): Remove line, use AlignedSize only.

				rafauler (Not Done): Here use AlignedSize.
				for (unsigned int i = 0; i < MadviseOptions; i++) {
				if (strStr(Buf, MadviseOpt[i]) && KernelVersion.major >= 5 &&
				KernelVersion.minor >= 10) {
				return true;
				}
				}
				return false;
	}			}
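
As written in this revision, the check `KernelVersion.major >= 5 && KernelVersion.minor >= 10` tests the two fields independently, so a release such as 6.1 would not pass even though it is newer than 5.10. A minimal sketch of a lexicographic comparison follows; `KernelVersion` and `isAtLeast` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-field version; the patch's KernelVersionTy also
// carries a release number, which is not needed for this comparison.
struct KernelVersion {
  uint32_t Major;
  uint32_t Minor;
};

// Returns true when V is the same as or newer than Major.Minor.
// The minor number only matters when the major numbers are equal.
inline bool isAtLeast(const KernelVersion &V, uint32_t Major, uint32_t Minor) {
  if (V.Major != Major)
    return V.Major > Major;
  return V.Minor >= Minor;
}
```

Under this comparison 5.10 and 6.1 both satisfy "at least 5.10", while 5.9 and 4.19 do not.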
				rafauler (Done): hugifyForOldKernel(From, To ..

	static void hugify_for_old_kernel(uint8_t *from, uint8_t *to) {			static void hugifyForOldKernel(uint8_t *From, uint8_t *To,
	size_t size = to - from;			uint8_t *FromAlignedPage,
				uint8_t *ToAlignedPage) {
	uint8_t *mem = reinterpret_cast<uint8_t *>(			const size_t HugePageBytes = 2L * 1024 * 1024;
	__mmap(0, size, 0x3 /* PROT_READ | PROT_WRITE*/,			const size_t Size = To - From;
				const size_t SizeHugePageAligned = Size + (HugePageBytes - ((intptr_t)Size & (HugePageBytes - 1)));
				uint8_t *Mem = reinterpret_cast<uint8_t *>(
				__mmap(0, SizeHugePageAligned, 0x3 /* PROT_READ | PROT_WRITE */,
	0x22 /* MAP_PRIVATE | MAP_ANONYMOUS*/, -1, 0));			0x22 /* MAP_PRIVATE | MAP_ANONYMOUS */, -1, 0));

	if (mem == (void *)MAP_FAILED) {			if (Mem == ((void *)-1) /* MAP_FAILED */) {
	char msg[] = "Could not allocate memory for text move\n";			char Msg[] = "[hugify] could not allocate memory for text move\n";
	reportError(msg, sizeof(msg));			reportError(Msg, sizeof(Msg));
	}			}

				rafauler (Not Done): I think we can just use the old behavior here: work with two pointers, From and To, which are both aligned to the page. We don't ever need to know the original hot sizes in this function; passing them makes the function more confusing to read.
	#ifdef ENABLE_DEBUG			#ifdef ENABLE_DEBUG
	reportNumber("Allocated temporary space: ", (uint64_t)mem, 16);			reportNumber("[hugify] allocated temporary address: ", (uint64_t)Mem, 16);
				reportNumber("[hugify] allocated aligned size: ", (uint64_t)SizeHugePageAligned, 16);
				reportNumber("[hugify] allocated size: ", (uint64_t)Size, 16);
	#endif			#endif

	// Copy the hot code to a temproary location.			// Copy the hot code to a temporary location.
	memcpy(mem, from, size);			memcpy(Mem, From, Size);
				rafauler (Not Done): Mistake is probably here: copying fewer bytes than are actually needed, so the program will crash.
				rafauler (Not Done): Here use AlignedFrom, AlignedSize.

				__prctl(41 /* PR_SET_THP_DISABLE */, 0, 0, 0, 0);
	// Maps out the existing hot code.			// Maps out the existing hot code.
	if (__mmap(reinterpret_cast<uint64_t>(from), size,			if (__mmap(reinterpret_cast<uint64_t>(FromAlignedPage),
	PROT_READ | PROT_WRITE | PROT_EXEC,			ToAlignedPage - FromAlignedPage, 0x3 /* PROT_READ | PROT_WRITE */,
	MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1,			0x32 /* MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE */, -1,
	0) == (void *)MAP_FAILED) {			0) == ((void *)-1) /*MAP_FAILED*/) {
	char msg[] = "failed to mmap memory for large page move terminating\n";			char Msg[] =
	reportError(msg, sizeof(msg));			"[hugify] failed to mmap memory for large page move terminating\n";
				reportError(Msg, sizeof(Msg));
				rafauler (Not Done): ...because here you are using Aligned Size.
	}			}

	// Mark the hot code page to be huge page.			// Mark the hot code page to be huge page.
	if (__madvise(from, size, MADV_HUGEPAGE) == -1) {			if (__madvise(FromAlignedPage, ToAlignedPage - FromAlignedPage,
	char msg[] = "failed to allocate large page\n";			14 /* MADV_HUGEPAGE */) == -1) {
	reportError(msg, sizeof(msg));			char Msg[] = "[hugify] failed to allocate large page\n";
				reportError(Msg, sizeof(Msg));
	}			}

	// Copy the hot code back.			// Copy the hot code back.
	memcpy(from, mem, size);			memcpy(From, Mem, SizeHugePageAligned);

	// Change permission back to read-only, ignore failure			// Change permission back to read-only, ignore failure
	__mprotect(from, size, PROT_READ | PROT_EXEC);			__mprotect(FromAlignedPage, ToAlignedPage - FromAlignedPage,
				0x5 /* PROT_READ | PROT_EXEC */);

	__munmap(mem, size);			__munmap(Mem, SizeHugePageAligned);
	}			}
	#endif			#endif
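
The `SizeHugePageAligned` expression above adds `HugePageBytes - (Size & (HugePageBytes - 1))` to `Size`, which yields one full extra huge page when `Size` is already a multiple of 2 MiB. For comparison, here is a sketch of the conventional power-of-two rounding helpers; `alignDown`, `alignUp` and `isPowerOfTwo` are hypothetical names, not functions from the BOLT runtime.

```cpp
#include <cassert>
#include <cstdint>

// 2 MiB, the huge page size used by the hugify runtime above.
constexpr uint64_t HugePageBytes = 2ULL * 1024 * 1024;

constexpr bool isPowerOfTwo(uint64_t V) { return V != 0 && (V & (V - 1)) == 0; }

// Rounds Value down to a multiple of Align (Align must be a power of two).
inline uint64_t alignDown(uint64_t Value, uint64_t Align) {
  assert(isPowerOfTwo(Align));
  return Value & ~(Align - 1);
}

// Rounds Value up to a multiple of Align; an already aligned Value is
// returned unchanged. Assumes Value + Align - 1 does not overflow.
inline uint64_t alignUp(uint64_t Value, uint64_t Align) {
  assert(isPowerOfTwo(Align));
  return (Value + Align - 1) & ~(Align - 1);
}
```

With these helpers, `alignUp(HugePageBytes, HugePageBytes)` stays at one huge page, whereas the expression in the diff would produce two for the same input.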

	extern "C" void __bolt_hugify_self_impl() {			extern "C" void __bolt_hugify_self_impl() {
	#ifdef MADV_HUGEPAGE			uint8_t *HotStart = (uint8_t *)&__hot_start;
	uint8_t *hotStart = (uint8_t *)&__hot_start;			uint8_t *HotEnd = (uint8_t *)&__hot_end;
	uint8_t *hotEnd = (uint8_t *)&__hot_end;
	// Make sure the start and end are aligned with huge page address			// Make sure the start and end are aligned with huge page address
	const size_t hugePageBytes = 2L * 1024 * 1024;			const size_t HugePageBytes = 2L * 1024 * 1024;
				rafauler (Not Done): I'll suggest doing something else in line 97.
	uint8_t *from = hotStart - ((intptr_t)hotStart & (hugePageBytes - 1));			uint8_t *From = HotStart - ((intptr_t)HotStart & (HugePageBytes - 1));
	uint8_t *to = hotEnd + (hugePageBytes - 1);			uint8_t *To = HotEnd + (HugePageBytes - 1);
	to -= (intptr_t)to & (hugePageBytes - 1);			To -= (intptr_t)To & (HugePageBytes - 1);

	#ifdef ENABLE_DEBUG			#ifdef ENABLE_DEBUG
	reportNumber("[hugify] hot start: ", (uint64_t)hotStart, 16);			reportNumber("[hugify] hot start: ", (uint64_t)HotStart, 16);
	reportNumber("[hugify] hot end: ", (uint64_t)hotEnd, 16);			reportNumber("[hugify] hot end: ", (uint64_t)HotEnd, 16);
	reportNumber("[hugify] aligned huge page from: ", (uint64_t)from, 16);			reportNumber("[hugify] aligned huge page from: ", (uint64_t)From, 16);
	reportNumber("[hugify] aligned huge page to: ", (uint64_t)to, 16);			reportNumber("[hugify] aligned huge page to: ", (uint64_t)To, 16);
	#endif			#endif

	if (!has_pagecache_thp_support()) {			if (!hasPagecacheTHPSupport()) {
	hugify_for_old_kernel(from, to);			#ifdef ENABLE_DEBUG
				report("[hugify] workaround with memory alignment for kernel < 5.10\n");
				#endif
				hugifyForOldKernel(HotStart, HotEnd, From, To);
	return;			return;
	}			}

	if (__madvise(from, (to - from), MADV_HUGEPAGE) == -1) {			if (__madvise(From, (To - From), 14 /* MADV_HUGEPAGE */) == -1) {
	char msg[] = "failed to allocate large page\n";			char Msg[] = "[hugify] failed to allocate large page\n";
	// TODO: allow user to control the failure behavior.			// TODO: allow user to control the failure behavior.
	reportError(msg, sizeof(msg));			reportError(Msg, sizeof(Msg));
	}			}
	#endif
	}			}
				rafauler (Not Done): Here pass only From, To.

	/// This is hooking ELF's entry, it needs to save all machine state.			/// This is hooking ELF's entry, it needs to save all machine state.
	extern "C" __attribute((naked)) void __bolt_hugify_self() {			extern "C" __attribute((naked)) void __bolt_hugify_self() {
				#if defined(__x86_64__)
	__asm__ __volatile__(SAVE_ALL			__asm__ __volatile__(SAVE_ALL
	"call __bolt_hugify_self_impl\n"			"call __bolt_hugify_self_impl\n"
	RESTORE_ALL			RESTORE_ALL
	"jmp *__bolt_hugify_init_ptr(%%rip)\n"			"jmp __bolt_hugify_start_program\n"
	:::);			:::);
	}			#else
				exit(1);
	#endif			#endif
				}
	#endif			#endif