This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] Don't align .text to pageAlign
Needs ReviewPublic

Authored by treapster on Nov 8 2022, 1:27 AM.

Details

Summary

Setting alignment of .text to PageAlign results in a wasted space of 2mb(or 4kb with -no-huge-pages) after program header table.
I think it doesn't make sense to align text to such a huge values, and aligning the segment itself is enough.

Diff Detail

Event Timeline

treapster created this revision.Nov 8 2022, 1:27 AM
Herald added a reviewer: Amir. · View Herald Transcript
Herald added a reviewer: maksfb. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: ayermolo. · View Herald Transcript
treapster requested review of this revision.Nov 8 2022, 1:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 8 2022, 1:27 AM
yota9 added a comment.Nov 8 2022, 1:32 AM

The virtual address % page alignment must be equal to file offset % page alignment, does this ratio remains true without this code?

The virtual address % page alignment must be equal to file offset % page alignment, does this ratio remains true without this code?

Of course it remains the same, the only difference is that .text comes right after program header.

yota9 added a comment.Nov 8 2022, 2:38 AM

The new text segment has newly allocated virtual address, offset and probably changed page alignment, I don't see how it remains the same in case the VA become aligned and the offset is don't.

treapster added a comment.EditedNov 8 2022, 2:44 AM

Before patch

PHDR           0x200000 0x0000000000200000 0x0000000000200000 0x0002a0 0x0002a0 R   0x8
LOAD           0x200000 0x0000000000200000 0x0000000000200000 0x200404 0x200404 R E 0x200000
...
[26] .text             PROGBITS        0000000000400000 400000 0001b1 00  AX  0   0 2097152

After

PHDR           0x200000 0x0000000000200000 0x0000000000200000 0x0002a0 0x0002a0 R   0x8
LOAD           0x200000 0x0000000000200000 0x0000000000200000 0x000744 0x000744 R E 0x200000
...
[26] .text             PROGBITS        0000000000200340 200340 0001b1 00  AX  0   0 64

Offset is aligned with address, we just remove empty space both in file and in memory

Btw i can't reproduce hugify failure locally

yota9 added a comment.Nov 8 2022, 2:51 AM

In both cases we are loading from the same file offset the same amount of memory. Could you please give full sections headers before and after so I could compare them?

treapster added a comment.EditedNov 8 2022, 2:57 AM

In both cases we are loading from the same file offset the same amount of memory. Could you please give full sections headers before and after so I could compare them?

Oops, LOAD segment in "After" was wrong. Edited.

Full readelf before:

There are 34 section headers, starting at offset 0x400f40:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        00000000000002a8 0002a8 00001c 00   A  0   0  1
  [ 2] .note.ABI-tag     NOTE            00000000000002c4 0002c4 000020 00   A  0   0  4
  [ 3] .gnu.hash         GNU_HASH        00000000000002e8 0002e8 000024 00   A  4   0  8
  [ 4] .dynsym           DYNSYM          0000000000000310 000310 0000a8 18   A  5   1  8
  [ 5] .dynstr           STRTAB          00000000000003b8 0003b8 000084 00   A  0   0  1
  [ 6] .gnu.version      VERSYM          000000000000043c 00043c 00000e 02   A  4   0  2
  [ 7] .gnu.version_r    VERNEED         0000000000000450 000450 000020 00   A  5   1  8
  [ 8] .rela.dyn         RELA            0000000000000470 000470 0000c0 18   A  4   0  8
  [ 9] .rela.plt         RELA            0000000000000530 000530 000018 18  AI  4  22  8
  [10] .init             PROGBITS        0000000000001000 001000 00001b 00  AX  0   0  4
  [11] .plt              PROGBITS        0000000000001020 001020 000020 10  AX  0   0 16
  [12] .plt.got          PROGBITS        0000000000001040 001040 000008 08  AX  0   0  8
  [13] .bolt.org.text    PROGBITS        0000000000001050 001050 0001a5 00  AX  0   0 16
  [14] .fini             PROGBITS        00000000000011f8 0011f8 00000d 00  AX  0   0  4
  [15] .rodata           PROGBITS        0000000000002000 002000 000012 00   A  0   0  4
  [16] .bolt.org.eh_frame_hdr PROGBITS        0000000000002014 002014 000044 00   A  0   0  4
  [17] .bolt.org.eh_frame PROGBITS        0000000000002058 002058 000110 00   A  0   0  8
  [18] .init_array       INIT_ARRAY      0000000000003de8 002de8 000008 08  WA  0   0  8
  [19] .fini_array       FINI_ARRAY      0000000000003df0 002df0 000008 08  WA  0   0  8
  [20] .dynamic          DYNAMIC         0000000000003df8 002df8 0001e0 10  WA  5   0  8
  [21] .got              PROGBITS        0000000000003fd8 002fd8 000028 08  WA  0   0  8
  [22] .got.plt          PROGBITS        0000000000004000 003000 000020 08  WA  0   0  8
  [23] .data             PROGBITS        0000000000004020 003020 000010 00  WA  0   0  8
  [24] .tm_clone_table   PROGBITS        0000000000004030 003030 000000 00  WA  0   0  8
  [25] .bss              NOBITS          0000000000004030 003030 000008 00  WA  0   0  1
  [26] .text             PROGBITS        0000000000400000 400000 0001b1 00  AX  0   0 2097152
  [27] .eh_frame         PROGBITS        00000000004001b8 4001b8 0001e0 00   A  0   0  8
  [28] .eh_frame_hdr     PROGBITS        0000000000400398 400398 00006c 00   A  0   0  1
  [29] .comment          PROGBITS        0000000000000000 400404 0000a2 01  MS  0   0  1
  [30] .symtab           SYMTAB          0000000000000000 4004a8 000648 18     31  47  8
  [31] .strtab           STRTAB          0000000000000000 400af0 000220 00      0   0  1
  [32] .shstrtab         STRTAB          0000000000000000 400d10 00015d 00      0   0  1
  [33] .note.bolt_info   NOTE            0000000000000000 400e6d 0000a8 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

Elf file type is DYN (Shared object file)
Entry point 0x40001c
There are 12 program headers, starting at offset 2097152

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x200000 0x0000000000200000 0x0000000000200000 0x0002a0 0x0002a0 R   0x8
  INTERP         0x0002a8 0x00000000000002a8 0x00000000000002a8 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000548 0x000548 R   0x1000
  LOAD           0x001000 0x0000000000001000 0x0000000000001000 0x000205 0x000205 R E 0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x000168 0x000168 R   0x1000
  LOAD           0x002de8 0x0000000000003de8 0x0000000000003de8 0x000248 0x000250 RW  0x1000
  LOAD           0x200000 0x0000000000200000 0x0000000000200000 0x200404 0x200404 R E 0x200000
  DYNAMIC        0x002df8 0x0000000000003df8 0x0000000000003df8 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x0002c4 0x00000000000002c4 0x00000000000002c4 0x000020 0x000020 R   0x4
  GNU_EH_FRAME   0x400398 0x0000000000400398 0x0000000000400398 0x00006c 0x00006c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x002de8 0x0000000000003de8 0x0000000000003de8 0x000218 0x000218 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt 
   03     .init .plt .plt.got .bolt.org.text .fini 
   04     .rodata .bolt.org.eh_frame_hdr .bolt.org.eh_frame 
   05     .init_array .fini_array .dynamic .got .got.plt .data .bss 
   06     .text .eh_frame .eh_frame_hdr 
   07     .dynamic 
   08     .note.ABI-tag 
   09     .eh_frame_hdr 
   10     
   11     .init_array .fini_array .dynamic .got

Full readelf after:

There are 34 section headers, starting at offset 0x201280:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        00000000000002a8 0002a8 00001c 00   A  0   0  1
  [ 2] .note.ABI-tag     NOTE            00000000000002c4 0002c4 000020 00   A  0   0  4
  [ 3] .gnu.hash         GNU_HASH        00000000000002e8 0002e8 000024 00   A  4   0  8
  [ 4] .dynsym           DYNSYM          0000000000000310 000310 0000a8 18   A  5   1  8
  [ 5] .dynstr           STRTAB          00000000000003b8 0003b8 000084 00   A  0   0  1
  [ 6] .gnu.version      VERSYM          000000000000043c 00043c 00000e 02   A  4   0  2
  [ 7] .gnu.version_r    VERNEED         0000000000000450 000450 000020 00   A  5   1  8
  [ 8] .rela.dyn         RELA            0000000000000470 000470 0000c0 18   A  4   0  8
  [ 9] .rela.plt         RELA            0000000000000530 000530 000018 18  AI  4  22  8
  [10] .init             PROGBITS        0000000000001000 001000 00001b 00  AX  0   0  4
  [11] .plt              PROGBITS        0000000000001020 001020 000020 10  AX  0   0 16
  [12] .plt.got          PROGBITS        0000000000001040 001040 000008 08  AX  0   0  8
  [13] .bolt.org.text    PROGBITS        0000000000001050 001050 0001a5 00  AX  0   0 16
  [14] .fini             PROGBITS        00000000000011f8 0011f8 00000d 00  AX  0   0  4
  [15] .rodata           PROGBITS        0000000000002000 002000 000012 00   A  0   0  4
  [16] .bolt.org.eh_frame_hdr PROGBITS        0000000000002014 002014 000044 00   A  0   0  4
  [17] .bolt.org.eh_frame PROGBITS        0000000000002058 002058 000110 00   A  0   0  8
  [18] .init_array       INIT_ARRAY      0000000000003de8 002de8 000008 08  WA  0   0  8
  [19] .fini_array       FINI_ARRAY      0000000000003df0 002df0 000008 08  WA  0   0  8
  [20] .dynamic          DYNAMIC         0000000000003df8 002df8 0001e0 10  WA  5   0  8
  [21] .got              PROGBITS        0000000000003fd8 002fd8 000028 08  WA  0   0  8
  [22] .got.plt          PROGBITS        0000000000004000 003000 000020 08  WA  0   0  8
  [23] .data             PROGBITS        0000000000004020 003020 000010 00  WA  0   0  8
  [24] .tm_clone_table   PROGBITS        0000000000004030 003030 000000 00  WA  0   0  8
  [25] .bss              NOBITS          0000000000004030 003030 000008 00  WA  0   0  1
  [26] .text             PROGBITS        0000000000200340 200340 0001b1 00  AX  0   0 64
  [27] .eh_frame         PROGBITS        00000000002004f8 2004f8 0001e0 00   A  0   0  8
  [28] .eh_frame_hdr     PROGBITS        00000000002006d8 2006d8 00006c 00   A  0   0  1
  [29] .comment          PROGBITS        0000000000000000 200744 0000a2 01  MS  0   0  1
  [30] .symtab           SYMTAB          0000000000000000 2007e8 000648 18     31  47  8
  [31] .strtab           STRTAB          0000000000000000 200e30 000220 00      0   0  1
  [32] .shstrtab         STRTAB          0000000000000000 201050 00015d 00      0   0  1
  [33] .note.bolt_info   NOTE            0000000000000000 2011ad 0000a8 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

Elf file type is DYN (Shared object file)
Entry point 0x20035c
There are 12 program headers, starting at offset 2097152

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x200000 0x0000000000200000 0x0000000000200000 0x0002a0 0x0002a0 R   0x8
  INTERP         0x0002a8 0x00000000000002a8 0x00000000000002a8 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000548 0x000548 R   0x1000
  LOAD           0x001000 0x0000000000001000 0x0000000000001000 0x000205 0x000205 R E 0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x000168 0x000168 R   0x1000
  LOAD           0x002de8 0x0000000000003de8 0x0000000000003de8 0x000248 0x000250 RW  0x1000
  LOAD           0x200000 0x0000000000200000 0x0000000000200000 0x000744 0x000744 R E 0x200000
  DYNAMIC        0x002df8 0x0000000000003df8 0x0000000000003df8 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x0002c4 0x00000000000002c4 0x00000000000002c4 0x000020 0x000020 R   0x4
  GNU_EH_FRAME   0x2006d8 0x00000000002006d8 0x00000000002006d8 0x00006c 0x00006c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x002de8 0x0000000000003de8 0x0000000000003de8 0x000218 0x000218 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt 
   03     .init .plt .plt.got .bolt.org.text .fini 
   04     .rodata .bolt.org.eh_frame_hdr .bolt.org.eh_frame 
   05     .init_array .fini_array .dynamic .got .got.plt .data .bss 
   06     .text .eh_frame .eh_frame_hdr 
   07     .dynamic 
   08     .note.ABI-tag 
   09     .eh_frame_hdr 
   10     
   11     .init_array .fini_array .dynamic .got

As you see we make LOAD segment smaller by eliminating unnecessary padding, and we don't touch neither segment address nor offset/alignment

yota9 added a comment.Nov 8 2022, 3:38 AM

After editing the message it makes sense with different segments sizes before/after.
The thing is that the text is still padded here to 0x200000 which is good, the question is why it was double padded to 400000 before. It seems that it is done automatically by the llvm emitter, that is why the manually padding here is redundant.

If you mean text segment, yes it is properly aligned. As to why .text section was aligned additionally - well, it was indeed BinaryEmitter, but not automatically, it just used alignment which was passed to it. Now we stop passing huge alignment and all the padding is gone.

yota9 added a comment.Nov 8 2022, 3:58 AM

By automatically I meant that it is job for emitter. As I see the padding is still exists - between 003030 and 200340 file offset there is no data stored. But it is OK, the LOAD segment is still properly aligned for the huge pages.

treapster updated this revision to Diff 473971.Nov 8 2022, 5:57 AM

Turns out we cannot remove padding with -hugify, so i preserve it in this case and hope all tests will pass.

yota9 added a comment.Nov 8 2022, 5:59 AM

Could you tell more what is the problem with hugify in this case? Huge pages might be used without hugify, would it affect the usage?

Hugify requires empty space before and after .text to properly remap it's contents. In abscence of -hugify it doesn't make a difference because dynamic loader operates on segments and i didn't change segment address or alignment.

The issue can also be solved if we force -hugify to remap whole segment in the same fashion as dynamic loader, but it's not trivial and we don't care that much

You can also refer to original hugify discussion to appreciate how messed up is that

yota9 added a comment.Nov 8 2022, 6:25 AM

I see, thanks. Overall as I said this (double) alignment indeed seems to be excess. Let's wait what meta team members would tell us about that.

To me, it appears there are two outstanding issues that we need to solve. One is that we use large pages by default. Without hugify/hottext it doesn't make sense unless dynamic loader supports large pages. The second issue is that we align .text regardless of how many bytes it takes, leading sometimes to ~2MB "waste". Just using 64-byte alignment and relying on the containing segment being aligned to the page size will not always work. One example of the latter is "-use-old-text". I think using the MaxAlignBytes might be better suited to address this issue. What do you think?

To me, it appears there are two outstanding issues that we need to solve. One is that we use large pages by default. Without hugify/hottext it doesn't make sense unless dynamic loader supports large pages.

My understanding is that newer kernels/loaders support large pages and can properly map them without the use of -hugify, so it makes sense to use large pages by default. I wouldn't call it an issue, if you don't like it you can always use -no-huge-pages.

Just using 64-byte alignment and relying on the containing segment being aligned to the page size will not always work. One example of the latter is "-use-old-text".

Why it will not work with -use-old-text? The whole point of this option is to put new text in place of old, and by trying to align it to 2 MB you make it unachievable unless the binary was prelinked with huge padding. If you still want 2MB aligned text, you can pass align-text=0x200000. I think this patch just makes defaults more reasonable, otherwise you can always align however you want. And i also don't understand why MaxAlignBytes might be better

To me, it appears there are two outstanding issues that we need to solve. One is that we use large pages by default. Without hugify/hottext it doesn't make sense unless dynamic loader supports large pages.

My understanding is that newer kernels/loaders support large pages and can properly map them without the use of -hugify, so it makes sense to use large pages by default. I wouldn't call it an issue, if you don't like it you can always use -no-huge-pages.

Could you point to the kernel patch/version that enables it? It will be good to test the functionality.

Just using 64-byte alignment and relying on the containing segment being aligned to the page size will not always work. One example of the latter is "-use-old-text".

Why it will not work with -use-old-text? The whole point of this option is to put new text in place of old, and by trying to align it to 2 MB you make it unachievable unless the binary was prelinked with huge padding.

What exactly is "unachievable"? For one of our apps, we use ICF to save ~20MB of code and the new code fits into the original space with 2MB alignment.

I think this patch just makes defaults more reasonable, otherwise you can always align however you want.

From the performance perspective, the default alignment of 64 bytes for .text does not sound reasonable to me . It should be page-aligned.

And i also don't understand why MaxAlignBytes might be better

It has an advantage to explicitly express the intention of not using more than MaxAlignBytes for alignment. The current patch has the assumption of having "small" program header before .text in a page-aligned segment which makes it fragile.

To me, it appears there are two outstanding issues that we need to solve. One is that we use large pages by default. Without hugify/hottext it doesn't make sense unless dynamic loader supports large pages.

My understanding is that newer kernels/loaders support large pages and can properly map them without the use of -hugify, so it makes sense to use large pages by default. I wouldn't call it an issue, if you don't like it you can always use -no-huge-pages.

Could you point to the kernel patch/version that enables it? It will be good to test the functionality.

I researched a bit and can say the following things:

  • I assumed the problem with hugify is just that older ld-linux or kernels ignored 0x200000 alignment of the segment, but turned out it is just part of the problem. It seems like it's caused by the loader, and with glibc 2.28(4.18.0 EulerOS) aligment is ignored, while on newer Ubuntu 5.10.0-1013-oem with glibc 2.31 addresses are assigned properly.
  • I was wrong about out-of-box HugePage support, because aside from big alignment HugePages must be anonymous. Which means we need to use -hugify to insert code which will remap code section as anonymous and call madvise to ensure huge page. It also means that current implementation of -hugify is useless on newer kernels because if addresses are properly aligned, it does not create anonymous mapping and calls madvise right away, which does not return huge page(it can be checked by cat /proc/PID/smaps | grep HugePage, there will be all zeroes).

Just using 64-byte alignment and relying on the containing segment being aligned to the page size will not always work. One example of the latter is "-use-old-text".

Why it will not work with -use-old-text? The whole point of this option is to put new text in place of old, and by trying to align it to 2 MB you make it unachievable unless the binary was prelinked with huge padding.

What exactly is "unachievable"? For one of our apps, we use ICF to save ~20MB of code and the new code fits into the original space with 2MB alignment.

Unachievable was wrong word, but on binaries i've seen hot + cold is usually a bit bigger than original .text even without alignment, so adding padding reduces chances to fit even more. 20MB saved by ICF must be a huge binary and relatively rare case. Anyway, it's not the matter here.

I think this patch just makes defaults more reasonable, otherwise you can always align however you want.

From the performance perspective, the default alignment of 64 bytes for .text does not sound reasonable to me . It should be page-aligned.

Aligning to regular page align seems reasonable in case of use-old-text(if we accept that for most mid-sized binaries it's not possible), even though i'm not sure it significantly affects performance without huge pages, but if we create new segment, 64 bytes seems fine, isn't it?

And i also don't understand why MaxAlignBytes might be better

It has an advantage to explicitly express the intention of not using more than MaxAlignBytes for alignment. The current patch has the assumption of having "small" program header before .text in a page-aligned segment which makes it fragile.

But we care more about alignment value than padding size. I think we're better off removing the assumption by explicitly handling use-old-text, instead of introducing new option.

It also means that current implementation of -hugify is useless on newer kernels because if addresses are properly aligned, it does not create anonymous mapping and calls madvise right away, which does not return huge page(it can be checked by cat /proc/PID/smaps | grep HugePage, there will be all zeroes).

Are you sure? They might be marked differently in /proc/pid/smaps, e.g. FilePmdMapped.

Aligning to regular page align seems reasonable in case of use-old-text(if we accept that for most mid-sized binaries it's not possible), even though i'm not sure it significantly affects performance without huge pages,

I don't have any performance data for regular-size pages. It will be good to collect some, e.g. for clang. If there's indeed no difference, we can do page-size alignment only for huge pages.

but if we create new segment, 64 bytes seems fine, isn't it?

Is it b/c we assume there's nothing but a PHDR table before the code in the new segment?

But we care more about alignment value than padding size.

Should we always align and disregarding the padding then (i.e. keep things as they are)? Sorry, if I missed your point.

It also means that current implementation of -hugify is useless on newer kernels because if addresses are properly aligned, it does not create anonymous mapping and calls madvise right away, which does not return huge page(it can be checked by cat /proc/PID/smaps | grep HugePage, there will be all zeroes).

Are you sure? They might be marked differently in /proc/pid/smaps, e.g. FilePmdMapped.

Grepping for FilePmdMapped also doesn't show any non-zero values, so i'm pretty sure. And the doc for HugePages and Madvise says clearly that HugePages only work for anonymous mappings:
From here:

Currently, Transparent Huge Pages work only with private anonymous pages (see mmap(2)).

From here:

Currently THP only works for anonymous memory mappings and tmpfs/shmem. But in the future it can expand to other filesystems.

And here is what happens in the kernel:
call to khugepaged_enter_vma
call to hugepage_vma_check
abort if not anonymous

treapster added a comment.EditedDec 7 2022, 5:10 AM

I don't have any performance data for regular-size pages. It will be good to collect some, e.g. for clang. If there's indeed no difference, we can do page-size alignment only for huge pages.

I'm not sure how to properly collect such data without polluting it with noise from OS and build system. Simply timing builds will produce a very diverse set of values without obvious correlation with the binary used.

Is it b/c we assume there's nothing but a PHDR table before the code in the new segment?

Well, if we create new segment it's guaranteed there's nothing but a PHDR table before the code because we call mapCodeSections before anything else.

Should we always align and disregarding the padding then (i.e. keep things as they are)? Sorry, if I missed your point.

My point is to align only when it makes sense, and aligning after PHDR does not. With use-old-text it's not so clear, so we can align in that case if you feel like it, but at least we shouldn't align after PHDR. And also it doesn't make sense to align to 2 MB in absence of hugify(because we won't get huge pages anyway), so i think we should act as if -no-huge-pages is true by default. What do you think?

And if we don't get Huge Pages without -hugify, why would we keep -no-huge-pages at all?

Thank you for the investigation on THPs and hugify. Indeed, it didn't work for me either. Putting munlock(2) before madvise(2) helps though. Could you please check D140425 on your end?

I don't have any performance data for regular-size pages. It will be good to collect some, e.g. for clang. If there's indeed no difference, we can do page-size alignment only for huge pages.

I'm not sure how to properly collect such data without polluting it with noise from OS and build system. Simply timing builds will produce a very diverse set of values without obvious correlation with the binary used.

If you don't have a benchmark system setup, it's fine. I'll see if I can run something quickly.

Is it b/c we assume there's nothing but a PHDR table before the code in the new segment?

Well, if we create new segment it's guaranteed there's nothing but a PHDR table before the code because we call mapCodeSections before anything else.

At the moment - yes. So this is the assumption there.

Should we always align and disregarding the padding then (i.e. keep things as they are)? Sorry, if I missed your point.

My point is to align only when it makes sense, and aligning after PHDR does not.

How do you define what makes sense?

With use-old-text it's not so clear, so we can align in that case if you feel like it, but at least we shouldn't align after PHDR.

Why? Because PHDR is "small"?

And also it doesn't make sense to align to 2 MB in absence of hugify(because we won't get huge pages anyway), so i think we should act as if -no-huge-pages is true by default. What do you think?

I agree that it doesn't make sense to align at 2MB in the absence of hugify *and* --hot-text.

And if we don't get Huge Pages without -hugify, why would we keep -no-huge-pages at all?

Internally, we use huge pages all the time, and at the time it was a sensible default choice. I agree that there's no need for 2MB alignment when huge pages are not used.

To be clear, when BOLT sees huge page symbols __hot_{start|end} on the input, it needs to use 2MB for code alignment. That's essentially the --hot-text option.

treapster added a comment.EditedDec 21 2022, 2:26 AM

Thank you for the investigation on THPs and hugify. Indeed, it didn't work for me either. Putting munlock(2) before madvise(2) helps though. Could you please check D140425 on your end?

I applied the patch, and BOLTing helloworld, launching it from GDB and running cat /proc/PID/smaps | grep Huge or grep FilePmdMapped still doesn't show any non-zero values. I tried with /sys/kernel/mm/transparent_hugepage/enabled set to always and to madvise. I also tried processing redis-server, launching it and looking at it's mappings, and in that case i was able to find 6MB of HugePages, but it was mapped somewhere at 7ffff7400000-7ffff7c00000 which is definetely not .text section and looks like some internal anonymous mapping, and i am able to find it even if processing without the patch. If, however, i modify __bolt_hugify_self_impl to always call hugifyForOldKernel, i'm able to see HugePages for .text mapping both in helloworld and redis. All of the above was done on 5.10.0-1013-oem. But i also found that in kernel sources i linked above i missed a call to file_thp_enabled before check for anonymity. For this call to succeed, READ_ONLY_THP_FOR_FS config option must be enabled, and in our setup it is not set, so it's likely the cause of the issue. Can you check if it's enabled on your system?

How do you define what makes sense?

Well, something makes sense when benefits outweigh costs. The benefits of aligning after PHDR are not clear because PHDR occupies a small fraction of a page, especially in the context of huge pages, so it's unlikely to significantly increase TLB misses or otherwise affect performance. However, a benchmark would be interesting. While the costs are 2MB in padding space + in case of AArch64, more stubs inserted during longJmp. Processing clang without any flags renders 347691 stubs, with -no-huge-pages - 327125, and difference in .text size is 176 kb. Not that much, but it think it's reasonable to keep sections together unless there is a performance penalty. And i guess those 20k additional stubs also have some performance implications.

With use-old-text it's not so clear, so we can align in that case if you feel like it, but at least we shouldn't align after PHDR.

Why? Because PHDR is "small"?

Yes + the above

And also it doesn't make sense to align to 2 MB in absence of hugify(because we won't get huge pages anyway), so i think we should act as if -no-huge-pages is true by default. What do you think?

I agree that it doesn't make sense to align at 2MB in the absence of hugify *and* --hot-text.

I'll update the patch to set pageAlign to regular page size unless there's hugify or hot-text.

treapster updated this revision to Diff 484515.Dec 21 2022, 3:23 AM

Use Huge pages only in presence of hugify or hot-text, don't align text after PHDR

For this call to succeed, READ_ONLY_THP_FOR_FS config option must be enabled, and in our setup it is not set, so it's likely the cause of the issue. Can you check if it's enabled on your system?

Yes, you are correct - I have READ_ONLY_THP_FOR_FS set on my system.

How do you define what makes sense?

Well, something makes sense when benefits outweigh costs. The benefits of aligning after PHDR are not clear because PHDR occupies a small fraction of a page, especially in the context of huge pages, so it's unlikely to significantly increase TLB misses or otherwise affect performance. However, a benchmark would be interesting. While the costs are 2MB in padding space + in case of AArch64, more stubs inserted during longJmp. Processing clang without any flags renders 347691 stubs, with -no-huge-pages - 327125, and difference in .text size is 176 kb. Not that much, but it think it's reasonable to keep sections together unless there is a performance penalty. And i guess those 20k additional stubs also have some performance implications.

Now we just have to find more or less a reasonable strict definition for the alignment threshold. Some refactoring is required, but once we can pass MaxBytesToEmit to MCStreamer::emitCodeAlignment() for .text, it will solve the problem while also being future-proof. I believe it will also be helpful for the work you were planning to do on the complete binary rewrite.

Now we just have to find more or less a reasonable strict definition for the alignment threshold. Some refactoring is required, but once we can pass MaxBytesToEmit to MCStreamer::emitCodeAlignment() for .text, it will solve the problem while also being future-proof. I believe it will also be helpful for the work you were planning to do on the complete binary rewrite.

I don't fully understand why do you want to use emitCodeAlignment in the first place. From what i can see it's intended to align functions, and for sections we have Section->setAlignment() which is used currently with AlignText. Do you mean to declare some small "official" alignment with setAlignment(which will end up in the section header) and then align additionaly with emitCodeAlignment and MaxBytesToEmit before the first function? Why bother that much and what is wrong with current version of the patch?