This is an archive of the discontinued LLVM Phabricator instance.

[ELF][AArch64] Allow PT_LOAD to have overlapping p_offset ranges
ClosedPublic

Authored by MaskRay on Jul 18 2019, 9:27 AM.

Details

Summary

Ported the D64906 technique to AArch64. It deletes 3 alignments at
PT_LOAD boundaries for the default case: the size of an aarch64 binary
can be decreased by at most 192kb.

If sh_addralign(.tdata) < sh_addralign(.tbss),
we can potentially make p_vaddr(PT_TLS)%p_align(PT_TLS) != 0.

ld.so that are known to have problems if p_vaddr%p_align!=0:

  • musl<=1.1.22
  • FreeBSD 13.0-CURRENT (and before) rtld-elf arm64

New test aarch64-tls-vaddr-align.s checks p_vaddr%p_align = 0.

Event Timeline

MaskRay created this revision.Jul 18 2019, 9:27 AM
MaskRay updated this revision to Diff 210749.Jul 19 2019, 12:14 AM

Update more tests. WIP

MaskRay updated this revision to Diff 210793.Jul 19 2019, 3:28 AM
MaskRay retitled this revision from [ELF][AArch64] Allow overlapping PT_LOAD (keep p_offset ranges adjacent) to decrease file size to [ELF][AArch64] Allow PT_LOAD to have overlapping p_offset ranges.

Update title

MaskRay updated this revision to Diff 210852.Jul 19 2019, 9:13 AM
MaskRay edited the summary of this revision. (Show Details)

Updated all aarch64 tests.

MaskRay updated this revision to Diff 210958.Jul 20 2019, 5:37 AM
MaskRay edited the summary of this revision. (Show Details)
MaskRay removed a subscriber: srhines.

Add aarch64-tls-vaddr-align.s

MaskRay updated this revision to Diff 210965.Jul 20 2019, 10:08 AM
MaskRay edited the summary of this revision. (Show Details)

Fix TP offset computation if p_vaddr%p_align!=0

Apologies, missed that this was a separate review from D64906 will check through in detail tomorrow.

Will be worth mentioning that this is dependent on D64906 and D64903, I could guess at D64903 but basic-aarch64.s didn't apply cleanly without D64903. I successfully did a 2 stage bootstrap of clang with ninja check-all on a native aarch64 machine and made a quick test to check the TP relative relocation. I struggled a bit to understand the description and the comments though. In particular the use of GAP_ABOVE_TP instead of TCB_SIZE, it also wasn't clear where the alignment calculation had come from. I've made a few suggestions.

ELF/InputSection.cpp
627–628

The comment below is just an abbreviation of the code. Given that we have a more complex calculation to do than ld.bfd or ld.gold (IIUC they force the alignment of the first TLS section to p_align which means that the p_offset will be 0 mod p_align), I think we need a bit more explanation. Something like:

At run time TP will be aligned to p_align, followed by the TCB which is the size of 2 pointers, followed by alignment padding, then the TLS blocks starting with the local block. In the ELF file the p_vaddr will be congruent to p_offset modulo p_align we therefore add additional padding so that (TP + TCB_SIZE + padding) is congruent to p_vaddr modulo p_align.

I also needed to reach for my operator precedence table more than once when looking at the expression. It may be worth splitting it up a bit, for example:

offsetIntoBlock = s.getVA(0)
tcbSize = config->wordsize * 2 
padSize = (tls->p_vaddr - tcbSize) & (tls->p_align - 1)
return tcbSize + padSize + offsetIntoBlock;
test/ELF/aarch64-tls-vaddr-align.s
16

Personally I find GAP_ABOVE_TP confusing. I think TCB_SIZE would be more useful here for example:

## At run time TP will point to a structure aligned on a p_align boundary:
## | TCB_SIZE | padding | TLS block 0 | ... |
##
## a@tprel = TCB_SIZE + padding + st_value(a)
## TCB_SIZE = 16 (2 pointers)
## padding is (p_vaddr - TCB_SIZE) & (p_align - 1)
## st_value(a) is the offset into TLS block 0
## 0x220200-0x2201cc + 16 + (0x2201cc-16 & 0x100-1) = 0x100
MaskRay added a reviewer: nsz.Jul 23 2019, 8:03 PM
MaskRay edited the summary of this revision. (Show Details)
MaskRay updated this revision to Diff 211407.Jul 23 2019, 8:42 PM

Added more comments to D64906 based on peter.smith's suggestion.

nsz added a comment.Jul 24 2019, 2:12 AM

Will be worth mentioning that this is dependent on D64906 and D64903, I could guess at D64903 but basic-aarch64.s didn't apply cleanly without D64903. I successfully did a 2 stage bootstrap of clang with ninja check-all on a native aarch64 machine and made a quick test to check the TP relative relocation. I struggled a bit to understand the description and the comments though. In particular the use of GAP_ABOVE_TP instead of TCB_SIZE, it also wasn't clear where the alignment calculation had come from. I've made a few suggestions.

it seems the description is based on the code in musl.

i think the "TCB" wording is a mistake in the original IA-64 sysv abi (and Drepper's tls paper based on it) which documents implementation internals (16bytes "thread control block" after the tp), that is not abi and thus inappropriate in an abi document. the tls abi is just 16 bytes gap reserved after tp for implementation internal use (e.g. in musl it is unused).

i think using "TCB_SIZE" is acceptable in a linker given the tls abi documentation, but in musl that would be confusing since the tcb is elsewhere. (i think tcb is confusing in glibc too since tcb refers to the the entire pthread struct as well which is below the tp, so with tls variant 1 there are now two different things called tcb).

In D64930#1598672, @nsz wrote:

Will be worth mentioning that this is dependent on D64906 and D64903, I could guess at D64903 but basic-aarch64.s didn't apply cleanly without D64903. I successfully did a 2 stage bootstrap of clang with ninja check-all on a native aarch64 machine and made a quick test to check the TP relative relocation. I struggled a bit to understand the description and the comments though. In particular the use of GAP_ABOVE_TP instead of TCB_SIZE, it also wasn't clear where the alignment calculation had come from. I've made a few suggestions.

it seems the description is based on the code in musl.

Yes, and that's why you are on the reviewers' list :) Thanks for clarification.

i think the "TCB" wording is a mistake in the original IA-64 sysv abi (and Drepper's tls paper based on it) which documents implementation internals (16bytes "thread control block" after the tp), that is not abi and thus inappropriate in an abi document. the tls abi is just 16 bytes gap reserved after tp for implementation internal use (e.g. in musl it is unused).
i think using "TCB_SIZE" is acceptable in a linker given the tls abi documentation, but in musl that would be confusing since the tcb is elsewhere. (i think tcb is confusing in glibc too since tcb refers to the the entire pthread struct as well which is below the tp, so with tls variant 1 there are now two different things called tcb).

The "TCB" term made me confused enough so I didn't use it in the comment around ELF/InputSection.cpp:611 in D64906. I just call it "gap".

My use of TCB comes from the Arm ABI description of TLS http://infocenter.arm.com/help/topic/com.arm.doc.ihi0045e/IHI0045E_ABI_addenda.pdf (See diagram in 3.3 Linux for ARM TLS addressing). I'm guessing that TCB was derived from the original Drepper documentation. If it isn't universal across all TLS implementations we might be better off with something like "Reserved". For me as someone not familiar with MUSL I interpret GAP_ABOVE_TP as the gap between TP and the first TLS block (including the alignment padding and not distinct from it).

I'm not sure about breaking the (p_vaddr % p_align == 0) invariant. ld.bfd, gold, and (for the most part) lld currently provide this invariant for TLS segments in output files.

Supporting n-mod-m alignment isn't free for dynamic TLS -- allocators typically have [posix_]memalign and free APIs, but I'm not aware of APIs that provide n-mod-m alignment. Supporting it may add extra overhead in libc (e.g. in the DTV) to accommodate the misalignment so libc can free the memory later. The n-mod-m property could conceivably leak into an API used for sanitizers (e.g. https://sourceware.org/glibc/wiki/ThreadPropertiesAPI, the proposed __libc_create_dynamic_tls API).

I think there's an interoperability/compatibility issue. Like current Bionic, I don't think the BSDs pay attention to the (p_vaddr % p_align) value, for static or dynamic TLS. glibc uses that value for static TLS segments, but not for dynamic TLS[1]. (i.e. With this change, I believe lld can generate DSOs where some .tbss variables are misaligned when dlopen'ed with glibc.)

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=24606

I'm not sure about breaking the (p_vaddr % p_align == 0) invariant. ld.bfd, gold, and (for the most part) lld currently provide this invariant for TLS segments in output files.

Supporting n-mod-m alignment isn't free for dynamic TLS -- allocators typically have [posix_]memalign and free APIs, but I'm not aware of APIs that provide n-mod-m alignment. Supporting it may add extra overhead in libc (e.g. in the DTV) to accommodate the misalignment so libc can free the memory later. The n-mod-m property could conceivably leak into an API used for sanitizers (e.g. https://sourceware.org/glibc/wiki/ThreadPropertiesAPI, the proposed __libc_create_dynamic_tls API).

I think there's an interoperability/compatibility issue. Like current Bionic, I don't think the BSDs pay attention to the (p_vaddr % p_align) value, for static or dynamic TLS. glibc uses that value for static TLS segments, but not for dynamic TLS[1]. (i.e. With this change, I believe lld can generate DSOs where some .tbss variables are misaligned when dlopen'ed with glibc.)

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=24606

I had a quick look at the PR to see if I could make it fail (with D64930 applied) with glibc without hacking the binary with the trim-pt-tls program with this change. I wasn't able to do so although that doesn't mean it is impossible. Having said that I agree that it would be safer to be conservative in our demands on the capabilities of dynamic loaders. As I understand it ld.bfd and ld.gold both set the alignment of the first TLS section to the alignment of PT_TLS (max of any TLS section progbits or nobits).

I had a quick look at the PR to see if I could make it fail (with D64930 applied) with glibc without hacking the binary with the trim-pt-tls program with this change. I wasn't able to do so although that doesn't mean it is impossible. Having said that I agree that it would be safer to be conservative in our demands on the capabilities of dynamic loaders.

Here's an demonstration of the problem on glibc. (I applied D64903, D64906, and D64930.) https://gist.github.com/rprichard/e5ad0cfa10f6b75925ded3709117500f

In the shared object, declare these variables:

__thread char tlsvar1[1] = { 1 };
__thread char tlsvar2[0x100] __attribute__((aligned(0x100)));
$ ./build.sh && readelf -lW libtls.so
...
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
...
  TLS            0x000780 0x0000000000020780 0x0000000000020780 0x000001 0x000180 R   0x100

Output on Ubuntu 16.04 AArch64:

&tlsvar1 = 0x1a63c700
&tlsvar2 = 0x1a63c780

As I understand it ld.bfd and ld.gold both set the alignment of the first TLS section to the alignment of PT_TLS (max of any TLS section progbits or nobits).

That appears to be correct. ld.bfd increases the first TLS section alignment to the maximum in _bfd_elf_tls_setup. ld.gold leaves the sh_addralign field itself alone but still aligns the p_vaddr of the output TLS segment in Output_segment::set_section_list_addresses:

// Give the first TLS section the alignment of the
// entire TLS segment.  Otherwise the TLS segment as a
// whole may be misaligned.

https://git.linaro.org/toolchain/binutils-gdb.git/commit/?id=96a2b4e4bf67a7430e7ddd06a1d0730a49e6b07e

MaskRay updated this revision to Diff 212495.Jul 30 2019, 8:05 PM
MaskRay edited the summary of this revision. (Show Details)

Update aarch64-tls-vaddr-align.s because the p_vaddr%p_align = 0 hack is used to work around some ld.so bugs

MaskRay added subscribers: kib, luporl.EditedJul 30 2019, 8:39 PM

I'm not sure about breaking the (p_vaddr % p_align == 0) invariant. ld.bfd, gold, and (for the most part) lld currently provide this invariant for TLS segments in output files.

I think p_vaddr%p_align = p_offset%p_align is the invariant we must satisfy. p_vaddr%p_align=0 is not. However, p_vaddr%p_align!=0 will expose bugs (runtime address not congruent to p_vaddr modulo p_align) in existing ld.so implementations so I added a workaround to D64906.

Below are discussions about bugs in various ld.so implementations. They should not be affected by these patches but ideally they should be fixed.

Supporting n-mod-m alignment isn't free for dynamic TLS -- allocators typically have [posix_]memalign and free APIs, but I'm not aware of APIs that provide n-mod-m alignment.

When allocating TCB+static TLS blocks, ld.so should properly pad the allocated space. See my comment added to InputSection.cpp in D64906:

// Variant 1. TP will be followed by an optional gap (which is the size of 2
// pointers on ARM/AArch64, 0 on other targets), followed by alignment
// padding, then the static TLS blocks. The alignment padding is added so that
// (TP + gap + padding) is congruent to p_vaddr modulo p_align.
//
// Variant 2. Static TLS blocks, followed by alignment padding are placed
// before TP. The alignment padding is added so that (TP - padding -
// p_memsz) is congruent to p_vaddr modulo p_align.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=24606

This is a great test case! CC @kib who reviewed https://reviews.freebsd.org/D13378 @luporl for powerpc64. CC @emaste who is already subscribed...

FreeBSD rtld-elf (FreeBSD 13.0-CURRENT #89 r350361M) doesn't make runtime address of PT_TLS congruent to p_vaddr modulo p_align on i386/amd64/arm64/powerpc64:

# amd64 case
tp          = 0x8002278d0
&tlsvar     = 0x8002212c8  # this should be 1-mod-4
tlsvar[1]   = 7            # the value is correct

glibc (i386 and x86_64) has the same bug. It gets the static TLS block offsets correct.
Its aarch64 port is correct (https://sourceware.org/bugzilla/show_bug.cgi?id=24606).

musl i386 and x86_64 has been correct for many years:

% ./test1
tp          = 0x7f2697b10cc8
&tlsvar     = 0x7f2697b113b9  # 1-mod-4
tlsvar[1]   = 7

musl aarch64 (and likely arm/powerpc/powerpc64) was incorrect before 1.1.23 .
(Fixed by nsz's "fix tls offsets when p_vaddr%p_align != 0 on TLS_ABOVE_TP targets")

I think there's an interoperability/compatibility issue. Like current Bionic, I don't think the BSDs pay attention to the (p_vaddr % p_align) value, for static or dynamic TLS. glibc uses that value for static TLS segments, but not for dynamic TLS[1]. (i.e. With this change, I believe lld can generate DSOs where some .tbss variables are misaligned when dlopen'ed with glibc.)

It'd be nice if Bionic can fix the problem :) Though it may hardly happen on Bionic arm/aarch64 because .tdata is overaligned to 64.

p_vaddr(PT_TLS) % p_align(PT_TLS) != 0 can only happen if sh_addralign(.tdata) < sh_addralign(.tbss).
So you may have to align .tbss to >=128 to verify.

FreeBSD definitely paid attention to the alignment of PT_TLS:

// lib/libc/gen/tls.c
 * +----------+--------------+--------------+-----------+------------------+
 * | pre gap  | extended TCB |     TCB      | post gap  |    TLS segment   |
 * | pre_size |  extra_size  | TLS_TCB_SIZE | post_size | tls_static_space |
 * +----------+--------------+--------------+-----------+------------------+
 *
 * where:
 *  extra_size is tcbsize - TLS_TCB_SIZE
 *  post_size is used to adjust TCB to TLS aligment for first version of TLS
 *            layout and is always 0 for second version.
 *  pre_size  is used to adjust TCB aligment for first version and to adjust
 *            TLS alignment for second version.

Though amd64 is know to have problems as to dynamic TLS blocks. aarch64 may likely have static TLS offset problems (according to the experiments in D62055).

Its aarch64 port is correct (https://sourceware.org/bugzilla/show_bug.cgi?id=24606).

I suspect that's only the case when the TLS segment is allocated into surplus static memory. glibc aarch64 still allocates the segment at (&addr % p_align) != (p_vaddr % p_align) either if it's too big (e.g 0x1000 bytes) or if the object file is compiled with gcc -mtls-dialect=trad instead of the default -mtls-dialect=desc.

Though it may hardly happen on Bionic arm/aarch64 because .tdata is overaligned to 64.

Bionic's overalignment only applies to executables, not DSOs.

MaskRay updated this revision to Diff 212780.Aug 1 2019, 4:33 AM
MaskRay edited the summary of this revision. (Show Details)

Move the formula update to the previous patch

Update description

kib added a comment.Aug 4 2019, 3:36 PM

I'm not sure about breaking the (p_vaddr % p_align == 0) invariant. ld.bfd, gold, and (for the most part) lld currently provide this invariant for TLS segments in output files.

I think p_vaddr%p_align = p_offset%p_align is the invariant we must satisfy. p_vaddr%p_align=0 is not. However, p_vaddr%p_align!=0 will expose bugs (runtime address not congruent to p_vaddr modulo p_align) in existing ld.so implementations so I added a workaround to D64906.

Below are discussions about bugs in various ld.so implementations. They should not be affected by these patches but ideally they should be fixed.

Supporting n-mod-m alignment isn't free for dynamic TLS -- allocators typically have [posix_]memalign and free APIs, but I'm not aware of APIs that provide n-mod-m alignment.

When allocating TCB+static TLS blocks, ld.so should properly pad the allocated space. See my comment added to InputSection.cpp in D64906:

// Variant 1. TP will be followed by an optional gap (which is the size of 2
// pointers on ARM/AArch64, 0 on other targets), followed by alignment
// padding, then the static TLS blocks. The alignment padding is added so that
// (TP + gap + padding) is congruent to p_vaddr modulo p_align.
//
// Variant 2. Static TLS blocks, followed by alignment padding are placed
// before TP. The alignment padding is added so that (TP - padding -
// p_memsz) is congruent to p_vaddr modulo p_align.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=24606

This is a great test case! CC @kib who reviewed https://reviews.freebsd.org/D13378 @luporl for powerpc64. CC @emaste who is already subscribed...

FreeBSD rtld-elf (FreeBSD 13.0-CURRENT #89 r350361M) doesn't make runtime address of PT_TLS congruent to p_vaddr modulo p_align on i386/amd64/arm64/powerpc64:

# amd64 case
tp          = 0x8002278d0
&tlsvar     = 0x8002212c8  # this should be 1-mod-4
tlsvar[1]   = 7            # the value is correct

I put (an experimental at this stage) review https://reviews.freebsd.org/D21163 to change the behavior of the FreeBSD dynamic linker.
With the patch applied, and with the test program from the sourceware bugzilla, I see on amd64:

orion% ~/build/bsd/DEV/obj/usr/home/kostik/work/build/bsd/DEV/src/amd64.amd64/libexec/rtld-elf/ld-elf.so.1 ./test1
tp          = 0x8010448d0
&tlsvar     = 0x80103d4a9
tlsvar[1]   = 7

The address is 1 mod 4, as requested.

glibc (i386 and x86_64) has the same bug. It gets the static TLS block offsets correct.
Its aarch64 port is correct (https://sourceware.org/bugzilla/show_bug.cgi?id=24606).

musl i386 and x86_64 has been correct for many years:

% ./test1
tp          = 0x7f2697b10cc8
&tlsvar     = 0x7f2697b113b9  # 1-mod-4
tlsvar[1]   = 7

musl aarch64 (and likely arm/powerpc/powerpc64) was incorrect before 1.1.23 .
(Fixed by nsz's "fix tls offsets when p_vaddr%p_align != 0 on TLS_ABOVE_TP targets")

I think there's an interoperability/compatibility issue. Like current Bionic, I don't think the BSDs pay attention to the (p_vaddr % p_align) value, for static or dynamic TLS. glibc uses that value for static TLS segments, but not for dynamic TLS[1]. (i.e. With this change, I believe lld can generate DSOs where some .tbss variables are misaligned when dlopen'ed with glibc.)

It'd be nice if Bionic can fix the problem :) Though it may hardly happen on Bionic arm/aarch64 because .tdata is overaligned to 64.

p_vaddr(PT_TLS) % p_align(PT_TLS) != 0 can only happen if sh_addralign(.tdata) < sh_addralign(.tbss).
So you may have to align .tbss to >=128 to verify.

FreeBSD definitely paid attention to the alignment of PT_TLS:

// lib/libc/gen/tls.c
 * +----------+--------------+--------------+-----------+------------------+
 * | pre gap  | extended TCB |     TCB      | post gap  |    TLS segment   |
 * | pre_size |  extra_size  | TLS_TCB_SIZE | post_size | tls_static_space |
 * +----------+--------------+--------------+-----------+------------------+
 *
 * where:
 *  extra_size is tcbsize - TLS_TCB_SIZE
 *  post_size is used to adjust TCB to TLS aligment for first version of TLS
 *            layout and is always 0 for second version.
 *  pre_size  is used to adjust TCB aligment for first version and to adjust
 *            TLS alignment for second version.

Though amd64 is know to have problems as to dynamic TLS blocks. aarch64 may likely have static TLS offset problems (according to the experiments in D62055).

Can you provide the test for static TLS on amd64/i386 ?

In D64930#1614131, @kib wrote:

I'm not sure about breaking the (p_vaddr % p_align == 0) invariant. ld.bfd, gold, and (for the most part) lld currently provide this invariant for TLS segments in output files.

I think p_vaddr%p_align = p_offset%p_align is the invariant we must satisfy. p_vaddr%p_align=0 is not. However, p_vaddr%p_align!=0 will expose bugs (runtime address not congruent to p_vaddr modulo p_align) in existing ld.so implementations so I added a workaround to D64906.

Below are discussions about bugs in various ld.so implementations. They should not be affected by these patches but ideally they should be fixed.

Supporting n-mod-m alignment isn't free for dynamic TLS -- allocators typically have [posix_]memalign and free APIs, but I'm not aware of APIs that provide n-mod-m alignment.

When allocating TCB+static TLS blocks, ld.so should properly pad the allocated space. See my comment added to InputSection.cpp in D64906:

// Variant 1. TP will be followed by an optional gap (which is the size of 2
// pointers on ARM/AArch64, 0 on other targets), followed by alignment
// padding, then the static TLS blocks. The alignment padding is added so that
// (TP + gap + padding) is congruent to p_vaddr modulo p_align.
//
// Variant 2. Static TLS blocks, followed by alignment padding are placed
// before TP. The alignment padding is added so that (TP - padding -
// p_memsz) is congruent to p_vaddr modulo p_align.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=24606

This is a great test case! CC @kib who reviewed https://reviews.freebsd.org/D13378 @luporl for powerpc64. CC @emaste who is already subscribed...

FreeBSD rtld-elf (FreeBSD 13.0-CURRENT #89 r350361M) doesn't make runtime address of PT_TLS congruent to p_vaddr modulo p_align on i386/amd64/arm64/powerpc64:

# amd64 case
tp          = 0x8002278d0
&tlsvar     = 0x8002212c8  # this should be 1-mod-4
tlsvar[1]   = 7            # the value is correct

I put (an experimental at this stage) review https://reviews.freebsd.org/D21163 to change the behavior of the FreeBSD dynamic linker.
With the patch applied, and with the test program from the sourceware bugzilla, I see on amd64:

orion% ~/build/bsd/DEV/obj/usr/home/kostik/work/build/bsd/DEV/src/amd64.amd64/libexec/rtld-elf/ld-elf.so.1 ./test1
tp          = 0x8010448d0
&tlsvar     = 0x80103d4a9
tlsvar[1]   = 7

The address is 1 mod 4, as requested.

glibc (i386 and x86_64) has the same bug. It gets the static TLS block offsets correct.
Its aarch64 port is correct (https://sourceware.org/bugzilla/show_bug.cgi?id=24606).

musl i386 and x86_64 has been correct for many years:

% ./test1
tp          = 0x7f2697b10cc8
&tlsvar     = 0x7f2697b113b9  # 1-mod-4
tlsvar[1]   = 7

musl aarch64 (and likely arm/powerpc/powerpc64) was incorrect before 1.1.23 .
(Fixed by nsz's "fix tls offsets when p_vaddr%p_align != 0 on TLS_ABOVE_TP targets")

I think there's an interoperability/compatibility issue. Like current Bionic, I don't think the BSDs pay attention to the (p_vaddr % p_align) value, for static or dynamic TLS. glibc uses that value for static TLS segments, but not for dynamic TLS[1]. (i.e. With this change, I believe lld can generate DSOs where some .tbss variables are misaligned when dlopen'ed with glibc.)

It'd be nice if Bionic can fix the problem :) Though it may hardly happen on Bionic arm/aarch64 because .tdata is overaligned to 64.

p_vaddr(PT_TLS) % p_align(PT_TLS) != 0 can only happen if sh_addralign(.tdata) < sh_addralign(.tbss).
So you may have to align .tbss to >=128 to verify.

FreeBSD definitely paid attention to the alignment of PT_TLS:

// lib/libc/gen/tls.c
 * +----------+--------------+--------------+-----------+------------------+
 * | pre gap  | extended TCB |     TCB      | post gap  |    TLS segment   |
 * | pre_size |  extra_size  | TLS_TCB_SIZE | post_size | tls_static_space |
 * +----------+--------------+--------------+-----------+------------------+
 *
 * where:
 *  extra_size is tcbsize - TLS_TCB_SIZE
 *  post_size is used to adjust TCB to TLS aligment for first version of TLS
 *            layout and is always 0 for second version.
 *  pre_size  is used to adjust TCB aligment for first version and to adjust
 *            TLS alignment for second version.

Though amd64 is know to have problems as to dynamic TLS blocks. aarch64 may likely have static TLS offset problems (according to the experiments in D62055).

Can you provide the test for static TLS on amd64/i386 ?

The new test aarch64-tls-vaddr-align.s checks p_vaddr%p_align = 0. Once D64906 (PPC64) is accepted, I'll create an i386 patch, which involves ~40 test changes.

If you comment out the following part included in D64906:

// ELF/Writer.cpp
      else if (Out::tlsPhdr && Out::tlsPhdr->firstSec == p->firstSec)
        cmd->addrExpr = [] {
          return alignTo(script->getDot(), config->maxPageSize) +
                 alignTo(script->getDot() % config->maxPageSize,
                         Out::tlsPhdr->p_align);
        };

aarch64-tls-vaddr-align.s will create a PT_TLS where p_vaddr%p_align!=0.

D64906 and D64930 have been mentioned by an Android NDK issue today https://github.com/android-ndk/ndk/issues/1057 today. It'd be very nice if they can be committed soon...

I'm personally in favour of accepting D64930 and D64906 (I think that there is also D65865 on X86) as I think that this is an important size optimization already performed by Gold and BFD. Thinking about how to progress this further, I think it would be good to write a summary of where each of these reviews is at and what the plan is for this feature. For example:

  • Have all the comments been addressed, I think the main concern was about TLS?
  • Which targets do we want this to apply to, all of them, or just those with a 64k page size?
  • What is the dependency tree of the reviews, I think D64906 is a pre-requisite for all of them?
  • Does anyone have any objection, or are they happy for individual people active in a backend to approve that part?

I'm personally in favour of accepting D64930 and D64906 (I think that there is also D65865 on X86) as I think that this is an important size optimization already performed by Gold and BFD. Thinking about how to progress this further, I think it would be good to write a summary of where each of these reviews is at and what the plan is for this feature. For example:

Thanks for pushing forward this patch series!

  • Have all the comments been addressed, I think the main concern was about TLS?

All addressed. There was concern about TLS but I aligned p_vaddr/p_offset to p_align of PT_TLS.

  • Which targets do we want this to apply to, all of them, or just those with a 64k page size?

All, probably except x86_64 (with max-page-size=4096).

bool enable = config->emachine != EM_X86_64 || config->maxPageSize != 4096;

  • What is the dependency tree of the reviews, I think D64906 is a pre-requisite for all of them?

A tree with a single internal node:) Yes, D64906 is a pre-requisite for all of them.

D64906 base, PPC64
  D64930 AArch64 47 tests
  D65865 i386 33 tests
  Dxxxxx ARM (30~50 tests)
  Dxxxxx MIPS
  rLLDxxxxx RISC-V <10 tests
  rLLD PPC32 <10 tests
  Dxxxxx x86_64 with maxpagesize != 4096

Each new target can be enabled independently. The code change is trivial:

-      bool enable = config->emachine == EM_PPC64;
+      bool enable =
+          config->emachine == EM_AARCH64 || config->emachine == EM_PPC64;

The only problem is the number of the affected tests. Due to the layout change nature of the patch series, lots of addresses have to be updated. x86_64 with maxpagesize=4096 is estimated to affect 300+ tests.

  • Does anyone have any objection, or are they happy for individual people active in a backend to approve that part?

We haven't heard from MIPS yet. @atanasyan

MaskRay added a comment.EditedAug 19 2019, 11:11 PM

I'm personally in favour of accepting D64930 and D64906 (I think that there is also D65865 on X86) as I think that this is an important size optimization already performed by Gold and BFD. Thinking about how to progress this further, I think it would be good to write a summary of where each of these reviews is at and what the plan is for this feature. For example:

Thanks for pushing forward this patch series!

  • Have all the comments been addressed, I think the main concern was about TLS?

All addressed. There was concern about TLS but I aligned p_vaddr/p_offset to p_align of PT_TLS.

  • Which targets do we want this to apply to, all of them, or just those with a 64k page size?

All, probably except x86_64 (with max-page-size=4096).

bool enable = config->emachine != EM_X86_64 || config->maxPageSize != 4096;

  • What is the dependency tree of the reviews, I think D64906 is a pre-requisite for all of them?

A tree with a single internal node:) Yes, D64906 is a pre-requisite for all of them.

D64906 base, PPC64
  D64930 AArch64 47 tests
  D65865 i386 33 tests
  Dxxxxx ARM (30~50 tests)
  Dxxxxx MIPS
  rLLDxxxxx RISC-V <10 tests
  rLLD PPC32 <10 tests
  Dxxxxx x86_64 with maxpagesize != 4096

Each new target can be enabled independently. The code change is trivial:

-      bool enable = config->emachine == EM_PPC64;
+      bool enable =
+          config->emachine == EM_AARCH64 || config->emachine == EM_PPC64;

The only problem is the number of the affected tests. Due to the layout change nature of the patch series, lots of addresses have to be updated. x86_64 with maxpagesize=4096 is estimated to affect 300+ tests.

  • Does anyone have any objection, or are they happy for individual people active in a backend to approve that part?

We haven't heard from MIPS yet. @atanasyan

@ruiu Let's move D64906, D64930 and D65865 forward?

ruiu added a comment.Aug 20 2019, 12:31 AM

I'm personally in favour of accepting D64930 and D64906 (I think that there is also D65865 on X86) as I think that this is an important size optimization already performed by Gold and BFD. Thinking about how to progress this further, I think it would be good to write a summary of where each of these reviews is at and what the plan is for this feature. For example:

Thanks for pushing forward this patch series!

  • Have all the comments been addressed, I think the main concern was about TLS?

All addressed. There was concern about TLS but I aligned p_vaddr/p_offset to p_align of PT_TLS.

  • Which targets do we want this to apply to, all of them, or just those with a 64k page size?

All, probably except x86_64 (with max-page-size=4096).

bool enable = config->emachine != EM_X86_64 || config->maxPageSize != 4096;

  • What is the dependency tree of the reviews, I think D64906 is a pre-requisite for all of them?

A tree with a single internal node:) Yes, D64906 is a pre-requisite for all of them.

D64906 base, PPC64
  D64930 AArch64 47 tests
  D65865 i386 33 tests
  Dxxxxx ARM (30~50 tests)
  Dxxxxx MIPS
  rLLDxxxxx RISC-V <10 tests
  rLLD PPC32 <10 tests
  Dxxxxx x86_64 with maxpagesize != 4096

Each new target can be enabled independently. The code change is trivial:

-      bool enable = config->emachine == EM_PPC64;
+      bool enable =
+          config->emachine == EM_AARCH64 || config->emachine == EM_PPC64;

The only problem is the number of the affected tests. Due to the layout change nature of the patch series, lots of addresses have to be updated. x86_64 with maxpagesize=4096 is estimated to affect 300+ tests.

  • Does anyone have any objection, or are they happy for individual people active in a backend to approve that part?

We haven't heard from MIPS yet. @atanasyan

@ruiu Let's move D64906, D64930 and D65865 forward?

Yes, I'm in favor of these patches. Please move forward. Are there any remaining concerns?

I'm personally in favour of accepting D64930 and D64906 (I think that there is also D65865 on X86) as I think that this is an important size optimization already performed by Gold and BFD. Thinking about how to progress this further, I think it would be good to write a summary of where each of these reviews is at and what the plan is for this feature. For example:

Thanks for pushing forward this patch series!

  • Have all the comments been addressed, I think the main concern was about TLS?

All addressed. There was concern about TLS but I aligned p_vaddr/p_offset to p_align of PT_TLS.

  • Which targets do we want this to apply to, all of them, or just those with a 64k page size?

All, probably except x86_64 (with max-page-size=4096).

bool enable = config->emachine != EM_X86_64 || config->maxPageSize != 4096;

  • What is the dependency tree of the reviews, I think D64906 is a pre-requisite for all of them?

A tree with a single internal node:) Yes, D64906 is a pre-requisite for all of them.

D64906 base, PPC64
  D64930 AArch64 47 tests
  D65865 i386 33 tests
  Dxxxxx ARM (30~50 tests)
  Dxxxxx MIPS
  rLLDxxxxx RISC-V <10 tests
  rLLD PPC32 <10 tests
  Dxxxxx x86_64 with maxpagesize != 4096

Each new target can be enabled independently. The code change is trivial:

-      bool enable = config->emachine == EM_PPC64;
+      bool enable =
+          config->emachine == EM_AARCH64 || config->emachine == EM_PPC64;

The only problem is the number of the affected tests. Due to the layout change nature of the patch series, lots of addresses have to be updated. x86_64 with maxpagesize=4096 is estimated to affect 300+ tests.

  • Does anyone have any objection, or are they happy for individual people active in a backend to approve that part?

We haven't heard from MIPS yet. @atanasyan

@ruiu Let's move D64906, D64930 and D65865 forward?

Yes, I'm in favor of these patches. Please move forward. Are there any remaining concerns?

Great! No concerns as explained above.

ruiu accepted this revision.Aug 20 2019, 12:48 AM

LGTM

This revision is now accepted and ready to land.Aug 20 2019, 12:48 AM
MaskRay updated this revision to Diff 216063.Aug 20 2019, 1:19 AM

Rebase. aarch64-gnu-ifunc-nonpreemptable2.s was recently added.

This revision was automatically updated to reflect the committed changes.
pcc added a comment.Aug 22 2019, 7:38 PM

Hi @MaskRay, it looks like this change has caused some binaries to start segfaulting on Android aarch64 devices. The strange thing is that the segfault happens even before entering the dynamic loader, and furthermore the segfault doesn't happen if I invoke the dynamic loader directly, so I'm guessing that the Linux kernel doesn't like the binary for some reason.


I've attached a repro.tar that I can use to reproduce this problem reliably like so (the "CANNOT LINK EXECUTABLE" error is expected because the runtime library is not present on the device):

$ ~/l2/ra/bin/ld.lld @response.txt
$ adb -s $pixel2 push bin /data/local/tmp
bin: 1 file pushed. 3.9 MB/s (15608 bytes in 0.004s)
$ adb -s $pixel2 shell
walleye:/ # /data/local/tmp/bin
Segmentation fault 
139|walleye:/ # /system/bin/linker64 /data/local/tmp/bin                                                                                   
CANNOT LINK EXECUTABLE "/data/local/tmp/bin": library "libclang_rt.hwasan-aarch64-android.so" not found

Without this change:

$ ~/l/ra/bin/ld.lld @response.txt
$ adb -s $pixel2 push bin /data/local/tmp
bin: 1 file pushed. 1.2 MB/s (17992 bytes in 0.014s)
$ adb -s $pixel2 shell
walleye:/ # /data/local/tmp/bin
CANNOT LINK EXECUTABLE "/data/local/tmp/bin": library "libclang_rt.hwasan-aarch64-android.so" not found

These instructions use Android adb, but given that the problem appears to be with the Linux kernel you might be able to reproduce on regular Linux as well.

MaskRay added a comment.EditedAug 22 2019, 9:25 PM
In D64930#1642223, @pcc wrote:

Hi @MaskRay, it looks like this change has caused some binaries to start segfaulting on Android aarch64 devices. The strange thing is that the segfault happens even before entering the dynamic loader, and furthermore the segfault doesn't happen if I invoke the dynamic loader directly, so I'm guessing that the Linux kernel doesn't like the binary for some reason.


I've attached a repro.tar that I can use to reproduce this problem reliably like so (the "CANNOT LINK EXECUTABLE" error is expected because the runtime library is not present on the device):

$ ~/l2/ra/bin/ld.lld @response.txt
$ adb -s $pixel2 push bin /data/local/tmp
bin: 1 file pushed. 3.9 MB/s (15608 bytes in 0.004s)
$ adb -s $pixel2 shell
walleye:/ # /data/local/tmp/bin
Segmentation fault 
139|walleye:/ # /system/bin/linker64 /data/local/tmp/bin                                                                                   
CANNOT LINK EXECUTABLE "/data/local/tmp/bin": library "libclang_rt.hwasan-aarch64-android.so" not found

Without this change:

$ ~/l/ra/bin/ld.lld @response.txt
$ adb -s $pixel2 push bin /data/local/tmp
bin: 1 file pushed. 1.2 MB/s (17992 bytes in 0.014s)
$ adb -s $pixel2 shell
walleye:/ # /data/local/tmp/bin
CANNOT LINK EXECUTABLE "/data/local/tmp/bin": library "libclang_rt.hwasan-aarch64-android.so" not found

These instructions use Android adb, but given that the problem appears to be with the Linux kernel you might be able to reproduce on regular Linux as well.

In D64930#1642223, @pcc wrote:

Hi @MaskRay, it looks like this change has caused some binaries to start segfaulting on Android aarch64 devices. The strange thing is that the segfault happens even before entering the dynamic loader, and furthermore the segfault doesn't happen if I invoke the dynamic loader directly, so I'm guessing that the Linux kernel doesn't like the binary for some reason.


I've attached a repro.tar that I can use to reproduce this problem reliably like so (the "CANNOT LINK EXECUTABLE" error is expected because the runtime library is not present on the device):

$ ~/l2/ra/bin/ld.lld @response.txt
$ adb -s $pixel2 push bin /data/local/tmp
bin: 1 file pushed. 3.9 MB/s (15608 bytes in 0.004s)
$ adb -s $pixel2 shell
walleye:/ # /data/local/tmp/bin
Segmentation fault 
139|walleye:/ # /system/bin/linker64 /data/local/tmp/bin                                                                                   
CANNOT LINK EXECUTABLE "/data/local/tmp/bin": library "libclang_rt.hwasan-aarch64-android.so" not found

Without this change:

$ ~/l/ra/bin/ld.lld @response.txt
$ adb -s $pixel2 push bin /data/local/tmp
bin: 1 file pushed. 1.2 MB/s (17992 bytes in 0.014s)
$ adb -s $pixel2 shell
walleye:/ # /data/local/tmp/bin
CANNOT LINK EXECUTABLE "/data/local/tmp/bin": library "libclang_rt.hwasan-aarch64-android.so" not found

These instructions use Android adb, but given that the problem appears to be with the Linux kernel you might be able to reproduce on regular Linux as well.

I don't have any aarch64 device for testing, but I tried some simple programs on qemu-user and they work:

aarch64-linux-gnu-gcc -Wl,-rpath,/usr/aarch64-linux-gnu/lib:. -Wl,--dynamic-linker,/usr/aarch64-linux-gnu/lib/ld-2.28.so a.o -o a '-###'
# copy the linker command line and link with ld.lld
./a

@peter.smith tested an earlier revision of D64906+D64930 on AArch64 hardware. On regular Linux AArch64 desktop distribution? That earlier revision doesn't include PT_TLS workarounds for various ld.so implementations but I think it is irrelevant here.

@pcc Can you test if the program runs with -z separate-code? What about -z max-page-size=65536 (a program should work if runtime pagesz <= max-page-size)? Also -z norelro. If strace exists on the system, what does it report?

Delete the --chroot line from response.txt and link with aarch64-linux-gnu-ld, try with (a) -z noseprate-code (b) -z separate-code. Does the program crash as well?

+@rprichard who's already subscribed..

lld (-z noseparate-code):
Program Headers:                                                                                        [16/544]
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align                    
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x000268 0x000268 R   0x8                      
  INTERP         0x0002a8 0x00000000000002a8 0x00000000000002a8 0x000015 0x000015 R   0x1                      
      [Requesting program interpreter: /system/bin/linker64]                                                   
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000bc4 0x000bc4 R   0x1000                   
  LOAD           0x000bc4 0x0000000000001bc4 0x0000000000001bc4 0x000aec 0x000aec R E 0x1000                   
  LOAD           0x0016b0 0x00000000000036b0 0x00000000000036b0 0x0002e8 0x0002e8 RW  0x1000                   
  LOAD           0x001998 0x00000000000049a0 0x00000000000049a0 0x000000 0x010010 RW  0x1000                   
  DYNAMIC        0x0016e8 0x00000000000036e8 0x00000000000036e8 0x0001e0 0x0001e0 RW  0x8                      
  GNU_RELRO      0x0016b0 0x00000000000036b0 0x00000000000036b0 0x0002e8 0x000950 R   0x1                      
  GNU_EH_FRAME   0x000a80 0x0000000000000a80 0x0000000000000a80 0x000044 0x000044 R   0x4                      
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0                        
  NOTE           0x0002c0 0x00000000000002c0 0x00000000000002c0 0x0000b4 0x0000b4 R   0x4

ld.bfd (-z noseparate-code):
Program Headers:                                                                                                
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align                     
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8                       
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x000015 0x000015 R   0x1                      
      [Requesting program interpreter: /system/bin/linker64]                                                   
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x001644 0x001644 R E 0x1000                   
  LOAD           0x001cd0 0x0000000000002cd0 0x0000000000002cd0 0x000330 0x010340 RW  0x1000                   
  DYNAMIC        0x001d08 0x0000000000002d08 0x0000000000002d08 0x000220 0x000220 RW  0x8                      
  NOTE           0x000250 0x0000000000000250 0x0000000000000250 0x0000b4 0x0000b4 R   0x4                      
  GNU_EH_FRAME   0x001528 0x0000000000001528 0x0000000000001528 0x000044 0x000044 R   0x4                      
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10                     
  GNU_RELRO      0x001cd0 0x0000000000002cd0 0x0000000000002cd0 0x000330 0x000330 R   0x1

aarch64 gold compiled from binutils-gdb HEAD does not recognize R_AARCH64_MOVW_PREL_G3 used in tmp/decorate-proc-maps-2a883b.o.

% ./ld-new @response.txt
./ld-new: error: tmp/decorate-proc-maps-2a883b.o: unsupported reloc 293 against local symbol
./ld-new: error: tmp/decorate-proc-maps-2a883b.o: unsupported reloc 293 against local symbol
./ld-new: error: tmp/decorate-proc-maps-2a883b.o: unsupported reloc 293 against local symbol
./ld-new: error: tmp/decorate-proc-maps-2a883b.o: unsupported reloc 293 against local symbol
compiler-rt/test/hwasan/TestCases/Linux/decorate-proc-maps.c:34: error: cannot relocate invalid reloc 293 in object file
compiler-rt/test/hwasan/TestCases/Linux/decorate-proc-maps.c:36: error: cannot relocate invalid reloc 293 in object file
compiler-rt/test/hwasan/TestCases/Linux/decorate-proc-maps.c:40: error: cannot relocate invalid reloc 293 in object file
compiler-rt/test/hwasan/TestCases/Linux/decorate-proc-maps.c:48: error: cannot relocate invalid reloc 293 in object file

I can reproduce a segfault with this image on an Ubuntu 16.04 linux machine (CPU is a cortex-a72). For reference I needed to alter the response file to change the dynamic linker --dynamic-linker=/lib/ld-linux-aarch64.so.1

At the moment I'm not very far in debugging what the problem is
execve("./bin", ["./bin"], [/* 20 vars */]) = -1 EINVAL (Invalid argument)

  • SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0}

expanding the truncated /* 20 vars */ is not helpful, just the standard arguments passed to execve.

A critical part of the failure seems to be -znow, if I take that out then I get
./bin: error while loading shared libraries: libclang_rt.hwasan-aarch64-android.so: cannot open shared object file: No such file or directory

Will see if I can reproduce in QEMU.

A critical part of the failure seems to be -znow, if I take that out then I get
./bin: error while loading shared libraries: libclang_rt.hwasan-aarch64-android.so: cannot open shared object file: No such file or directory

Will see if I can reproduce in QEMU.

Altering the -dynamic-linker in the response file to /lib/ld-linux-aarch64.so.1

qemu-aarch64 -L /work/gcca64/aarch64-linux-gnu/libc ./bin
./bin: Invalid argument

when I remove -znow

qemu-aarch64 -L /work/gcca64/aarch64-linux-gnu/libc ./bin
  ./bin: error while loading shared libraries: libclang_rt.hwasan-aarch64-android.so: cannot open shared object file: No such file or directory

From what I can tell the linux kernel doesn't do anything with the DF_BIND_NOW flag, but that does add 2 dynamic tags (32 bytes) to the image. If I hex edit both the flag values to 0 I still get the same error from the kernel.

I think I know what is causing the kernel to fault the ELF file

The specific problem is the PT_LOAD before the PT_DYNAMIC, the p_vaddr is not congruent to the p_offset modulo p_align. It should be 0x19A0. If I hex edit the binary to make this the case I can get the ELF file to run. I think that this misalignment might be caused because p_filesz is 0, which might skip the alignment of the p_offset (not checked). When -znow is removed the ordering of some sections changes and there is no PT_LOAD with 0 p_filesz so everything gets aligned.

ProgramHeader {
  Type: PT_LOAD (0x1)
  Offset: 0x16B0
  VirtualAddress: 0x36B0
  PhysicalAddress: 0x36B0
  FileSize: 744
  MemSize: 744
  Flags [ (0x6)
    PF_R (0x4)
    PF_W (0x2)
  ]
  Alignment: 4096
}
ProgramHeader {
  Type: PT_LOAD (0x1)
  Offset: 0x1998
  VirtualAddress: 0x49A0
  PhysicalAddress: 0x49A0
  FileSize: 0
  MemSize: 65552
  Flags [ (0x6)
    PF_R (0x4)
    PF_W (0x2)
  ]
  Alignment: 4096
}
ProgramHeader {
  Type: PT_DYNAMIC (0x2)
  Offset: 0x16E8
  VirtualAddress: 0x36E8
  PhysicalAddress: 0x36E8
  FileSize: 480
  MemSize: 480
  Flags [ (0x6)
    PF_R (0x4)
    PF_W (0x2)
  ]
  Alignment: 8
}

@MaskRay are you ok to take it from here?

MaskRay added a comment.EditedAug 23 2019, 7:21 AM

I think I know what is causing the kernel to fault the ELF file

I think that this misalignment might be caused because p_filesz is 0, which might skip the alignment of the p_offset (not checked).

@MaskRay are you ok to take it from here?

Thanks a lot for narrowing down the issue! I saw a similar issue while playing with lld ppc32. (ppc32 BSS PLT is a nobits RWX section, included in a separate PT_LOAD. Our current computeFileOffset can cause p_vaddr%p_align!=p_offset%p_align) I will work on a patch.

MaskRay added a comment.EditedAug 23 2019, 8:21 AM

I think I know what is causing the kernel to fault the ELF file

I think that this misalignment might be caused because p_filesz is 0, which might skip the alignment of the p_offset (not checked).

@MaskRay are you ok to take it from here?

Thanks a lot for narrowing down the issue! I saw a similar issue while playing with lld ppc32. (ppc32 BSS PLT is a nobits RWX section, included in a separate PT_LOAD. Our current computeFileOffset can cause p_vaddr%p_align!=p_offset%p_align) I will work on a patch.

@pcc @peter.smith D66658 should fix the issue.

Edit: fixed.