Page MenuHomePhabricator

[ARM][AArch64] Overalign .tbss (Android Bionic hack) to avoid p_vaddr%p_align!=0
AbandonedPublic

Authored by MaskRay on May 11 2019, 6:41 AM.

Details

Summary

D53906 increased p_align of PT_TLS on ARM/AArch64 to 32/64 to make the
TLS layout compatible with Android Bionic. However, this can make glibc
ARM/AArch64 programs using {initial,local}-exec TLS models crash (see
PR41527).

The faulty PT_TLS satisfies p_vaddr%p_align != 0. The remainder is normally
0 but may be non-zero with the D53906 hack in place. The problem is that
we increase the p_align of the PT_TLS after the OutputSection's
addresses are fixed (assignAddress()). It is possible that
p_vaddr%old_p_align = 0 while p_vaddr%new_p_align != 0.

When this happens, lld and different ld.so implementations have
different opinions on the offset of the first module. glibc elf/dl-tls.c
computes an offset that satisfies: offset%p_align == p_vaddr%p_align.
This is correct as a basic ELF requirement.

lld, musl (1.1.20 ~ latest 1.1.22), FreeBSD rtld, and Bionic place the
main TLS block at a wrong offset: `alignUp(2*sizeof(void*),
main_tls_align)`. This choice doesn't take p_vaddr%p_align into account.

This patches overaligns .tbss to make p_vaddr%p_align==0.

Event Timeline

MaskRay created this revision.May 11 2019, 6:41 AM
MaskRay added a comment.EditedMay 11 2019, 6:44 AM

@emaste I guess FreeBSD rtld also handles p_vaddr%p_align!=0 correctly (from my casual reading of freebsd/libexec/rtld-elf/aarch64/rtld_machdep.h calculate_tls_post_size). It may be worth checking...

Update: I believe FreeBSD rtld handles p_vaddr%p_align!=0 correctly. Checked with a FreeBSD arm64 live cd: qemu-system-aarch64 -M virt -cpu cortex-a57 -nographic -m 2G -bios /tmp/Downloads/QEMU_EFI.img -drive if=none,file=FreeBSD-13.0-CURRENT-arm64-aarch64-20190510-r347415-mini-memstick.img,id=hd0 -device virtio-blk-device,drive=hd0 -nographic

// llvm-readelf -l
  TLS            0x011000 0x00000000000103a0 0x00000000000103a0 0x000000 0x000040 R   0x40

If people feel the Android Bionic hack is unfair to other systems... we can probably hide that in an --androidxxx option.

An alternative is D61825: keep the hack under an option --android-tls. I prefer that one as the overhead of 126 bytes isn't negligible and the unconditional hack is unfair to other ld.so implementations. I've talked to a maintainer of glibc's arm port and he will report that issue on the glibc bugzilla.

I think that this will definitely fix the problem, but I'd like to understand a bit more about what glibc is doing and what assumptions it is making, there does seem to be some code that attempts to deal with a PT_TLS with p_vaddr != 0 modulo p_align. My understanding from looking at dt-tls.c _dl_determine_tlsoffset() which sets the offset of each tls block (with the executable being in slot 0). Does something like this:

  • offset = TLS_TCB_SIZE (hard-coded to 2 * (void*) pointers.
  • calculates firstbyte, which is derived from l_tls_firstbyte_offset which is PT_TLS p_vaddr % p_align. For the example in pr41527 this is 32 bytes
  • sets the l_tls_offset of slot 0 to alignto(offset, p_align) - firstbyte.

With an unpatched LLD we are expecting l_tls_offset to be alignto(offset, p_align).

Have I got that right, and can you point out what assumption glibc is making? I've not yet managed to convince myself what it is doing wrong? I think that this may be important in choosing between this and D61825. It may be that what we do now just happens to work for non-glibc but we may still not be entirely correct. If that is the case then this would favour this fix rather than D61825.

The other problem with D61825 is that it forces a change to the clang driver to Android to always pass the flag through to LLD. That is something the Android toolchain folk will want to comment on.

I think that this will definitely fix the problem, but I'd like to understand a bit more about what glibc is doing and what assumptions it is making, there does seem to be some code that attempts to deal with a PT_TLS with p_vaddr != 0 modulo p_align. My understanding from looking at dt-tls.c _dl_determine_tlsoffset() which sets the offset of each tls block (with the executable being in slot 0). Does something like this:

  • offset = TLS_TCB_SIZE (hard-coded to 2 * (void*) pointers.
  • calculates firstbyte, which is derived from l_tls_firstbyte_offset which is PT_TLS p_vaddr % p_align. For the example in pr41527 this is 32 bytes
  • sets the l_tls_offset of slot 0 to alignto(offset, p_align) - firstbyte.

    With an unpatched LLD we are expecting l_tls_offset to be alignto(offset, p_align).

    Have I got that right, and can you point out what assumption glibc is making? I've not yet managed to convince myself what it is doing wrong? I think that this may be important in choosing between this and D61825. It may be that what we do now just happens to work for non-glibc but we may still not be entirely correct. If that is the case then this would favour this fix rather than D61825.

    The other problem with D61825 is that it forces a change to the clang driver to Android to always pass the flag through to LLD. That is something the Android toolchain folk will want to comment on.

Some background:

GAP_ABOVE_TP = 2 * sizeof(void*)
the offset of the main module = alignUp(GAP_ABOVE_TP, main_tls_align)
the linker and the dynamic loader must agree upon this

Android bionic: the offset of the main module must be >= 8 * sizeof(void*)
freebsd rtld: good
musl: if main_tls_align > GAP_ABOVE_TP, the offset of the main module was wrong before 1.1.20
glibc: p_vaddr%p_align != 0 => wrong offset of the first module

Have I got that right, and can you point out what assumption glibc is making?

Below is my understanding about how glibc is wrong. glibc/elf/dl-tls.c:250 (the #elif TLS_DTV_AT_TP part)

// I don't know what this piece of code intended to do with non-zero firstbyte...
off = roundup (offset, slotinfo[cnt].map->l_tls_align); // roundup(16, 64) = 64
if (off - offset < firstbyte) // 64 - 16 ; 32
  off += slotinfo[cnt].map->l_tls_align;
slotinfo[cnt].map->l_tls_offset = off - firstbyte; // 64 - 32 = 32, wrong, should be 64

// If firstbyte is 0 (i.e. p_vaddr % l_tls_align == 0), this is simply:
off = roundup (offset, slotinfo[cnt].map->l_tls_align);
slotinfo[cnt].map->l_tls_offset = off;
// and is correct

I hope Szabolcs can chime in as he is the maintainer of glibc's arm port..

Below is my understanding about how glibc is wrong. glibc/elf/dl-tls.c:250 (the #elif TLS_DTV_AT_TP part)

// I don't know what this piece of code intended to do with non-zero firstbyte...
off = roundup (offset, slotinfo[cnt].map->l_tls_align); // roundup(16, 64) = 64
if (off - offset < firstbyte) // 64 - 16 ; 32
  off += slotinfo[cnt].map->l_tls_align;
slotinfo[cnt].map->l_tls_offset = off - firstbyte; // 64 - 32 = 32, wrong, should be 64

// If firstbyte is 0 (i.e. p_vaddr % l_tls_align == 0), this is simply:
off = roundup (offset, slotinfo[cnt].map->l_tls_align);
slotinfo[cnt].map->l_tls_offset = off;
// and is correct

My understanding of the firstbyte != 0 case is that this is where p_vaddr % p_align is non-zero. Note that if the loader puts the TCB at an address that is 32 modulo p_align then the TLS block of the application in slot 0 will be 64-bit aligned according to its program header. In effect I think glibc is placing the TLS block in a legal position, another legal position would be to place the TCB at an address that was 0 modulo p_align and then place first TLS block at 64 bytes from the start of the TCB. I think we are assuming that glibc should always be doing the latter, but I think the PT_TLS we are giving it is permitting it to do the former. I think this patch would force it to do the latter, and match LLD's calculation. I'm not 100% certain of this at the moment though.

nsz added a subscriber: nsz.May 13 2019, 8:46 AM

i think glibc is right (comes up with a tlsoffset%p_align == p_vaddr%p_align for the module) and musl and other implementations are wrong (they may not get the tlsoffset right in this case: it seems musl always computes tlsoffset%p_align==0 on tls variant 1 targets).

MaskRay added a comment.EditedMay 13 2019, 9:31 AM

@rprichard Rich Felker briefly mentioned this can be worked around in crt1.o. My understanding is that:

  1. create a dummy .tdata ((a) with placeholder TLS variables of 8*sizeof(void*) bytes (b) sh_addralign=8*sizeof(void*)) in crt1.o (and its variants)
  2. create a R_{ARM,AARCH64}_NONE from .text offset 0 to the .tdata (this prevents --gc-sections stripping)
  3. no hack is needed in ld.bfd/gold/lld. crt1.o is linked into executables. The linker applies its usual Rules for Linking Unrecognized Sections and concatenates .tdata and .tbss. The .tdata from crt1.o will be the first input section in the TLS segment. Don't use the dummy part of the TLS block (they may collide with Bionic's reserved TLS slots)

Does this approach work?

@emaste If glibc is correct, then all of FreeBSD ldso/musl/Bionic are wrong regarding p_vaddr%p_align!=0. https://reviews.freebsd.org/D16510 may need fixing.

@rprichard Rich Felker briefly mentioned this can be worked around in crt1.o. My understanding is that:

  1. create a dummy .tdata ((a) with placeholder TLS variables of 8*sizeof(void*) bytes (b) sh_addralign=8*sizeof(void*)) in crt1.o (and its variants)
  2. create a R_{ARM,AARCH64}_NONE from .text offset 0 to the .tdata (this prevents --gc-sections stripping)
  3. no hack is needed in ld.bfd/gold/lld. crt1.o is linked into executables. The linker applies its usual Rules for Linking Unrecognized Sections and concatenates .tdata and .tbss. The .tdata from crt1.o will be the first input section in the TLS segment. Don't use the dummy part of the TLS block (they may collide with Bionic's reserved TLS slots)

    Does this approach work?

I think that basically works. It would add a TLS segment to every Android executable, even those that aren't using ELF TLS:

  • If the placeholder symbol has size 1 and 64-byte alignment, then every Android executable (targeting Q(29) and up) would have a PT_TLS segment of 64-byte size and alignment with gold or lld.
  • If the placeholder symbol in .tdata has size 0 instead, then ld.bfd omits the .tdata section from the output if there aren't any other non-empty .tdata sections. This is a problem if there are other TLS sections (.tbss). Android is switching to lld, though, so maybe it's OK. (The NDK still defaults to bfd and gold.)
  • If the placeholder symbol has a size of 48 bytes and 1-byte alignment, then libc.so can't use alignment to detect whether space was reserved. I can think of a couple ways for space to not be reserved:
    • Using an NDK that's too old
    • Using a too-old NDK API level, assuming this placeholder symbol is omitted when targeting old versions of Bionic (because Q(29) is currently rejecting an underaligned TLS segment...)

The linker applies its usual Rules for Linking Unrecognized Sections

I don't think these rules apply to .tdata and .tbss, because the linker should recognize the SHF_TLS flag?

Bionic is requiring that executables (dynamic ones, at least) leave their section tables unstripped. Maybe Bionic could look for a special SHF_TLS section (.tcb_reservation?) at the start of the TLS segment to sanity-check whether space was properly reserved? The section would need to be located before .tdata and .tbss, but I think it could be output in crtbegin*.o.

Aside: An R_{ARM,AARCH64}_NONE relocation to a TLS symbol has the desired effect. Is there a better way to create this relocation than by using a hex editor?

nsz added a comment.May 15 2019, 4:29 AM

I think that basically works. It would add a TLS segment to every Android executable, even those that aren't using ELF TLS:

  • If the placeholder symbol has size 1 and 64-byte alignment, then every Android executable (targeting Q(29) and up) would have a PT_TLS segment of 64-byte size and alignment with gold or lld.
  • If the placeholder symbol in .tdata has size 0 instead, then ld.bfd omits the .tdata section from the output if there aren't any other non-empty .tdata sections. This is a problem if there are other TLS sections (.tbss). Android is switching to lld, though, so maybe it's OK. (The NDK still defaults to bfd and gold.)
  • If the placeholder symbol has a size of 48 bytes and 1-byte alignment, then libc.so can't use alignment to detect whether space was reserved. I can think of a couple ways for space to not be reserved:
    • Using an NDK that's too old
    • Using a too-old NDK API level, assuming this placeholder symbol is omitted when targeting old versions of Bionic (because Q(29) is currently rejecting an underaligned TLS segment...)

i think the placeholder tls symbol size and alignment should
be exactly such that it covers the bionic internal tls slots.

if the symbol has default visibility then the libc can check
at runtime that it really covers the required range.

if the symbol is missing from an executable then it is only
safe to load if there is no static tls in it. the libc can
take care of static tls in other modules (their static tls
offset is decided by the libc)

in principle the symbol check can be done using weak reference
but i don't think undef weak tls sym reliably evaluates to 0,
so i'd just rely on plain symbol interposition (placeholder
is size 1 extern tls sym in libc that is interposed by the
placeholder from the .exe)

Aside: An R_{ARM,AARCH64}_NONE relocation to a TLS symbol has the desired effect. Is there a better way to create this relocation than by using a hex editor?

why not export the symbol?
then it cannot be removed for sure.
(i think gas has a .reloc directive that accepts R_AARCH64_NONE as argument, not sure if llvm handles that)

Aside: An R_{ARM,AARCH64}_NONE relocation to a TLS symbol has the desired effect. Is there a better way to create this relocation than by using a hex editor?

why not export the symbol?
then it cannot be removed for sure.
(i think gas has a .reloc directive that accepts R_AARCH64_NONE as argument, not sure if llvm handles that)

llvm has some support for the .reloc directive but IIRC it requires each target (ARM, AArch64, ...) each target AsmBackend is required to map the relocation name to a supported MCFixupKind by I think getFixupKind(RelocName). Unfortunately neither ARM, AArch64 or many other backends override the default, which doesn't recognise anything. I think this would be a fairly straightforward thing to implement.

peter.smith added a comment.EditedMay 15 2019, 6:50 AM

I have a question based on the updated description of D61825 (makes more sense in the context of this patch)

- Moving the hack to the OutputSection layer makes the hack more perceivable as it has interaction with the linker script ALIGNUP command.
- It may waste up to 8*WordSize * 2 bytes of each thread (when both .tbss and .tdata exist and are overaligned). With more code, we can apply the overalignment to the first TLS output section (either .tbss or .tdata) and decrease the wastage to 8*WordSize, but that is still unsatisfactory on other platforms.

I'm not sure I understand the comment about ALIGNUP, did you mean the linker script command ALIGN? I guess by perceivable you mean it has more impact on the file and following sections?

I noticed while trying to construct a test case on Arm that does not need a shared library, that LLD will page align the first SHF_TLS .tdata section as (with the default layout and most linker scripts) it is the first Section in the Read/Write PT_LOAD segment. It does not do this for .tbss as it returns false for needsPtLoad(). I think that this means that increasing the alignment for .tdata to 32/64 won't make any difference given that it will be aligned to a 4k/64k page anyway. The problem case can only currently happen when there is only .tbss. Given that no extra padding is needed for .tdata I think the impact of extra padding for .tbss if we only output it if there is no .tdata will be minimal. I think the extra padding would also be a maximum of 48 bytes on AArch64 and 16 on Arm, as the section would follow the .plt and that is 16-byte aligned with a size that is a multiple of 16-bytes. In summary, while an extra 48-bytes of padding to the TLS if there is only .tbss is unfortunate, I don't think it would be overfully painful to other platforms.

I agree that a change in Bionic that would work for all linkers would be ideal, but if there were a change needed in LLD, I prefer D61824 to D61825.

My Arm test case just in case it is useful to anyone. Arm does not do TLS relaxation so it may not fail on AArch64 due to relaxation of IE to LE:

tls.c

#include <stdio.h>
__thread int val __attribute__((tls_model("initial-exec")));

extern int func();

int main(void) {
    printf("From IE local %p\n", &val);
    func();
    return 0;
}

tls2.c

#include<stdio.h>
extern __thread int val __attribute__((tls_model("initial-exec")));

int func(void)  {
    printf("From IE non local %p\n", &val);
    return val;
}

clang -fpie -pie --target=armv7a-linux-gnueabihf tls.c tls2.c -O2 -o tls.axf -fuse-ld=lld
QEMU gives me different addresses for the static and dynamic locations of the variable.
From IE local 0xf6657940
From IE non local 0xf6657930

dalias added a subscriber: dalias.May 15 2019, 7:54 AM

I do not see how .tdata happening to be the first thing in the LOAD segment is relevant or helpful at all. Any "accidental alignment" as a result of that is not part of the alignment that will affect the runtime addresses of TLS objects.

I find the whole approach of using ridiculous amounts of alignment to produce a gap for Bionic to be highly misguided. It imposes significant amounts of wasted memory, and is a roundabout way of doing what you really want to do, which is just *allocating* 8 words of reserved TLS at the beginning of the main program's TLS image, for Bionic. This could be done directly either in Bionic's crt files, or in LLD as a Bionic-specific hack that just emits 8 words at the beginning of .tdata. Nowhere is there any reason to touch alignment or impose these costs on non-buggy (non-Bionic) implementations.

Also, I think the description "make glibc happy" is possibly wrong. It looks like, while musl and some other implementations work with the current hack in LLD, it's only due to a bug that they work. The real problem is that the hack is emitting bad ELF.

MaskRay updated this revision to Diff 199612.May 15 2019, 8:07 AM
MaskRay edited the summary of this revision. (Show Details)
MaskRay removed subscribers: dalias, nsz.

Update description.

I don't know where to check if .tdata exists.

MaskRay retitled this revision from [ARM][AArch64] Overalign SHF_TLS instead of PT_TLS (Android Bionic hack) to make glibc happy to [ARM][AArch64] Overalign .tdata (Android Bionic hack) to fix ld.so implementations that handle p_vaddr%p_align!=0 correctly.May 15 2019, 8:12 AM
MaskRay added a comment.EditedMay 15 2019, 8:22 AM

I'm not sure I understand the comment about ALIGNUP, did you mean the linker script command ALIGN? I guess by perceivable you mean it has more impact on the file and following sections?

Sorry, I meant ALIGNOF. If we lower the hack to the OutputSection layer, it may interfere with linker scripts. Though my understanding about linker scripts is largely learned from lld source, I don't use linker scripts myself so I am not very clear what issues it may cause.

The problem case can only currently happen when there is only .tbss.

This is a good point. I've changed this revision to keep the hack in Writer.cpp (PT_TLS) but restrict the OutputSections.cpp hack to SHT_NOBITS + SHF_TLS (.tbss), but I don't know where is the best place to check if .tdata exists. If I don't update OutputSection::Alignment, I may have to update several LinkerScript.cpp functions (at least output switchTo).

If we overalign .tbss unconditionally (without checking if .tdata exists), I don't think this overhead on non-Android platforms is small:

# this is page aligned
.section .tdata,"awT",@progbits                                                                                 
.byte 0                                                                                                         

# this gets overaligned to 8*sizeof(void*). This overalignment is unnecessary but where to place the hack is a problem.
.section .tbss,"awT",@progbits
.byte 0

# Since we have the Writer.cpp:2201 `      P->p_memsz = alignTo(P->p_memsz, P->p_align);` rule, the p_memsz is 8*sizeof(void*) * 2 = 0x80
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  TLS            0x020000 0x0000000000220000 0x0000000000220000 0x000001 0x000080 R   0x40
MaskRay retitled this revision from [ARM][AArch64] Overalign .tdata (Android Bionic hack) to fix ld.so implementations that handle p_vaddr%p_align!=0 correctly to [ARM][AArch64] Overalign .tbss (Android Bionic hack) to avoid p_vaddr%p_align!=0.May 15 2019, 8:31 AM

I'm not sure I understand the comment about ALIGNUP, did you mean the linker script command ALIGN? I guess by perceivable you mean it has more impact on the file and following sections?

Sorry, I meant ALIGNOF. If we lower the hack to the OutputSection layer, it may interfere with linker scripts. Though my understanding about linker scripts is largely learned from lld source, I don't use linker scripts myself so I am not very clear what issues it may cause.

I think that should be fine. If someone happened to write ALIGNOF(.tbss) then it would return the higher value, which would be consistent with the actual section alignment. The only reason I can think of for doing that is to roll their own TLS implementation using linker defined symbols rather than the PT_TLS segment.

The problem case can only currently happen when there is only .tbss.

If we overalign .tbss unconditionally (without checking if .tdata exists), I don't think this overhead on non-Android platforms is small:

I don't have a lot of experience in this area, I'm happy to defer to those with more familiar with platforms/software using a large number of threads. With luck a solution in Bionic can be found that means this isn't required.

In D61824#1502750, @nsz wrote:

I think that basically works. It would add a TLS segment to every Android executable, even those that aren't using ELF TLS:

  • If the placeholder symbol has size 1 and 64-byte alignment, then every Android executable (targeting Q(29) and up) would have a PT_TLS segment of 64-byte size and alignment with gold or lld.
  • If the placeholder symbol in .tdata has size 0 instead, then ld.bfd omits the .tdata section from the output if there aren't any other non-empty .tdata sections. This is a problem if there are other TLS sections (.tbss). Android is switching to lld, though, so maybe it's OK. (The NDK still defaults to bfd and gold.)
  • If the placeholder symbol has a size of 48 bytes and 1-byte alignment, then libc.so can't use alignment to detect whether space was reserved. I can think of a couple ways for space to not be reserved:
    • Using an NDK that's too old
    • Using a too-old NDK API level, assuming this placeholder symbol is omitted when targeting old versions of Bionic (because Q(29) is currently rejecting an underaligned TLS segment...)

i think the placeholder tls symbol size and alignment should
be exactly such that it covers the bionic internal tls slots.

If the placeholder symbol has 48-byte-size, 1-byte-alignment, it would cover the internal slots as long as the executable's TLS alignment is 16 or less. At higher levels of alignment, the placeholder shifts forward. At an alignment of 64 bytes, the placeholder isn't needed anymore, but it's still wasting TLS memory.

Linker magic could compute the TLS section alignment and insert a placeholder section/symbol of the size needed to cover the slots exactly.

In any case, as long as libc could find the placeholder symbol, it could verify that it covered the intersection of the TLS segment and the internal slots.

in principle the symbol check can be done using weak reference
but i don't think undef weak tls sym reliably evaluates to 0,
so i'd just rely on plain symbol interposition (placeholder
is size 1 extern tls sym in libc that is interposed by the
placeholder from the .exe)

Aside: An R_{ARM,AARCH64}_NONE relocation to a TLS symbol has the desired effect. Is there a better way to create this relocation than by using a hex editor?

why not export the symbol?
then it cannot be removed for sure.
(i think gas has a .reloc directive that accepts R_AARCH64_NONE as argument, not sure if llvm handles that)

It looks like using a version script on an executable (e.g. with a local: *; wildcard) would hide an interposed symbol. libc.so wouldn't find the placeholder and would abort the process. Version scripts on executables are probably rare, though?

ruiu added a comment.May 17 2019, 3:03 AM

I don't think that always aligning them to 64 byte boundaries is not that bad. It rather sounds like a straightforward fix for the exiting compatibility hack for Android. That may waste 48 bytes, but frankly 48 is a very small number compared to other paddings we have in an in-file binary and an on-memory process image. For example, we have a half-page unused padding at end of every segment on average. For example, there's 2048 bytes of unused memory at the end of text segment on average, and if you are building your executables in such a way that they can run on 64 KiB page machines, you are wasting 32768 bytes on average just for the tail padding of each segment. There are a lot of other paddings here and there.

If we are serious about saving every byte, we probably could easily save 48 bytes for any program by sorting sections more wisely to pack them better. But we are not currently doing that, mainly because we are not serious about it, as an expected return is essentially negligible.

So, I think I'd vote for increasing the overalignment size if it makes lld to create standard-compliant executables. What do you guys think?

MaskRay abandoned this revision.May 17 2019, 3:23 AM

Abandon in favor of a revert: D62055. Ryan told me reverting is preferable.

The waste is not just 48 bytes in the ELF image on disk, but up to 63 bytes in each thread. Depending on how that falls, the waste is likely actually either 0 bytes or 4096 bytes per thread, since the memory is allocated with page granularity.

Forcing the alignment will also force musl to *always* mmap space (on page granularity, so A LOT of waste) for the main thread's TLS at startup, rather than using the built-in bss space intended to be used when the application's TLS needs are small, since the built-in space does not have excess alignment. This imposes a syscall as well, and a new failure mode for static-linked binaries that may have been intended not to be able to fail after exec.

Please, drop this ridiculous and costly hack for Bionic or put it under the --android-tls option or whatever.

@nsz @dalias

In D61824#1499997, @nsz wrote:

i think glibc is right (comes up with a tlsoffset%p_align == p_vaddr%p_align for the module) and musl and other implementations are wrong (they may not get the tlsoffset right in this case: it seems musl always computes tlsoffset%p_align==0 on tls variant 1 targets).

Is the (tlsoffset%p_align == p_vaddr%p_align) requirement documented somewhere?

The ELF spec mentions that p_offset and p_vaddr must be congruent w.r.t. both the maximum page size and p_align. For a PT_LOAD segment, it's obvious that page-size congruency is required, because the loader uses mmap for each load segment. The TLS initialization image is copied for each thread, though, so mmap's limitation doesn't apply.

Ulrich Drepper's TLS paper describes the tlsoffset calculation for IA-64 (variant I) without reference to the (p_vaddr % p_align) value:

tlsoffset_1 = round(16, align_1)
tlsoffset_m+1 = round(tlsoffset_m + tlssize_m , align_m+1)

And for IA-32 (variant II):

To access TLS blocks for modules using the static model the tlsoffset_m offsets
have to be known. These values must be subtracted from the thread register value.
Unlike what happens on IA-64 where the offsets are added. The offsets are computed
as follows:
tlsoffset_1 = round(tlssize_1 , align_1)
tlsoffset_m+1 = round(tlsoffset_m + tlssize_m+1 , align_m+1)

The waste is not just 48 bytes in the ELF image on disk, but up to 63 bytes in each thread. Depending on how that falls, the waste is likely actually either 0 bytes or 4096 bytes per thread, since the memory is allocated with page granularity.

Forcing the alignment will also force musl to *always* mmap space (on page granularity, so A LOT of waste) for the main thread's TLS at startup, rather than using the built-in bss space intended to be used when the application's TLS needs are small, since the built-in space does not have excess alignment. This imposes a syscall as well, and a new failure mode for static-linked binaries that may have been intended not to be able to fail after exec.

Please, drop this ridiculous and costly hack for Bionic or put it under the --android-tls option or whatever.

Yes, this seems like a good argument for limiting the hack to Bionic.

I'm not sure if/where it's documented, but the congruence requirement is the only reasonable interpretation of p_align. We're fixing it in musl according to this interpretation.