This is an archive of the discontinued LLVM Phabricator instance.

[ARM][AArch64] Increase TLS alignment to reserve space for Android's TCB
ClosedPublic

Authored by rprichard on Oct 30 2018, 3:20 PM.

Details

Summary

ARM and AArch64 use TLS variant 1, where the first two words after the
thread pointer are reserved for the TCB, followed by the executable's TLS
segment. Both the thread pointer and the TLS segment are aligned to at
least the TLS segment's alignment.

Android/Bionic historically has not supported ELF TLS, and it has
allocated memory after the thread pointer for several Bionic TLS slots
(currently 9 but soon only 8). At least one of these allocations
(TLS_SLOT_STACK_GUARD == 5) is widespread throughout Android/AArch64
binaries and can't be changed.

To reconcile this disagreement about TLS memory layout, set the minimum
alignment for executable TLS segments to 8 words on ARM/AArch64, which
reserves at least 8 words of memory after the TP (2 for the ABI-specified
TCB and 6 for alignment padding). For simplicity, and because lld doesn't
know when it's targeting Android, increase the alignment regardless of
operating system.
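
To make the layout math concrete, here is a minimal sketch (an illustration, not lld's actual code; alignTo mirrors llvm::alignTo for power-of-two alignments) of the thread-pointer offset implied by the 8-word minimum:

#include <algorithm>
#include <cstdint>

// alignTo mirrors llvm::alignTo for power-of-two alignments.
static uint64_t alignTo(uint64_t Value, uint64_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

// Thread-pointer offset of an executable's TLS segment under TLS variant 1,
// given this patch's 8-word minimum segment alignment.
static uint64_t tpOffset(uint64_t Wordsize, uint64_t TlsAlign) {
  uint64_t Align = std::max(TlsAlign, 8 * Wordsize); // the new minimum
  // The segment starts at the first suitably aligned address after the
  // 2-word TCB; everything in between is reserved padding.
  return alignTo(2 * Wordsize, Align);
}
// e.g. on AArch64 with a default 8-byte TLS alignment, tpOffset(8, 8) == 64:
// the TLS segment starts at TP+64, leaving 8 words (2 for the TCB, 6 of
// padding) for Bionic's use.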

Diff Detail

Repository
rL LLVM

Event Timeline

rprichard created this revision.Oct 30 2018, 3:20 PM

I'm currently planning to use a 16-word TCB layout on all architectures for Android, but I'm not completely committed to that yet. This patch can accommodate a larger TCB. I'm also considering using a variant 2 layout instead, which would reuse some of this patch, but would require additional work to accommodate negative offsets on arm64.

Details:

Android can't use the standard TLS layout on arm64 because of the stack protector cookie conflict described above.

There is another problem with the golang runtime, which currently requires allocating the pthread keys (1KB on 32-bit, 2KB on 64-bit) after the thread pointer. On Android/arm{32,64}, golang finds a TLS slot for its "g" register by scanning forward from the thread pointer (https://github.com/golang/go/blob/0ad332d80cf3e2bbeef5f1e5f5eb50272bbde92e/src/runtime/cgo/gcc_android_arm64.c). I'm hoping that golang can find a more robust way of allocating its TLS memory, but Android might need to work with the current runtime regardless.

Allocating a much larger TCB is a simple workaround, but it tends to make future changes to the Bionic pthread layout harder.

Alternatively, maybe Android could use variant 2 everywhere instead. That's problematic on arm64 because the 12-bit and 24-bit TLS models relocate add/ld/st instructions using 12-bit unsigned offsets. arm64 also has 32-, 48-, and 64-bit models that use signed offsets, but they're less efficient, and LLVM only implements the 24-bit model.

The most obvious way to adapt the 24-bit model to variant 2 requires replacing an "add xN, xN, :tprel_hi12:VAR" instruction with a subtract instruction of an adjusted offset. (If I got the math right, it needs to subtract ((-Offset + 0xfff) >> 12).) The lower 12 bits are then added using an add or a ld/st instruction. The R_AARCH64_TLSLE_ADD_TPREL_HI12 relocation is intended to relocate an add instruction, so we'd need to define a new relocation or perhaps modify R_AARCH64_TLSLE_ADD_TPREL_HI12 to switch the add to a subtract. Both approaches seem unpleasant.
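
To make the arithmetic concrete, here is a small sketch (my illustration of the hypothetical variant 2 scheme above, not code from this patch) of how a negative TP offset would split across the subtract and the following add:

#include <cassert>
#include <cstdint>

// Split a negative thread-pointer offset for the hypothetical variant 2
// adaptation of the 24-bit local-exec model: Hi12 would be subtracted
// (shifted left by 12), then Lo12 added.
static void splitNegativeTprel(int64_t Offset, uint64_t &Hi12, uint64_t &Lo12) {
  assert(Offset < 0);
  Hi12 = uint64_t(-Offset + 0xfff) >> 12;        // subtract (Hi12 << 12)
  Lo12 = uint64_t(Offset + int64_t(Hi12 << 12)); // then add Lo12
  assert(Hi12 <= 0xfff && Lo12 <= 0xfff);        // fits the 24-bit model
}
// e.g. Offset == -1 gives Hi12 == 1 and Lo12 == 0xfff:
//   sub xN, xN, #1, lsl #12    (hypothetical subtract relocation)
//   add xN, xN, #0xfff         (:tprel_lo12_nc:-style add)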

ruiu added a comment.Oct 30 2018, 4:37 PM

Are these a new ABI and a new command line option? If so, how do TLS variables work with the existing linker?

Android/Bionic currently doesn't have ELF TLS support, so no existing binaries are affected. I'm working on adding support for ELF TLS to Android, but Android's existing use of the thread pointer conflicts with the ordinary ELF TLS ABI.

Currently, C/C++ variables declared with __thread/thread_local are implemented using emutls. The compiler generates calls to functions in libgcc/libcompiler-rt.

Part of my motivation for PT_ANDROID_TLS_TPOFF is to detect binaries that weren't linked with this new linker flag and refuse to run them. Otherwise, programs would run but silently corrupt memory.

ruiu added a comment.Oct 30 2018, 4:55 PM

I wonder why you have to make a change to the static linker. In order to add native support for TLS variables, you need to modify not only the linker but also the loader, right? Then, if you can make a change to the loader, you can perhaps make a change to libc as well, and make libc compatible with the existing TLS model, no? Am I missing something?

Yeah, I also need to modify libc and the loader. The problem is that, with the TLS local-exec (LE) access model used in executables, the static linker writes a thread-pointer-relative offset into the executable's text section, which the loader can't relocate. With the next-most-efficient model, initial-exec (IE), the linker stores TLS offsets in the GOT, which the loader does relocate. (Disabling LE in the compiler and linker in favor of IE is another way of resolving this ABI conflict, but IIRC it was also problematic; e.g., we can't do LD->LE relaxation anymore.)

e.g.: Suppose I compile for Android with the stack protection enabled, and with emutls disabled: https://godbolt.org/z/Jq38rr

The function's assembly is:

        mrs     x8, TPIDR_EL0                   // x8 = thread pointer
        ldr     x9, [x8, #40]                   // load the stack guard from TP+40
        add     x10, x8, :tprel_hi12:var1       // x10 = TP + var1's TLS offset
        add     x10, x10, :tprel_lo12_nc:var1
        str     x9, [sp, #8]                    // spill the cookie to the stack
        strb    wzr, [x10, x0]                  // write to the TLS variable
        ldr     x8, [x8, #40]                   // reload the cookie from TP+40
        ldr     x9, [sp, #8]
        cmp     x8, x9                          // stack protector check
        b.ne    .LBB0_2
.Ltmp1:
        ldp     x29, x30, [sp, #16]
        add     sp, sp, #32
        ret

The two [x8, #40] accesses are reading the stack protector cookie from TLS memory, but the function is also writing to TLS memory, and on arm64, lld would normally allocate the first TLS section at TP+16. If var1 is allocated at TP+16, it will overlap with the cookie at TP+40.

ruiu added a comment.Oct 30 2018, 5:40 PM

I'm not happy to add a new ABI and a new command line flag for the new ABI, for obvious reasons, but I'm not sure that doing so is really unavoidable even if we can modify the loader and the libc. Let me think about that. I'll get back to you tomorrow, but if you can give me a justification as to why this is really unavoidable and a new ABI is required, please explain it to me. Thanks.

rprichard updated this revision to Diff 171856.Oct 30 2018, 6:46 PM

Rebase on updated parent revision (TLS layout consolidated in getTlsTpOffset).

rprichard edited the summary of this revision. (Show Details)Oct 30 2018, 6:48 PM

Just trying to understand the implications here. Unfortunately I find I can't keep the details in my memory for more than the time I have to work on it. Can you correct me where I'm wrong?

My understanding of the current AArch64 TLS is:

  • It is variant 1
  • TCB size is two 64-bit words, with the first word being a pointer to the dynamic thread vector and the second word reserved for the implementation.
  • The TP points directly to the start of the TCB.
  • Only positive offsets are needed to access the data from the thread pointer. From memory, the load/store instructions with scaled offsets can only use unsigned offsets.
  • Android needs a TCB larger than 16 bytes (2 words) due to various slots such as the stack protector being used (https://android.googlesource.com/platform/bionic/+/master/libc/private/bionic_tls.h)
  • Android would need to mark binaries using ELF TLS regardless of whatever implementation of TLS Android chooses. Hence this patch.

One potential solution is to increase the TCB size to 16 words for Android. As I understand it, this would only affect Local Exec, as that is the model where the relocations have to take the size of the TCB into account. I also think this would limit the changes to the static linker and loader; the compiler and assembler wouldn't be affected.

The alternative solutions are to disable Local Exec, or to implement variant 2 TLS on Arm and AArch64, which would likely require compiler changes and probably wouldn't work as well on AArch64, since some of the immediate addressing modes aren't available for loads.

Personally I'd prefer not to see an implementation of variant 2 for AArch64. It isn't too difficult to test AArch64 compiler and linker changes using a native Linux machine or the qemu user-mode emulator, but I'm a bit concerned that it would be difficult for non-Android developers to test Android-specific changes. I understand the concern that 16 slots might not be enough. Perhaps reserve the last entry for a pointer to an extension vector? Or do something similar to what PPC has done and use negative offsets from TP for reserved values.

As an aside, it would help to have a page similar to Apple's https://developer.apple.com/library/archive/documentation/Xcode/Conceptual/iPhoneOSABIReference/Articles/ARM64FunctionCallingConventions.html describing Android's differences from the standard ABI.

On the patch itself, I've got a suggestion on an alternative mechanism to mark the ELF file that may be more extensible for the future.

There are a few architectural extensions coming for AArch64, such as pointer authentication (v8.3) and branch target identification (BTI, v8.5), both of which need static linker support (PLT entries) and dynamic loader support (enabling the feature for a range of addresses). These will need ELF files to be marked so that the linker and loader know what to do. The first implementation from Arm's GNU team is going to use the .note.gnu.property section to describe these properties. It may be useful to examine these to see if you can use them to mark binaries rather than introduce an OS-specific program header?
Links:
https://github.com/hjl-tools/linux-abi/wiki/Linux-Extensions-to-gABI (see https://github.com/hjl-tools/linux-abi/wiki/linux-abi-draft.pdf in particular)

I'm not sure if LLD supports these sections yet: http://lists.llvm.org/pipermail/llvm-dev/2018-March/121951.html. I've got the v8.3 and v8.5 extensions in LLD on my list of things to do once they are posted on the binutils mailing lists, so I'd need to work on support for them.

rprichard updated this revision to Diff 172002.Oct 31 2018, 1:21 PM

Update according to the latest revision of D53905 (export getTlsTpOffset).

ruiu added a comment.Oct 31 2018, 4:02 PM

This is a well-written small patch, but it introduces a new ABI for TLS variables, which makes Android's ABI different from and incompatible with any other ABI. I don't have a good idea of how to avoid that, though, given that we want to maintain compatibility with DSOs built for the current and previous versions of Android. However, I'd still like to seek a way to avoid having to define an "Android ABI".

I'd like to keep this patch open for a while. In the meantime, could you try to get feedback from a broader audience? I think if I approve this patch now, what this patch implements will become a part of the ABI that lives essentially forever, and that decision can't and shouldn't be made only by you and me.

@peter.smith Thanks for taking a look. That's a good summary of the problem.

Just trying to understand the implications here. Unfortunately I find I can't keep the details in my memory for more than the time I have to work on it. Can you correct me where I'm wrong?

My understanding of the current AArch64 TLS is:

  • It is variant 1
  • TCB size is two 64-bit words, with the first word being a pointer to the dynamic thread vector and the second word reserved for the implementation.
  • The TP points directly to the start of the TCB.
  • Only positive offsets are needed to access the data from the thread pointer. From memory, the load/store instructions with scaled offsets can only use unsigned offsets.
  • Android needs a TCB larger than 16 bytes (2 words) due to various slots such as the stack protector being used (https://android.googlesource.com/platform/bionic/+/master/libc/private/bionic_tls.h)
  • Android would need to mark binaries using ELF TLS regardless of whatever implementation of TLS Android chooses. Hence this patch.

That all sounds about right.

Ideally, Android would have used negative TP offsets on ARM from the start, e.g. like Fuchsia.

I believe most of the Bionic slots could be moved to negative slots (or equivalently, plain fields of pthread_internal_t) without breaking anything.

The stack protector cookie looks like the biggest exception. Even if we allocated a new slot for it today, new AArch64 binaries would continue to use the old slot for compatibility with old Android releases. Alternatively, we could revert D18632, but that would make binaries larger and possibly slower. If we did revert it, we'd only have to deal with existing AArch64 binaries, which is still a big problem.

The Android NDK's build systems (ndk-build and cmake) compile with -fstack-protector-strong by default, and since NDK r14, Clang has used TLS_SLOT_STACK_GUARD for AArch64. The NDK has defaulted to Clang (versus GCC) since NDK r13.

Moving the TSAN slot might also be a problem. I suspect it would quietly break existing sanitizer solibs on new Android releases.

It's common on Android to have application solibs loaded into a Zygote process (/system/bin/app_process{32,64}), and I think we can also see old vendor solibs running in the same process with new platform code (as of Project Treble).

One potential solution is to increase the TCB size to 16 words for Android. As I understand it, this would only affect Local Exec, as that is the model where the relocations have to take the size of the TCB into account. I also think this would limit the changes to the static linker and loader; the compiler and assembler wouldn't be affected.

Right -- increasing the TCB size affects the static linker, dynamic linker, and libc, but it doesn't affect the compiler or assembler.

The alternative solutions are to disable Local Exec, or to implement variant 2 TLS on Arm and AArch64, which would likely require compiler changes and probably wouldn't work as well on AArch64, since some of the immediate addressing modes aren't available for loads.

Personally I'd prefer not to see an implementation of variant 2 for AArch64. It isn't too difficult to test AArch64 compiler and linker changes using a native Linux machine or the qemu user-mode emulator, but I'm a bit concerned that it would be difficult for non-Android developers to test Android-specific changes. I understand the concern that 16 slots might not be enough. Perhaps reserve the last entry for a pointer to an extension vector? Or do something similar to what PPC has done and use negative offsets from TP for reserved values.

I agree that it's good to minimize the amount of Android-specific logic in LLVM. A larger TCB is a better fix than variant 2 in that regard, especially if variant 2 would require a new static relocation.

I don't think I'm worried about running out of slots -- as you said, Bionic should be able to use negative offsets from the TP for new slots.

Allocating an especially large TCB (e.g. 4KB) would be one way to keep Bionic compatible with golang's "g" register allocation, because Bionic could place the pthread_internal_t::key_data field immediately after the Bionic slots: https://android.googlesource.com/platform/bionic/+/6f3a56bb18628243b6dbe470222e56bb56ed10ae/libc/bionic/pthread_internal.h#124

As an aside, it would help to have a page similar to Apple's https://developer.apple.com/library/archive/documentation/Xcode/Conceptual/iPhoneOSABIReference/Articles/ARM64FunctionCallingConventions.html describing Android's differences from the standard ABI.

Yeah, that might be helpful. I think https://developer.android.com/ndk/guides/abis is the closest thing we currently have.

On the patch itself, I've got a suggestion on an alternative mechanism to mark the ELF file that may be more extensible for the future.

There are a few architectural extensions coming for AArch64, such as pointer authentication (v8.3) and branch target identification (BTI, v8.5), both of which need static linker support (PLT entries) and dynamic loader support (enabling the feature for a range of addresses). These will need ELF files to be marked so that the linker and loader know what to do. The first implementation from Arm's GNU team is going to use the .note.gnu.property section to describe these properties. It may be useful to examine these to see if you can use them to mark binaries rather than introduce an OS-specific program header?
Links:
https://github.com/hjl-tools/linux-abi/wiki/Linux-Extensions-to-gABI (see https://github.com/hjl-tools/linux-abi/wiki/linux-abi-draft.pdf in particular)

I'm not sure if LLD supports these sections yet: http://lists.llvm.org/pipermail/llvm-dev/2018-March/121951.html. I've got the v8.3 and v8.5 extensions in LLD on my list of things to do once they are posted on the binutils mailing lists, so I'd need to work on support for them.

Thanks for the links. I'll look into them more later. A note would work in general. To use .note.gnu.property, it looks like we'd need to allocate a GNU_PROPERTY_TLS_TPOFF property value. These are the currently-defined property types:

GNU_PROPERTY_STACK_SIZE             1
GNU_PROPERTY_NO_COPY_ON_PROTECTED   2
GNU_PROPERTY_LOPROC                 0xc0000000
GNU_PROPERTY_HIPROC                 0xdfffffff
GNU_PROPERTY_LOUSER                 0xe0000000
GNU_PROPERTY_HIUSER                 0xffffffff
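
For illustration, a hypothetical GNU_PROPERTY_TLS_TPOFF entry could sit in a .note.gnu.property section shaped roughly like this (the note/property framing follows the gABI-extensions draft linked above; the property type value and its payload are invented here):

#include <cstdint>

// Hypothetical GNU_PROPERTY_TLS_TPOFF note, for illustration only. The
// Elf64_Nhdr/program-property framing follows the gABI-extensions draft;
// the property type value and payload are not allocated anywhere.
const uint32_t NT_GNU_PROPERTY_TYPE_0 = 5;
const uint32_t GNU_PROPERTY_TLS_TPOFF = 0x10000000; // invented value

struct TlsTpoffNote {
  // Elf64_Nhdr
  uint32_t Namesz = 4;                      // "GNU\0"
  uint32_t Descsz = 16;                     // one 8-byte-padded property
  uint32_t Type = NT_GNU_PROPERTY_TYPE_0;
  char Name[4] = {'G', 'N', 'U', '\0'};
  // One program property entry
  uint32_t PrType = GNU_PROPERTY_TLS_TPOFF;
  uint32_t PrDatasz = 8;
  uint64_t PrData = 16 * 8;                 // required TP offset in bytes
};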

As a near-equivalent to this patch, if we had a good way of increasing an executable's TLS segment alignment to 16 words, then libc could allocate the Bionic slots in the alignment padding between the 2-word TCB and the TLS segment. In that case, Android executables would still be following the official TLS ABI, but slightly subsetted.

Increasing the alignment of the TLS segment is an interesting idea as it allows the possibility of a larger TCB and in theory could also be used by the loader to detect compatible ELF files (reject those without the alignment).

One possible way to achieve this without linker changes would be for Android's equivalent of crt0.s to have a zero-sized TLS .data section with 16-byte alignment. This would have the problem that all executables and shared libraries that didn't use TLS would still get a zero-sized TLS segment.

If changes were made to the linker, I think it would probably be best to search for the TLS segment and increase its alignment to 16 if it exists. That would work both with arbitrary linker scripts, where the author may forget to increase the alignment, and with the "internal" linker script, where in theory we could just raise the alignment to 16.

Increasing the alignment of the TLS segment is an interesting idea as it allows the possibility of a larger TCB and in theory could also be used by the loader to detect compatible ELF files (reject those without the alignment).

One possible way to achieve this without linker changes would be for Android's equivalent of crt0.s to have a zero-sized TLS .data section with 16-byte alignment. This would have the problem that all executables and shared libraries that didn't use TLS would still get a zero-sized TLS segment.

Yeah, when I was prototyping ELF TLS on Android, that was my first approach: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/arch-common/bionic/crtbegin.c. (I used a non-zero variable, but a 0-sized variable would be better.)

Putting a TLS segment in every executable seems like something we'd like to avoid. Another problem is that the compiler and/or linker tries to optimize out the dummy section, e.g. with --gc-sections.

If changes were made to the linker, I think it would probably be best to search for the TLS segment and increase its alignment to 16 if it exists. That would work both with arbitrary linker scripts, where the author may forget to increase the alignment, and with the "internal" linker script, where in theory we could just raise the alignment to 16.

I think this would be OK. If we go this route, it might be nice to modify this behavior, which tends to waste space after the executable's TLS segment on non-variant-2 targets:

// The TLS pointer goes after PT_TLS for variant 2 targets. At least glibc
// will align it, so round up the size to make sure the offsets are
// correct.
if (P->p_type == PT_TLS && P->p_memsz)
  P->p_memsz = alignTo(P->p_memsz, P->p_align);

On variant 2 targets, it's necessary for getTlsTpOffset() to use a p_memsz aligned up to p_align, but it's (presumably) not necessary for the PT_TLS segment's size to actually be rounded up in the ELF output. (ld.bfd doesn't round up the size. ld.gold/lld do round it up, but running binutils strip recomputes the PT_TLS size without the rounding.)

On other (non-variant-2) architectures, the extra rounding prevents the loader/libc from allocating anything in the area immediately after the executable's TLS segment. Normally it could put another solib's TLS there, or perhaps some per-thread data. Maybe this is OK, though, if Android only needs 16 reserved words, which is my current goal.

If we decided to change this code, we could:

  • restrict it to variant 2 only, or
  • move the alignTo() call to getTlsTpOffset() and apply it to variant 2 only (see the sketch below)
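
A sketch of that second option (my illustration, assuming the rounding in Writer.cpp is removed; the shape follows lld's getTlsTpOffset()):

static int64_t getTlsTpOffset() {
  switch (Config->EMachine) {
  case EM_ARM:
  case EM_AARCH64:
    // Variant 1. A 2-word TCB plus alignment padding precedes the segment.
    return alignTo(Config->Wordsize * 2, Out::TlsPhdr->p_align);
  case EM_386:
  case EM_X86_64:
    // Variant 2. The TP sits at the aligned end of the TLS segment, so do
    // the rounding here rather than in the ELF output's p_memsz.
    return -(int64_t)alignTo(Out::TlsPhdr->p_memsz, Out::TlsPhdr->p_align);
  default:
    llvm_unreachable("unhandled Config->EMachine");
  }
}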
ruiu added a comment.Nov 1 2018, 4:05 PM

Setting a large alignment on the TLS segment to make room between the TP and the TLS segment is an interesting idea. It is tempting, but at the same time it feels a bit too subtle and perhaps fragile. It looks like a very indirect way to do what we actually want to do.

If I understand correctly, what we want to do boils down to this:

 // A TLS symbol's virtual address is relative to the TLS segment. Add a
 // target-specific adjustment to produce a thread-pointer-relative offset.
 static int64_t getTlsTpOffset() {
   switch (Config->EMachine) {
   case EM_ARM:
   case EM_AARCH64:
     // Variant 1. The thread pointer points to a TCB with a fixed 2-word size,
     // followed by a variable amount of alignment padding, followed by the TLS
     // segment.
-    return alignTo(Config->Wordsize * 2, Out::TlsPhdr->p_align);
+    if (Android)
+      return alignTo(Config->Wordsize * 6, Out::TlsPhdr->p_align);
+    else
+      return alignTo(Config->Wordsize * 2, Out::TlsPhdr->p_align);
   case EM_386:
   case EM_X86_64:
     // Variant 2. The TLS segment is located just before the thread pointer.
     return -Out::TlsPhdr->p_memsz;

Is this understanding correct? If so, one radical but simple approach would be to always skip the first 6 words after the TP on ARM and AArch64, whether it is Android or not. Obviously that wastes 4 words per thread, but since a thread is not a lightweight data structure, it might be negligible.

Setting a large alignment on the TLS segment to make room between the TP and the TLS segment is an interesting idea. It is tempting, but at the same time it feels a bit too subtle and perhaps fragile. It looks like a very indirect way to do what we actually want to do.

If I understand correctly, what we want to do boils down to this:

 // A TLS symbol's virtual address is relative to the TLS segment. Add a
 // target-specific adjustment to produce a thread-pointer-relative offset.
 static int64_t getTlsTpOffset() {
   switch (Config->EMachine) {
   case EM_ARM:
   case EM_AARCH64:
     // Variant 1. The thread pointer points to a TCB with a fixed 2-word size,
     // followed by a variable amount of alignment padding, followed by the TLS
     // segment.
-    return alignTo(Config->Wordsize * 2, Out::TlsPhdr->p_align);
+    if (Android)
+      return alignTo(Config->Wordsize * 6, Out::TlsPhdr->p_align);
+    else
+      return alignTo(Config->Wordsize * 2, Out::TlsPhdr->p_align);
   case EM_386:
   case EM_X86_64:
     // Variant 2. The TLS segment is located just before the thread pointer.
     return -Out::TlsPhdr->p_memsz;

Is this understanding correct? If so, one radical but simple approach would be to always skip the first 6 words after the TP on ARM and AArch64, whether it is Android or not. Obviously that wastes 4 words per thread, but since a thread is not a lightweight data structure, it might be negligible.

We wouldn't be able to do that on non-Android targets, because libc is responsible for initializing the thread pointer to point at a 2-word TCB followed by the TLS segment. Other libcs would need to use a 6-word TCB instead. (e.g. musl would need its GAP_ABOVE_TP value changed -- https://git.musl-libc.org/cgit/musl/tree/arch/aarch64/pthread_arch.h?h=v1.1.20#n9)

The Android libc could use a 6-word TCB, so I believe the patch above would work. I have a few reservations:

  • If we reserve 6 words, maybe it makes sense to reserve a bit more to avoid moving the TSAN slot?
  • We probably still need a PT_ANDROID_TLS_TPOFF or equivalent note, because without it, it's too easy to create executables with memory corruption.
  • It simplifies Bionic if x86 and ARM use the same TLS layout. One justification for variant 2 in Drepper's TLS document was backwards compatibility with existing thread pointer usage, and since then, new architectures have used variant 1 (or a fixed offset) instead. If x86 used a variant 1 layout, then Bionic likely wouldn't have to implement variant 2. I'm a little hesitant to diverge unnecessarily from the x86 ABI, but I can't find a specific reason not to. I wouldn't object to lld continuing to conform to the standard x86 TLS ABI, though.

Aside: I don't think lld currently knows when it's targeting Android?

ruiu added a comment.Nov 1 2018, 5:08 PM

Yes, I think you are right. We cannot simply increase the TCB size to 6 words. It is an ABI-breaking change.

What I'm currently seeking is a simple way to create executables/DSOs that are compatible on both non-Android and Android, without introducing any new command line flag or target-dependent logic. I believe it is OK to waste a few words on non-Android if we can achieve that goal. Another idea in line with that is to always set a >8-word alignment (you suggested 16, but I don't know why it has to be 16) on the TLS segment on ARM or AArch64. Maybe that could produce a binary that is compatible with both Android and non-Android?

What I'm currently seeking is a simple way to create executables/DSOs that are compatible on both non-Android and Android, without introducing any new command line flag or target-dependent logic. I believe it is OK to waste a few words on non-Android if we can achieve that goal. Another idea in line with that is to always set a >8-word alignment (you suggested 16, but I don't know why it has to be 16) on the TLS segment on ARM or AArch64. Maybe that could produce a binary that is compatible with both Android and non-Android?

We'd only need 9 words to reserve all the existing Bionic TLS slots. (TSAN is slot 8, which occupies the 9th word after the TP, just beyond an 8-word reservation.) 16 provides a few spare slots for the future, assuming Bionic wants to keep using positive slot indices (but it really could just use negative indices). Using alignment for the reservation would also require a power of two.

If we can move the TSAN slot, then an alignment of >8 would work for both Android and non-Android.

It looks like the ASAN runtime libraries shipped with NDK r18 don't use the TSAN slot, and the NDK doesn't support -fsanitize=thread (yet), so maybe moving the slot is feasible?

srhines added a subscriber: yabinc.Nov 1 2018, 5:22 PM

Yabin did some of the work to bring TSan to Android. Perhaps he knows whether this has significant use. Considering that it is meant as a debugging aid, I would actually be comfortable with moving it. You are also correct that we never officially marked it as a feature of the NDK (or platform). That is something we want to do, though.

ruiu added a comment.Nov 1 2018, 5:25 PM

I don't know if moving the TSAN slot is feasible, but if it is, I think I prefer the approach of always setting an 8-word alignment on the TLS segment on ARM/AArch64. That means that on non-Android OSes there are 6 unused words between the TP and the beginning of the TLS segment per thread. I think that's acceptable given that, by having that gap, we can avoid introducing a new ABI and a new static linker configuration/command line flag.

rprichard added a subscriber: enh.Nov 1 2018, 5:45 PM

@eugenis @enh

I'm wondering whether moving the TLS_SLOT_TSAN slot is feasible. E.g., this sanitizer code would stop working, but it could presumably be fixed in a newer sanitizer library:

// The Android Bionic team has allocated a TLS slot for TSan starting with N,
// given that Android currently doesn't support ELF TLS. It is used to store
// Sanitizers thread specific data.
static const int TLS_SLOT_TSAN = 8;

ALWAYS_INLINE uptr *get_android_tls_ptr() {
  return reinterpret_cast<uptr *>(&__get_tls()[TLS_SLOT_TSAN]);
}

As to where to move it:

  • use initial-exec ELF TLS (but then the sanitizer solib requires a new version of Bionic and must not be loaded via dlopen, directly or indirectly).
  • use a negative TP offset on arm (and positive on x86).
  • use a mechanism for allocating a slot (something like my pthread_alloc_tls_word_np proposal for fixing golang on http://b/118381796#comment7).
  • change one of the existing slots to a normal pthread field. (e.g. TLS_SLOT_ERRNO or TLS_SLOT_DLERROR)

The issue is that our current use of Bionic slots disagrees with the standard arm{32,64} ELF TLS ABI, which requires negative slots on arm and positive slots on x86.

@cryptoad
TLS_SLOT_TSAN is used in scudo, tsan, and hwasan. AFAIK we don't yet depend on binary compatibility for any of them.

Incompatibility with dlopen is not great. At least scudo would need to support that case, I think, as a custom allocator in an Android application.

Changing to a negative or smaller offset should be fine.

Note that HWASan needs to access this slot from instrumented code, so a dynamic offset will not do. It needs to be a compile-time constant.

On non-Android Linux, it looks like hwasan and msan instrumentation use an IE-model TLS variable (via the GOT), which is more of a load-time constant than a compile-time one.

You are right, initial-exec TLS would work for hwasan and msan. It would cost about 2 instructions and 1 load per function, compared to the fixed slot. Not great, but tolerable.

In the future, with hardware MTE, we'd really like a dedicated slot though.

FWIW: I've published a draft of a document summarizing ELF TLS with emphasis on issues affecting Bionic. It's at https://android.googlesource.com/platform/bionic/+/master/docs/elf-tls.md. These sections are relevant to this LLVM patch:

The second link lists several ways of addressing problems with the standard TLS layout on Android. It ought to list the current proposal here (minimum TLS alignment on ARM), but I forgot to add it.

ruiu added a comment.Nov 13 2018, 12:22 AM

Thank you very much for publishing the document! I found it very well written. It is not only good documentation for understanding what you are proposing for adding TLS support to Android, but also a good introduction to how thread-local storage works in general.

Back to Android TLS, it seems like setting an alignment that is greater than 8 words to the TLS segment should work on both Android and non-Android, and if that's the case, I believe we should do that. What do you think?

I implemented the proposed fix for ARM/AArch64 (i.e. set a minimum alignment of 8 words on an executable's TLS segment). I think this will work for Android -- i.e. we'll leave the x86 layout alone. Even if we *did* decide to change the x86 TLS layout later, it'd probably make sense to use this approach on ARM.

On a somewhat related topic:

lld rounds up a TLS segment's size to a multiple of its alignment, which tends to waste TLS memory, because it creates alignment padding where libc could have otherwise allocated a TLS object with a lower alignment requirement.

e.g.: With this patch, if an ARM/AArch64 executable and solib both declare a single TLS pointer, we'd have this TLS layout:

2-word TCB
... 6 words of padding for Bionic's TCB ...
executable's pointer
... 7 words of padding ...
solib's pointer

The 7 words of padding are unnecessary, and libc could avoid them if the PT_TLS segment size weren't rounded up.
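
To spell out the numbers (a sketch assuming 8-byte words and this patch's 8-word minimum executable TLS alignment; alignTo mirrors llvm::alignTo):

#include <cstdint>
#include <cstdio>

static uint64_t alignTo(uint64_t Value, uint64_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

int main() {
  const uint64_t Word = 8, ExeAlign = 8 * Word;       // 64-byte alignment
  uint64_t ExeOff = alignTo(2 * Word, ExeAlign);      // TP+64: exe's pointer
  uint64_t ExeMemsz = alignTo(Word, ExeAlign);        // 8 rounded up to 64
  uint64_t SoOff = alignTo(ExeOff + ExeMemsz, Word);  // TP+128: solib's pointer
  uint64_t Unrounded = alignTo(ExeOff + Word, Word);  // TP+72 without rounding
  std::printf("solib TLS at TP+%llu, or TP+%llu if p_memsz weren't rounded\n",
              (unsigned long long)SoOff, (unsigned long long)Unrounded);
}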

I looked at other tools:

  • ld.bfd does not round up a TLS segment's size.
  • ld.gold *does* round up.
  • If gold/lld round up the size, binutils' strip brings it back down, apparently recomputing it from the sizes of the .tdata/.tbss sections.

The lld padding was added in this commit:
https://reviews.llvm.org/rL252352

The gold padding was added in this commit:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=96a2b4e4bf67a7430e7ddd06a1d0730a49e6b07e

If we did decide to remove this behavior, I could submit another patch for it.

rprichard retitled this revision from [ELF] Allow configuring the TLS layout for an Android executable to [ARM][AArch64] Increase TLS alignment to reserve space for Android's TCB.Dec 10 2018, 11:23 PM
rprichard edited the summary of this revision. (Show Details)

I think this CL is ready for review now.

There was a bit of trouble with switching a buildbot over to aosp-master for testing, but that's fixed now, and hwasan and compiler-rt are now switched over from TLS_SLOT_TSAN(8) to TLS_SLOT_SANITIZER(6).

I also have a Bionic CL in review that rearranges TLS memory:

I've got a small suggestion about the comment. Otherwise I'm happy with the change. I think that it is reasonable to increase the alignment for all operating systems as the amount is not likely to be significant for platforms that make use of TLS.

ELF/Writer.cpp
2183 ↗(On Diff #177663)

May I suggest moving the added code after the // The TLS pointer goes after PT_TLS for variant 2 targets comment? I think the comment here would be easier to understand if it came sequentially afterwards. It may also be worth moving the comment inside the if statement, or clarifying that it is Arm/AArch64 only at the moment.

ruiu accepted this revision.Jan 8 2019, 2:19 PM

LGTM. I'm happy with this change. Thank you for doing this.

Please address Peter's comment before committing.

This revision is now accepted and ready to land.Jan 8 2019, 2:19 PM
rprichard marked an inline comment as done.Jan 8 2019, 2:48 PM
rprichard added inline comments.
ELF/Writer.cpp
2183 ↗(On Diff #177663)

The code itself can't be reordered because the first statement modifies P->p_align and the second one reads P->p_align. Maybe something like this?

if (P->p_type == PT_TLS && P->p_memsz) {
  if (!Config->Shared &&
      (Config->EMachine == EM_ARM || Config->EMachine == EM_AARCH64)) {
    // On ARM/AArch64, reserve extra space (8 words) between the thread
    // pointer and an executable's TLS segment by overaligning the segment.
    // This reservation is needed for backwards compatibility with Android's
    // TCB, but for simplicity it is also done on other operating systems.
    P->p_align = std::max<uint64_t>(P->p_align, Config->Wordsize * 8);
  }

  // The TLS pointer goes after PT_TLS for variant 2 targets. At least glibc
  // will align it, so round up the size to make sure the offsets are
  // correct.
  P->p_memsz = alignTo(P->p_memsz, P->p_align);
}
rprichard updated this revision to Diff 180749.Jan 8 2019, 2:54 PM
rprichard edited the summary of this revision. (Show Details)

Move comment into if-statement and clarify that it's only for ARM and AArch64.

ruiu added a comment.Jan 8 2019, 2:56 PM

LGTM

ELF/Writer.cpp
2190 ↗(On Diff #180749)

Perhaps it is good to leave a comment as to why this is !Config->Shared only.

rprichard updated this revision to Diff 180755.Jan 8 2019, 3:35 PM

Add an explanation of how overaligning the executable TLS segment reserves the extra space Android needs for its TCB slots.

I think this helps clarify why the Config->Shared case doesn't do the overalignment. lld could overalign solibs, but it isn't necessary, and it would create more alignment padding, wasting TLS memory.

rprichard updated this revision to Diff 180757.Jan 8 2019, 3:47 PM

Remove the blank line.

rprichard updated this revision to Diff 180764.Jan 8 2019, 4:08 PM

Update comment again.

This revision was automatically updated to reflect the committed changes.
nsz added a subscriber: nsz.May 13 2019, 9:27 AM

I think this should be fixed in the tools/runtimes that assume a fixed TLS offset ABI that conflicts with the ELF TLS ABI.

E.g., allocating TLS in Bionic's crt0 is a valid solution; changing Bionic internals and the compilers that make assumptions about Bionic internals is another solution (but then handling backwards compatibility becomes difficult).

"Putting a TLS segment in every executable seems like something we'd like to avoid. Another problem is that the compiler and/or linker tries to optimize out the dummy section. e.g. with --gc-sections"

Why? It sounds to me like that has the right semantics: the TLS that Bionic uses internally gets reserved in every executable (you can create a TLS variable and reference it from Bionic to avoid --gc-sections).

Herald added a project: Restricted Project. May 13 2019, 9:27 AM
Herald added a subscriber: MaskRay.