This is an archive of the discontinued LLVM Phabricator instance.

[lld] Support copy relocations for TLS symbols
AbandonedPublic

Authored by jrtc27 on Nov 18 2019, 7:59 AM.

Details

Summary

Previously, if Local Exec was used to access a symbol not in the
executable, we would error with "symol '<sym>' has no type". Instead, we
should emit a copy relocation using .tbss.

Event Timeline

jrtc27 created this revision.Nov 18 2019, 7:59 AM
jrtc27 added a comment.EditedNov 18 2019, 11:51 AM

Note that, for both GCC and LLVM, the default model used for TLS variables in position-depedent code is LE even if extern, relying on TLS copy relocations. GCC explicitly mentions this in gcc/config/riscv/riscv.c with /* Since we support TLS copy relocs, non-PIC TLS accesses may all use LE. */, and LLVM says // Non-PIC TLS lowering should always use the LocalExec model. (although perhaps that should mention *why* this is ok). Hence this is needed in order to link some real-world code.

Note that, for both GCC and LLVM, the default model used for TLS variables in position-depedent code is LE even if extern, relying on TLS copy relocations.

So RISC-V is in its own camp here. I played with other architectures. All use Initial Exec (a GOT for TPREL) for extern TLS variables.

i386: R_386_TLS_IE
ARM: R_ARM_TLS_IE32
MIPS: R_MIPS_TLS_GOTTPREL
AArch64: R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC
PowerPC32: R_PPC_GOT_TPREL16
PowerPC64: R_PPC64_GOT_TPREL16_LO_DS

Can you refer to me notes/links about the decision choice? This looks to me a bad design.

ruiu added a comment.Nov 18 2019, 6:52 PM

Copy relocation is a hack to allow non-PIC object files to be linked with DSOs. And it is a bit dangerous hack because people don't usually understand the exact semantics of the copy relocation and can make an ABI-breaking change without knowing it. I described the problem here: https://github.com/llvm/llvm-project/blob/master/lld/ELF/Relocations.cpp#L517-L558

Copy relocation also wastes a little bit of resource because for each copy relocation we'll have two copies of the same piece of data in memory.

Therefore, I believe that we should in general avoid copy relocations. So I'm inclined to just tell users what to do to avoid copy relocations rather than silently working around and hiding the issue. All users have to do is essentailly recompile object files with -fPIC, and that shouldn't be too hard.

What do you think?

I would agree that, for other architectures, not supporting copy relocations for TLS makes sense, and a world without copy relocations would be good. However, at the end of the day, this is what the RISC-V toolchains out there currently rely upon, and it would be very sad to force an incompatibility, especially since -fPIC introduces extra indirection in many cases that aren't needed, in particular because code generation has to assume symbol preemption. It does seem at least however that -fPIE yields the saner IE model again for both LLVM and GCC.... which begs the question why, as PIE vs non-PIE makes absolutely zero difference to the TLS addresses and semantics. I haven't yet found a discussion as to why this decision was made; the comment in the GCC source was the only reference I could even find that acknowledged this would require copy relocations. Ultimately, Clang+LLD can't currently be used to build FreeBSD as a result of this (nor can GCC+LLD), and telling people they can't use the default compiler mode that has always worked doesn't seem like the right answer; either LLVM and GCC need to stop forcing TLS copy relocations for position-dependent executables, or we need to support it here in LLD (or both...).

@palmer-dabbelt Do you have any comments about this?

ruiu added a comment.Nov 18 2019, 11:33 PM

James, I'm sorry that I think I missed your second comment. So if gcc and clang uses LE model even for extern TLS variables, I agree with you that it's odd. They may have wanted to avoid the cost of more general access models, but because the linker is able to relax them, that shouldn't be a big problem.

What's strange is that TLS GD(/LD)/IE *don't* have linker relaxations currently even in BFD (and no R_RISCV_RELAX relocations are emitted since both compilers know it's not currently a relaxable sequence) despite RISC-V's love of linker relaxations for other things, so the model chosen at compile-time is set in stone. The only exception is that a 32-bit LE offset gets relaxed into a 12-bit LE offset when possible, but that's not changing the model. It really feels like they should have gone with IE and defined linker relaxations (something that even other architectures have here, albeit they pad with NOPs rather than deleting code) instead of this approach, but alas that's what we have.

MaskRay added a comment.EditedNov 19 2019, 9:53 AM

What's strange is that TLS GD(/LD)/IE *don't* have linker relaxations currently even in BFD (and no R_RISCV_RELAX relocations are emitted since both compilers know it's not currently a relaxable sequence) despite RISC-V's love of linker relaxations for other things, so the model chosen at compile-time is set in stone. The only exception is that a 32-bit LE offset gets relaxed into a 12-bit LE offset when possible, but that's not changing the model. It really feels like they should have gone with IE and defined linker relaxations (something that even other architectures have here, albeit they pad with NOPs rather than deleting code) instead of this approach, but alas that's what we have.

The ABI does not say about any relaxation: code sequences can not be changed but dynamic DTPMOD relocations can be omitted if applicable, e.g. https://sourceware.org/bugzilla/show_bug.cgi?id=24676 https://sourceware.org/bugzilla/show_bug.cgi?id=24673.

I think we should improve the current TLS ABI:

  • Replace traditional General Dynamic and Local Dynamic with TLSDESC (https://github.com/riscv/riscv-elf-psabi-doc/issues/94). Define the relaxations to Initial Exec and Local Exec (in lld, R_RELAX_TLS_GD_TO_IE and R_RELAX_TLS_GD_TO_LE).
  • Define how Initial Exec relaxes to Local Exec (in lld, R_RELAX_IE_TO_LE).

I made a comment on https://sourceware.org/bugzilla/show_bug.cgi?id=23825 . Rich Felker pointed out that https://github.com/riscv/riscv-gcc/pull/118, which is applicable on GCC 7,8 and 9, can disallow Local Exec for extern TLS variable accesses https://github.com/richfelker/musl-cross-make/commit/52527c462f255065bcb053f14a13b747a8a30005. So we have a simple GCC fix. If we can reach a consensus, I think we can drop this lld change (sorry but I feel that allowing this case seems really, really bad).

Created https://github.com/riscv/riscv-elf-psabi-doc/issues/122

I would suggest stepping back and looking at the whole picture. TLS is not special here; for normal globals, both code models also rely on copy relocations, and there are likely to be many more of those. If you think TLS copy relocations are bad then you should also be against using HI/LO and PCREL_HI/LO for extern symbols and require those to be indirected through the GOT.

dalias added a subscriber: dalias.Nov 19 2019, 11:00 AM

Copy relocations are extremely bad, in all places. They waste memory, possibly massive amounts (e.g. const char libfoo_table[1<<24] = { ... }; in a library transforms from 16M of shared text to 16M of per-instance data), and they make the size of structures and arrays exported from shared libraries part of the ABI (precluding extending them at the end in future library versions). They should not be used anywhere, and the only historical reason they exist at all is that, prior to the existence of shared libraries, the existing ABIs on platforms at the time allowed absolute references to data objects embedded in program text. There is no good reason to allow this on new architectures not encumbered by such legacy ABIs, and certainly no reason to introduce it for TLS, where the extreme memory waste is not per-process but per-thread.

I'm pretty sure copy relocations would also interact really badly with STV_PROTECTED TLS objects - the shared library would be accessing its own copy via local-dynamic model while the main program would be accessing the copy it made through the copy relocation. While STV_PROTECTED has all sorts of problems, prior to the introduction of TLS copy relocations, it *did* work as intended for TLS, even if not for global data and functions.

MaskRay added a comment.EditedNov 19 2019, 11:28 AM

TLS copy relocation requires cooperation from the dynamic loader. Just skimmed through several dynamic loader implementations:

glibc ld.so copies the data from the shared object to the TLS image in the executable, but musl and FreeBSD ld.so do not. So even if lld supported TLS copy relocation, it would not work on them because the TLS image would not be properly initialized.

The discussion about the relaxation can be moved to https://github.com/riscv/riscv-elf-psabi-doc/issues/122 We are proposing extensions to the ABI and deprecating things, existing executables/shared objects linked against glibc can still work. binutils cannot simply drop the change because it still needs to support existing object files/archives for a while...

ruiu added a comment.Nov 19 2019, 7:56 PM

I think I agree with dalias. I don't get a point to allow users to use copy relocations liberally for a new ISA/ABI like RISC-V, and it looks like it is a problem that we should fix compilers as soon as possible, rather than adopting the linker to what compilers emit at the moment. Even though I can see the reason why you wanted to add this to lld, I can also imagine that that would be a regrettable decision in the long term.

jrtc27 added a comment.EditedNov 21 2019, 10:13 AM

I think the fact that many runtime linkers out there don't handle it correctly is a pretty compelling reason in and of itself (but I agree that, being a new ABI, it should avoid introducing more copy relocations). The fact that GCC relies on this behaviour (which glibc and bfd correctly implement) but doesn't document it anywhere other than in the compiler source itself is worrying, since it leads to the situation we have now where binaries built by GCC+bfd for other systems (such as FreeBSD) may not run correctly, as they will copy garbage from early on in the mapping.

EDIT: I see the discussion linked by MaskRay has already reached the conclusion that GCC will stop emitting these. I will file a patch for LLVM to do the same if nobody beats me to it (relaxations can come later once they are finalised).

I think the fact that many runtime linkers out there don't handle it correctly is a pretty compelling reason in and of itself (but I agree that, being a new ABI, it should avoid introducing more copy relocations). The fact that GCC relies on this behaviour (which glibc and bfd correctly implement) but doesn't document it anywhere other than in the compiler source itself is worrying, since it leads to the situation we have now where binaries built by GCC+bfd for other systems (such as FreeBSD) may not run correctly, as they will copy garbage from early on in the mapping.

I am also worried about defects in fait accomplis. (I can regret nothing but I didn't start watching RISC-V toolchain development earlier...) This issue has demonstrated again how an alternative implementation (libc, compiler, linker, binary utilities, etc) can be valuable by improving the design!

EDIT: I see the discussion linked by MaskRay has already reached the conclusion that GCC will stop emitting these. I will file a patch for LLVM to do the same if nobody beats me to it (relaxations can come later once they are finalised).

I wanted to do this, but you claimed this first. Go ahead:) Given the consensus that TLS copy relocation is a bad idea and newer toolchains do not need to support it, can be drop this patch?

jrtc27 abandoned this revision.Nov 21 2019, 4:49 PM

EDIT: I see the discussion linked by MaskRay has already reached the conclusion that GCC will stop emitting these. I will file a patch for LLVM to do the same if nobody beats me to it (relaxations can come later once they are finalised).

I wanted to do this, but you claimed this first. Go ahead:) Given the consensus that TLS copy relocation is a bad idea and newer toolchains do not need to support it, can be drop this patch?

Oh, yes, I intended to abandon this.