This is an archive of the discontinued LLVM Phabricator instance.

[LLD][PowerPC] Add check in LLD to produce an error for missing TLSGD/TLSLD
ClosedPublic

Authored by stefanp on Aug 14 2020, 1:34 PM.

Details

Summary

The function __tls_get_addr is used to get the address of an object that is Thread Local Storage.
It needs to have two relocations on it.
One relocation is for the function call itself and it is either R_PPC64_REL24 or R_PPC64_REL24_NOTOC.
The other is R_PPC64_TLSGD or R_PPC64_TLSLD for the symbol that is having its address computed.

In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS. The GNU linker has implemented support for correct handling of calls to __tls_get_addr with a missing relocation. Unfortunately, we didn't notice that the IBM XL compiler did not handle TLS according to the updated ABI until we tried linking XL compiled libraries with LLD. As a result, there is a lot of code out there in various libraries compiled with XL that have this problem.

This patch adds a new error check in LLD that makes sure calls to __tls_get_addr are not missing the TLSGD/TLSLD relocation.

Diff Detail

Event Timeline

stefanp created this revision.Aug 14 2020, 1:34 PM
stefanp requested review of this revision.Aug 14 2020, 1:34 PM

First, if this issue is not common enough, I'd rather we don't have a diagnostic at all.

If we do need it, note that *(i - 2); may be out of bounds.

Last, we should avoid pre-built object files. You can synthesize them with yaml2obj. The preferred style is llvm/test/tools/llvm-readobj/ELF/*.test. lld/test/ELF has some yaml2obj tests as well.

First, if this issue is not common enough, I'd rather we don't have a diagnostic at all.

If we do need it, note that *(i - 2); may be out of bounds.

Last, we should avoid pre-built object files. You can synthesize them with yaml2obj. The preferred style is llvm/test/tools/llvm-readobj/ELF/*.test. lld/test/ELF has some yaml2obj tests as well.

In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS. The GNU linker has implemented support for correct handling of calls to __tls_get_addr with a missing relocation. Unfortunately, we didn't notice that the IBM XL compiler did not handle TLS according to the updated ABI until we tried linking XL compiled libraries with LLD. As a result, there is a lot of code out there in various libraries compiled with XL that have this problem.

The plan is to initially add a diagnostic message regarding the missing relocation and fail to link. This will prevent run-time errors. The following step will be to implement support for correctly linking such code along with a warning. While it would be great to recompile any code that has this missing relocation, we understand that our clients have many reasons they can't easily do so.

NeHuang added inline comments.Aug 17 2020, 11:39 AM
lld/ELF/Relocations.cpp
1313

Better to execute line 1312 and 1313 only when

if (((type == R_PPC64_REL24 || type == R_PPC64_REL24_NOTOC) &&
     sym.getName() == "__tls_get_addr")

First, if this issue is not common enough, I'd rather we don't have a diagnostic at all.

If we do need it, note that *(i - 2); may be out of bounds.

This needs to be addressed.

Last, we should avoid pre-built object files. You can synthesize them with yaml2obj. The preferred style is llvm/test/tools/llvm-readobj/ELF/*.test. lld/test/ELF has some yaml2obj tests as well.

In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS. The GNU linker has implemented support for correct handling of calls to __tls_get_addr with a missing relocation. Unfortunately, we didn't notice that the IBM XL compiler did not handle TLS according to the updated ABI until we tried linking XL compiled libraries with LLD. As a result, there is a lot of code out there in various libraries compiled with XL that have this problem.

OK. I think the justification needs to be moved to the description. Such diagnostics (for wrong relocation type usage) are not common.

The test can be converted to assembly by dropping a@tlsgd in bl __tls_get_addr(a@tlsgd)

The plan is to initially add a diagnostic message regarding the missing relocation and fail to link. This will prevent run-time errors. The following step will be to implement support for correctly linking such code along with a warning. While it would be great to recompile any code that has this missing relocation, we understand that our clients have many reasons they can't easily do so.

This will add a string comparison in handling of every R_PPC64_REL24 (and make the code look ugly). I have read nearly every commit in the directory in the past 2 years. This kind of workaround is very unusual. I'd still suggest you consider whether such compatibility is really needed. Have other approaches (e.g. patching object files with a separate binary utility) been considered?

MaskRay requested changes to this revision.Aug 17 2020, 1:03 PM
This revision now requires changes to proceed.Aug 17 2020, 1:03 PM
stefanp updated this revision to Diff 286133.Aug 17 2020, 1:51 PM

I've tried to address the concerns for this patch. I'm sorry that it took so long to make the code changes.

I did try to generate the assembly equivalent of the tests but those will not work. The PowerPC MC layer will assert if I try to generate an object file from assembly that does not have the correct relocations.
I did the fix the tests using the original suggestion of yaml2obj.

I also fixed the out of bounds problem by passing the start iterator to the function.

stefanp edited the summary of this revision. (Show Details)Aug 17 2020, 1:59 PM

I've tried to address the concerns for this patch. I'm sorry that it took so long to make the code changes.

I did try to generate the assembly equivalent of the tests but those will not work. The PowerPC MC layer will assert if I try to generate an object file from assembly that does not have the correct relocations.
I did the fix the tests using the original suggestion of yaml2obj.

I also fixed the out of bounds problem by passing the start iterator to the function.

The YAML tests share a large portion in common. You can find some existing use cases of yamlobj -D. It is also possible to rewrite the tests with assembly, with might be even simpler. You can use .reloc to synthesize relocations.

lld/ELF/Relocations.cpp
1330

i-2 may go out of bounds.

1332

if

1334

errorOrWarn

NeHuang added inline comments.Aug 17 2020, 3:24 PM
lld/ELF/Relocations.cpp
1332

Why do we also need prevRel.r_offset != rel.r_offset? Is it for the case that previous relocation type is R_PPC64_TLSGD or R_PPC64_TLSLD but with a wrong relocation offset?

stefanp updated this revision to Diff 286241.Aug 18 2020, 4:00 AM

I've merged a couple of the tests using yaml2obj -D. I would prefer not to merge the tests any further because the actual code in the .text section is different between TLSGD and TLSLD.

I am unable to use the assembly files. The problem is that __tls_get_addr is a special name for the PPC backend and we cannot generate an object file from incorrect assembly. For example:

$ cat GeneralDynamic.s
	.text
GeneralDynamic:                         # @GeneralDynamic
	addis 3, 2, x@got@tlsgd@ha
	addi 3, 3, x@got@tlsgd@l
	bl __tls_get_addr()
	blr

Produces:

$ llvm-mc GeneralDynamic.s --filetype=obj -o /dev/null
GeneralDynamic.s:5:20: error: unknown token in expression
        bl __tls_get_addr()
                          ^
GeneralDynamic.s:5:20: error: invalid TLS call expression
        bl __tls_get_addr()
                          ^

I am also a little confused about the i - 2 overflow because I think I already check that i - 1 is not the start of the array.

I believe I have addressed the other two comments that you had.

stefanp marked 2 inline comments as not done.Aug 18 2020, 4:00 AM
stefanp added inline comments.
lld/ELF/Relocations.cpp
1330

Sorry, I'm probably missing something here.
I have checked that (i-1) is not array.begin(). I assumed that was the way this could go out of bounds (by iterating before the start of the array). How else can this go out of bounds?

1332

Yes, that is why we need it. Both of the relocations need to be in the same place as the call to __tls_get_addr and if we don't check the offsets we can technically have a situation where the TLSGD/TLSLD does come before the call but is not on the call.

I am unable to use the assembly files. The problem is that __tls_get_addr is a special name for the PPC backend and we cannot generate an object file from incorrect assembly. For example:

$ cat GeneralDynamic.s
	.text
GeneralDynamic:                         # @GeneralDynamic
	addis 3, 2, x@got@tlsgd@ha
	addi 3, 3, x@got@tlsgd@l
	bl __tls_get_addr()
	blr

Produces:

$ llvm-mc GeneralDynamic.s --filetype=obj -o /dev/null
GeneralDynamic.s:5:20: error: unknown token in expression
        bl __tls_get_addr()
                          ^
GeneralDynamic.s:5:20: error: invalid TLS call expression
        bl __tls_get_addr()
                          ^

I think that is because the parens trigger extra parsing. If we omit them we should be able to parse the assembly.

  .text
GeneralDynamic:                         # @GeneralDynamic
  addis 3, 2, x@got@tlsgd@ha
  addi 3, 3, x@got@tlsgd@l
  bl __tls_get_addr
  blr

produces

Disassembly of section .text:

0000000000000000 <GeneralDynamic>:
   0:   00 00 62 3c     addis   r3,r2,0
                        0: R_PPC64_GOT_TLSGD16_HA       x
   4:   00 00 63 38     addi    r3,r3,0
                        4: R_PPC64_GOT_TLSGD16_LO       x
   8:   01 00 00 48     bl      8 <GeneralDynamic+0x8>
                        8: R_PPC64_REL24        __tls_get_addr
   c:   20 00 80 4e     blr

Does that give you what you need for the test?

Yes! That's the test I need.
Let me add that test and we can choose which one is preferred.

stefanp updated this revision to Diff 286328.Aug 18 2020, 10:03 AM

Added the new assembly test.

MaskRay added inline comments.Aug 18 2020, 12:08 PM
lld/test/ELF/ppc64-error-missing-tlsgdld.s
2 ↗(On Diff #286328)

The assembly file looks better. Consider naming the test so that it shares a longer prefix with related tests (ppc64-tls-*.s): ppc64-tls-missing-gdld.s

stefanp updated this revision to Diff 286375.Aug 18 2020, 12:39 PM

Renamed the assembly test file and removed the other test file.

stefanp updated this revision to Diff 286616.Aug 19 2020, 11:21 AM

Fix some formatting.
Fix a couple of nits.
Add the --triple to the test case.

MaskRay added inline comments.Aug 19 2020, 2:44 PM
lld/ELF/Relocations.cpp
1311

I still think this is a diagnostic of lower value, so I'd hope we can spend less code on this (when it is no longer needed, delete it).

The comments are too verbose in my opinion. We have explained __tls_get_addr is used by General Dynamic/Local Dynamic TLS models, so there is no point spending more sentences on it. Suggest:

// For a call to __tls_get_addr, the instruction needs to relocated by two relocations, R_PPC64_TLSGD/R_PPC64_TLSLD and R_PPC64_REL24[_NOTOC]

I am not sure it is worth checking the relocation at i - 2.

lld/test/ELF/ppc64-tls-missing-gdld.s
1

REQUIRES:

stefanp updated this revision to Diff 286776.Aug 20 2020, 4:19 AM

Updated the comment to make it more concise.
Updated the test case to add the missing : .

lld/ELF/Relocations.cpp
1311

I can certainly reduce the comment.

However, checking i - 2 is the whole point of this check. The function is passed in i that is where rel comes from. We then do and i++ on line 1277 and so in my code I need to use i - 1 to get back to the original i. Once I have checked for R_PPC64_REL24[_NOTOC] and the name of the function I need to go back one step (to i - 2) to check and see if we really have a missing R_PPC64_TLSGD/R_PPC64_TLSLD.

The code and comment can be further reduced. Other than that, this mostly looks good.

lld/ELF/Relocations.cpp
1315

You may reduce the comments by adding one sentence to the previous comment:

// ... , preceded by a R_PPC64_TLSGD/R_PPC64_TLSLD

Then the comment below can be dropped as well.

1318

The two errorOrWarn can be combined.

bool err = i - start < 2;
if (!err) {
  const RelTy &prevRel = *(i - 2);
  if (...)
    err = true;
}
if (err)
  errOrWarn
stefanp updated this revision to Diff 286995.Aug 21 2020, 3:46 AM

Adjusted one comment and moved the code around to merge the two errorOrWarn.

MaskRay accepted this revision.Aug 21 2020, 9:19 AM

LGTM.

lld/ELF/Relocations.cpp
1329

LLD's diagnostics follow this style http://llvm.org/docs/CodingStandards.html#error-and-warning-messages : No capitalization. No full stop.

This revision is now accepted and ready to land.Aug 21 2020, 9:19 AM
MaskRay added inline comments.Aug 21 2020, 9:21 AM
lld/ELF/Relocations.cpp
1315

This comment can be merged into the previous one.

MaskRay added a comment.EditedOct 23 2020, 9:00 AM

First, if this issue is not common enough, I'd rather we don't have a diagnostic at all.

If we do need it, note that *(i - 2); may be out of bounds.

Last, we should avoid pre-built object files. You can synthesize them with yaml2obj. The preferred style is llvm/test/tools/llvm-readobj/ELF/*.test. lld/test/ELF has some yaml2obj tests as well.

@stefanp The error turns out to be a problem when building glibc.

elf/dl-sym.c has a raw __tls_get_addr call (it sets up the parameter by itself), without R_PPC64_TLSGD/R_PPC64_TLSLD (similar to CallOnly). Shall we drop the logic? I believe GNU ld does not have the logic.

/* Return the symbol address given the map of the module it is in and
   the symbol record.  This is used in dl-sym.c.  */
static void *
_dl_tls_symaddr (struct link_map *map, const ElfW(Sym) *ref)
{
# ifndef DONT_USE_TLS_INDEX
  tls_index tmp =
    {
      .ti_module = map->l_tls_modid,
      .ti_offset = ref->st_value
    };

  return __TLS_GET_ADDR (&tmp);
# else
  return __TLS_GET_ADDR (map->l_tls_modid, ref->st_value);
# endif
}
#endif
stefanp added a comment.EditedOct 23 2020, 10:19 AM

First, if this issue is not common enough, I'd rather we don't have a diagnostic at all.

If we do need it, note that *(i - 2); may be out of bounds.

Last, we should avoid pre-built object files. You can synthesize them with yaml2obj. The preferred style is llvm/test/tools/llvm-readobj/ELF/*.test. lld/test/ELF has some yaml2obj tests as well.

@stefanp The error turns out to be a problem when building glibc.

elf/dl-sym.c has a raw __tls_get_addr call (it sets up the parameter by itself), without R_PPC64_TLSGD/R_PPC64_TLSLD (similar to CallOnly). Shall we drop the logic? I believe GNU ld does not have the logic.

/* Return the symbol address given the map of the module it is in and
   the symbol record.  This is used in dl-sym.c.  */
static void *
_dl_tls_symaddr (struct link_map *map, const ElfW(Sym) *ref)
{
# ifndef DONT_USE_TLS_INDEX
  tls_index tmp =
    {
      .ti_module = map->l_tls_modid,
      .ti_offset = ref->st_value
    };

  return __TLS_GET_ADDR (&tmp);
# else
  return __TLS_GET_ADDR (map->l_tls_modid, ref->st_value);
# endif
}
#endif

I didn't realize that this was possible.
The problem is that now we can get silently bad code. But I guess it's the user that needs to watch out for problems.
Sure, I agree, we can get rid of this condition.

@MaskRay
Would you like me to just revert this change? Do you want to see something on Phabricator for it?

@MaskRay
Would you like me to just revert this change? Do you want to see something on Phabricator for it?

I have prepared a partial revert, as the test is still useful. I can do it if you allow:)

@MaskRay
Would you like me to just revert this change? Do you want to see something on Phabricator for it?

I have prepared a partial revert, as the test is still useful. I can do it if you allow:)

Sounds good to me. Go for it!

@MaskRay
Sorry... can you pause on this for a second. I want to talk to the glibc guys about this first.

@MaskRay
Sorry... can you pause on this for a second. I want to talk to the glibc guys about this first.

@stefanp I already did it... I think this is a problem for many ld.so implementations. For example, musl has

if ((def.sym->st_info&0xf) == STT_TLS)
         return __tls_get_addr((tls_mod_off_t []){def.dso->tls_id, def.sym->st_value-DTP_OFFSET});

It was broken as well...

@MaskRay
Sorry... can you pause on this for a second. I want to talk to the glibc guys about this first.

@stefanp I already did it... I think this is a problem for many ld.so implementations. For example, musl has

if ((def.sym->st_info&0xf) == STT_TLS)
         return __tls_get_addr((tls_mod_off_t []){def.dso->tls_id, def.sym->st_value-DTP_OFFSET});

It was broken as well...

Can you give me more information about one of these examples?
I want to try to reproduce this myself and I want to talk to some people on glibc as well.
I'm affraid that it is possible to get incorrect code gen in these situations and so I want to try it out myself.

So, what glibc source did you use and how did you compile it?

MaskRay added a comment.EditedOct 23 2020, 11:47 AM

@MaskRay
Sorry... can you pause on this for a second. I want to talk to the glibc guys about this first.

@stefanp I already did it... I think this is a problem for many ld.so implementations. For example, musl has

if ((def.sym->st_info&0xf) == STT_TLS)
         return __tls_get_addr((tls_mod_off_t []){def.dso->tls_id, def.sym->st_value-DTP_OFFSET});

It was broken as well...

Can you give me more information about one of these examples?
I want to try to reproduce this myself and I want to talk to some people on glibc as well.
I'm affraid that it is possible to get incorrect code gen in these situations and so I want to try it out myself.

So, what glibc source did you use and how did you compile it?

The simplest example is:

void __tls_get_addr(void *);

void foo(void *x) {
  __tls_get_addr(x);
}

dlsym implementations use this snippet to resolve the address of a STT_TLS symbol. You could add a !config->shared check but I would be unsure about the whole stuff. The symbol name check is expensive as it applies to every R_PPC64_REL24/R_PPC64_REL24_NOTOC. I think the signal-to-noise ratio for this diagnostic is now low.


I got a confirmation from a FreeBSD developer that this rigid check broke FreeBSD ld.so as well:

19:19 <@Bdragon> --- ld-elf.so.1.full ---
19:19 <@Bdragon> ld: error: call to __tls_get_addr is missing a R_PPC64_TLSGD/R_PPC64_TLSLD relocation
19:19 <@Bdragon> >>> defined in reloc.o
19:19 <@Bdragon> >>> referenced by rtld.c:3722 (/usr/src/libexec/rtld-elf/rtld.c:3722)
19:19 <@Bdragon> >>>               rtld.o:(do_dlsym)
19:19 <@Bdragon> --- all_subdir_usr.bin ---

@MaskRay
Sorry... can you pause on this for a second. I want to talk to the glibc guys about this first.

@stefanp I already did it... I think this is a problem for many ld.so implementations. For example, musl has

if ((def.sym->st_info&0xf) == STT_TLS)
         return __tls_get_addr((tls_mod_off_t []){def.dso->tls_id, def.sym->st_value-DTP_OFFSET});

It was broken as well...

Can you give me more information about one of these examples?
I want to try to reproduce this myself and I want to talk to some people on glibc as well.
I'm affraid that it is possible to get incorrect code gen in these situations and so I want to try it out myself.

So, what glibc source did you use and how did you compile it?

This comes from Google's version of glibc, which is the google/grte/v5-2.27/master branch of glibc at sourceware.org. The checkout has a README.grte that tells how to build with clang (since glibc head is still gcc-only), a native ppc configure/make should fail with this error.