This is an archive of the discontinued LLVM Phabricator instance.

[ELF][PPC64] Improve "call lacks nop" diagnostic and make it compatible with GCC<5.5 and GCC<6.4
ClosedPublic

Authored by MaskRay on Dec 17 2019, 5:04 PM.

Details

Summary

GCC before r245813 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79439)
did not emit nop after b/bl. This can happen with recursive calls.
r245813 was back ported to GCC 5.5 and GCC 6.4.

This is common, for example, libstdc++.a(locale.o) shipped with GCC 4.9
and many objects in netlib lapack can cause lld to error. gold allows
such calls to the same section. Our __plt_foo symbol's section field
is used for ThunkSection, so we can't implement a similar loosen rule
easily. But we can make use of its file field which is currently NULL.

Diff Detail

Event Timeline

MaskRay created this revision.Dec 17 2019, 5:04 PM
Herald added a project: Restricted Project. · View Herald Transcript
MaskRay added a reviewer: Restricted Project.Dec 17 2019, 5:19 PM

🤖 🤖 🤖

MaskRay updated this revision to Diff 235479.Dec 28 2019, 9:16 AM
MaskRay retitled this revision from [ELF][PPC64] Improve "call lacks nop" diagnostic and make it compatible with GCC<5 to [ELF][PPC64] Improve "call lacks nop" diagnostic and make it compatible with GCC<5.5 and GCC<6.4.
MaskRay edited the summary of this revision. (Show Details)
MaskRay updated this revision to Diff 235549.Dec 29 2019, 11:04 PM

Add more comments

This revision was not accepted when it landed; it landed in state Needs Review.Dec 29 2019, 11:12 PM
This revision was automatically updated to reflect the committed changes.

Relaxing the check will allow those old miss-compiled objects to link, but we are loosing some utility here as well: we are turning a valid link time failure into a potential runtime failure. Even worse a runtime failure that depends on the environment at load time so the failure is not consistent. Could we guard the error relaxation with a specific option so that 1) It must be opted into specifically and 2) We can eventually phase the option out once the OS that used gcc 5.* and 6.* go out of service?

MaskRay added a comment.EditedJan 2 2020, 12:42 PM

Sorry for my being rechless...

Relaxing the check will allow those old miss-compiled objects to link, but we are loosing some utility here as well: we are turning a valid link time failure into a potential runtime failure. Even worse a runtime failure that depends on the environment at load time so the failure is not consistent. Could we guard the error relaxation with a specific option so that 1) It must be opted into specifically and 2) We can eventually phase the option out once the OS that used gcc 5.* and 6.* go out of service?

I think gold does a very similar thing here. Now the question is whether we consider this diagnostic helpful...

Let me contribute one data point: It seems that now this error only flags hand-written assembly that does not have nop after bl. I have tested in our internal code base. There were many link failures due to precompiled libstdc++/Fortran LAPACK object files. With this fix, there is no link failure.

I can definitely see it from the opposite side: We want LLD to be usable on these systems that shipped with this bug in libstdc++ and gcc/gfortran. And adding a link option can be hard to do in some build setups so its not as simple as I make it sound. What worries me though is that one of the compilers could subtly break and with the linker[s] being sloppy with the error then we don't find out until long after the toolchain has shipped.

I think you are right that this will only be triggered on handwritten assembly right now. That is still useful even though not many people write assembly now days. Even experts can mess up, for example taking code written for an executable, then linking it into a shared object since it looks like its already position independent could trigger this error because a tail-call that was valid in an exec isn't in the shared object. These kinds of details aren't really documented anywhere. They are difficult to figure out with the error, but much much worse when you are diagnosing runtime corruptions.

I can definitely see it from the opposite side: We want LLD to be usable on these systems that shipped with this bug in libstdc++ and gcc/gfortran. And adding a link option can be hard to do in some build setups so its not as simple as I make it sound. What worries me though is that one of the compilers could subtly break and with the linker[s] being sloppy with the error then we don't find out until long after the toolchain has shipped.

I think you are right that this will only be triggered on handwritten assembly right now. That is still useful even though not many people write assembly now days. Even experts can mess up, for example taking code written for an executable, then linking it into a shared object since it looks like its already position independent could trigger this error because a tail-call that was valid in an exec isn't in the shared object. These kinds of details aren't really documented anywhere. They are difficult to figure out with the error, but much much worse when you are diagnosing runtime corruptions.

I am hesitant to call it a gcc/gfortran bug (that's why I don't use the term "bug" to describe the issue). The GCC report was originally motivated by a glibc use case that does dlopen RTLD_LOCAL. RTLD_LOCAL does not work in various situations and it just appears to work in some circumstances. In a C++ context it will likely be an one-definition-rule violation.

For the hand-written assembly case, to trigger an uncaught error after this patch, the code needs to have the following setting:

// a.so
bar:
  bl foo
  # no nop
  ...

.globl foo   # foo defined by a.so is preempted by the executable or another shared object at runtime
foo:
  ...

The symbol preemption here makes it not a likely scenario. If you still feel strong about this, I can add an option to warn such cases, and make it off by default (if it is on by default, Linux distributions will have to turn off this option. And in the future, when we remove the code, the distributions will have to delete it)

I agree that the runtime symbol preemption makes the failure unlikely. I disagree its not a bug though: it doesn't matter that FORTRAN and C++ have the 'one-definition-rule' because the generated code isn't ABI compliant. A compiler producing a valid C++/Fortran program that isn't ABI compliant is definitely a bug. The case I am much more concerned about is code like this:

    .abiversion 2

    .text
    .globl  Nop_Needed
    .p2align        2
    .type   Nop_Needed,@function
Nop_Needed:
.Lfunc_begin0:
    blr
    .long   0
    .quad   0
.Lfunc_end0:
    .size   Nop_Needed, .Lfunc_end0-.Lfunc_begin0

    .globl  No_Nop_Needed
    .p2align        2
    .type   No_Nop_Needed,@function
No_Nop_Needed:                          # @No_Nop_Needed
.Lfunc_begin1:
.Lfunc_gep1:
    addis 2, 12, .TOC.-.Lfunc_gep1@ha
    addi 2, 2, .TOC.-.Lfunc_gep1@l
.Lfunc_lep1:
    .localentry     No_Nop_Needed, .Lfunc_lep1-.Lfunc_gep1
    li 3, 0
    b Nop_Needed
    .long   0
    .quad   0
.Lfunc_end1:
    .size   No_Nop_Needed, .Lfunc_end1-.Lfunc_begin1

I realize now though that if you assemble this with clang instead of gas, we don't get the error because clang will not emit a relocation for b Nop_Needed. This is by design ( clang may have done IPO between the 2 function definitions) but breaks ELF shared-object interposition semantics :frown:. If assembled with gas, or if the functions are in different sections, then with both bfd and gold I get the call lacks nop, can't restore toc; error with linkers from binutils 2.26,2.27 and 2.29 and don't with LLD after your change.

I can add an option to warn such cases, and make it off by default (if it is on by default, Linux distributions will have to turn off this option. And in the future, when we remove the code, the distributions will have to delete it)

Why would Linux distributions have to turn it off? How many Linux distros were compiled with a gcc with this problem? RHEL 7.x will be affected but I think RHEL 8.x has fixed system compilers. Ubuntu and Debian should not need to relax the error either. (Jessie might have the problem but that goes out of service soon).

Is there a way to configure the toolchain driver to add the option when targeting/configuring for RHEL 7.x only? For example I think Ubuntu switched to always PIE with release 18.04, that must be done through the driver configuration. Could we do something similar?

Sorry for my being rechless...

Sorry I missed this earlier I should have got to reviewing it before Xmas break. FWIW I don't think your change is reckless. This is a huge problem for any of the distros affected and we do need to fix it, and relaxing the error to make LLD work on those systems is unavoidable. I feel strongly that the code given in my example should always be a fatal link time error. It can lead to a shared object that is corrupt when linked into one particular program but perfectly valid when linked into a different program. The case you presented I think can be relaxed although ideally we do it in a way where similar future codegen problems can't sneak in.

I agree that the runtime symbol preemption makes the failure unlikely. I disagree its not a bug though: it doesn't matter that FORTRAN and C++ have the 'one-definition-rule' because the generated code isn't ABI compliant. A compiler producing a valid C++/Fortran program that isn't ABI compliant is definitely a bug. The case I am much more concerned about is code like this:

    .abiversion 2

    .text
    .globl  Nop_Needed
    .p2align        2
    .type   Nop_Needed,@function
Nop_Needed:
.Lfunc_begin0:
    blr
    .long   0
    .quad   0
.Lfunc_end0:
    .size   Nop_Needed, .Lfunc_end0-.Lfunc_begin0

    .globl  No_Nop_Needed
    .p2align        2
    .type   No_Nop_Needed,@function
No_Nop_Needed:                          # @No_Nop_Needed
.Lfunc_begin1:
.Lfunc_gep1:
    addis 2, 12, .TOC.-.Lfunc_gep1@ha
    addi 2, 2, .TOC.-.Lfunc_gep1@l
.Lfunc_lep1:
    .localentry     No_Nop_Needed, .Lfunc_lep1-.Lfunc_gep1
    li 3, 0
    b Nop_Needed
    .long   0
    .quad   0
.Lfunc_end1:
    .size   No_Nop_Needed, .Lfunc_end1-.Lfunc_begin1

I realize now though that if you assemble this with clang instead of gas, we don't get the error because clang will not emit a relocation for b Nop_Needed. This is by design ( clang may have done IPO between the 2 function definitions) but breaks ELF shared-object interposition semantics :frown:. If assembled with gas, or if the functions are in different sections, then with both bfd and gold I get the call lacks nop, can't restore toc; error with linkers from binutils 2.26,2.27 and 2.29 and don't with LLD after your change.

I can add an option to warn such cases, and make it off by default (if it is on by default, Linux distributions will have to turn off this option. And in the future, when we remove the code, the distributions will have to delete it)

Why would Linux distributions have to turn it off? How many Linux distros were compiled with a gcc with this problem? RHEL 7.x will be affected but I think RHEL 8.x has fixed system compilers. Ubuntu and Debian should not need to relax the error either. (Jessie might have the problem but that goes out of service soon).

Is there a way to configure the toolchain driver to add the option when targeting/configuring for RHEL 7.x only? For example I think Ubuntu switched to always PIE with release 18.04, that must be done through the driver configuration. Could we do something similar?

Sorry for my being rechless...

Sorry I missed this earlier I should have got to reviewing it before Xmas break. FWIW I don't think your change is reckless. This is a huge problem for any of the distros affected and we do need to fix it, and relaxing the error to make LLD work on those systems is unavoidable. I feel strongly that the code given in my example should always be a fatal link time error. It can lead to a shared object that is corrupt when linked into one particular program but perfectly valid when linked into a different program. The case you presented I think can be relaxed although ideally we do it in a way where similar future codegen problems can't sneak in.

I am still hesitant to call any ABI non-conformance a bug. It is not uncommon that implementations differ a bit from ABI and ABI may describe something that is unnecessary. I agree that for the problem in question, the ABI is correct, and GCC<5.4 and GCC<6.4 had a problem with symbol preemption under RTLD_LOCAL.

Created D72183 to add --lax-call-lacks-nop. As you observed (b Nop_Needed), ppc64-error-toc-tail-call.s does not break when I make the suppresion opt-in (via --lax-call-lacks-nop). I'm investigating the assembler bug.