Page MenuHomePhabricator

New clang option -fno-plt to avoid PLT for external calls
ClosedPublic

Authored by tmsriram on Oct 18 2017, 9:47 PM.

Details

Summary

New clang option -fno-plt which avoids the PLT and lazy binding while making external calls.

GCC supports -fno-plt, https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00001.html. This patch adds this to clang which marks all externally defined functions with the "nolazybind" attribute. This LLVM patch skips the PLT for calls to these functions, https://reviews.llvm.org/D39065.

Diff Detail

Repository
rL LLVM

Event Timeline

tmsriram created this revision.Oct 18 2017, 9:47 PM
rnk edited edge metadata.Oct 23 2017, 1:56 PM

No tests?

+1, there should be an -emit-llvm test in clang/test/CodeGen/.

lib/CodeGen/CGCall.cpp
1859 ↗(On Diff #119543)

Remind me what happens when the definition is within the current DSO after linking. I seem to recall that the call through memory is 6 bytes and the direct pcrel call is 5 bytes, and the linker is required to know to rewrite the indirect call to a nop. Is that accurate?

tmsriram updated this revision to Diff 119950.Oct 23 2017, 3:35 PM

Added test test/CodeGen/noplt.c

tmsriram added inline comments.Oct 23 2017, 3:38 PM
lib/CodeGen/CGCall.cpp
1859 ↗(On Diff #119543)

That's accurate, the linker rewrites it with a nop equivalent.

rnk accepted this revision.Oct 23 2017, 3:44 PM

lgtm

This revision is now accepted and ready to land.Oct 23 2017, 3:44 PM

Noting that, as @vit9696 pointed out in D38554, this does not suppress uses of the PLT that occur from backend/optimizer-generated functions (e.g., calls into compiler-rt and similar).

Noting that, as @vit9696 pointed out in D38554, this does not suppress uses of the PLT that occur from backend/optimizer-generated functions (e.g., calls into compiler-rt and similar).

Can I work on this as a follow-up?

joerg added a subscriber: joerg.Oct 24 2017, 10:13 AM

Why again is this a good idea? This is an even worse hack than -Bsymbolic, the latter at least is visible in ELF header without code inspection. This is breaking core premises of ELF.

Why again is this a good idea? This is an even worse hack than -Bsymbolic, the latter at least is visible in ELF header without code inspection. This is breaking core premises of ELF.

Could you elaborate a bit more on what ELF promises this is breaking? I haven't fully read through D38554 yet.

Noting that, as @vit9696 pointed out in D38554, this does not suppress uses of the PLT that occur from backend/optimizer-generated functions (e.g., calls into compiler-rt and similar).

Can I work on this as a follow-up?

I have no objection to that, although adding a mechanism to fix this (which I imagine would be an attribute tied to the caller, not the callee, would probably end up replacing this mechanism).

Let me phrase it differently. What is this patch (and the matching backend PR) supposed to achieve? There are effectively two ways to get rid of PLT entries:
(1) Bind references locally. This is effectively what -Bsymbolic does and what is breaking the ELF interposition rules.
(2) Do an indirect call via the GOT. Requires knowing what an external symbol is, making it non-attractive for anything but LTO, since it will create performance issues for all non-local accesses (i.e. anything private).

rnk added a comment.Oct 24 2017, 10:26 AM

Why again is this a good idea?

It saves the direct jump to the PLT, reducing icache pressure, which is a major cost in some workloads.

This is an even worse hack than -Bsymbolic,

Personally, I would like to build LLVM with -Bsymbolic so that we can build LLVM as a DSO and load it from clang without regressing startup time, so I don't see what's so terrible about -Bsymbolic, especially for C++ programs.

the latter at least is visible in ELF header without code inspection. This is breaking core premises of ELF.

What are you talking about?

Anyway, LLVM already has an attribute, nonlazybind, and this just provides a flag to apply it to all declarations. It gives the user access to the GOTPCREL relocations that we, and loaders, already support.

rnk added a comment.Oct 24 2017, 10:31 AM

Let me phrase it differently. What is this patch (and the matching backend PR) supposed to achieve? There are effectively two ways to get rid of PLT entries:
(1) Bind references locally. This is effectively what -Bsymbolic does and what is breaking the ELF interposition rules.
(2) Do an indirect call via the GOT. Requires knowing what an external symbol is, making it non-attractive for anything but LTO, since it will create performance issues for all non-local accesses (i.e. anything private).

This patch does 2. According to @tmsriram, clever linkers can turn the indirect call back into a nop+call_pcrel32. If this isn't universal, the user must know what their linker supports. I don't see how it causes performance issues for non-local calls, since the PLT will do a jump through the GOT anyway.

what is breaking the ELF interposition rules

Frankly, in retrospect, ELF's interposition rules (or at least the defaults for those rules), seem suboptimal on several fronts. While breaking them has some unfortunate consistency implications, I don't consider breaking them a bad thing.

In D39079#905396, @rnk wrote:

Why again is this a good idea?

It saves the direct jump to the PLT, reducing icache pressure, which is a major cost in some workloads.

It also increases the pressure on the branch predictor, so it is not really black and white.

This is an even worse hack than -Bsymbolic,

Personally, I would like to build LLVM with -Bsymbolic so that we can build LLVM as a DSO and load it from clang without regressing startup time, so I don't see what's so terrible about -Bsymbolic, especially for C++ programs.

Qt5 tries that. Requires further hacks as the main binary must be compiled as fully position independent code to not run into fun latter. Fun with copy relocations is only part of it.

Anyway, LLVM already has an attribute, nonlazybind, and this just provides a flag to apply it to all declarations. It gives the user access to the GOTPCREL relocations that we, and loaders, already support.

The loader doesn't see GOTPCREL anymore. It also requires a linker that disassembles instructions, because it can't distinguish between a normal pointer load and a call, to be able to optimize it.

In D39079#905423, @rnk wrote:

Let me phrase it differently. What is this patch (and the matching backend PR) supposed to achieve? There are effectively two ways to get rid of PLT entries:
(1) Bind references locally. This is effectively what -Bsymbolic does and what is breaking the ELF interposition rules.
(2) Do an indirect call via the GOT. Requires knowing what an external symbol is, making it non-attractive for anything but LTO, since it will create performance issues for all non-local accesses (i.e. anything private).

This patch does 2. According to @tmsriram, clever linkers can turn the indirect call back into a nop+call_pcrel32. If this isn't universal, the user must know what their linker supports. I don't see how it causes performance issues for non-local calls, since the PLT will do a jump through the GOT anyway.

Yes, please see this for GOLD linkers: https://sourceware.org/ml/binutils/2016-05/msg00322.html

In D39079#905396, @rnk wrote:

Why again is this a good idea?

It saves the direct jump to the PLT, reducing icache pressure, which is a major cost in some workloads.

It also increases the pressure on the branch predictor, so it is not really black and white.

My experiments show that doing this improves performance of some our large workloads by upto 1% and it happens with a reduction in iTLB misses.

This is an even worse hack than -Bsymbolic,

Personally, I would like to build LLVM with -Bsymbolic so that we can build LLVM as a DSO and load it from clang without regressing startup time, so I don't see what's so terrible about -Bsymbolic, especially for C++ programs.

Qt5 tries that. Requires further hacks as the main binary must be compiled as fully position independent code to not run into fun latter. Fun with copy relocations is only part of it.

Anyway, LLVM already has an attribute, nonlazybind, and this just provides a flag to apply it to all declarations. It gives the user access to the GOTPCREL relocations that we, and loaders, already support.

The loader doesn't see GOTPCREL anymore. It also requires a linker that disassembles instructions, because it can't distinguish between a normal pointer load and a call, to be able to optimize it.

The linker can replace indirect calls via GOTPCREL with direct calls, both GOLD and BFD linker support this today.

rnk added a comment.Oct 24 2017, 11:02 AM

It also increases the pressure on the branch predictor, so it is not really black and white.

I don't understand this objection. I'm assuming that the PLT stub is an indirect jump through the PLTGOT, not a hotpatched stub that jumps directly to the definition chosen by the loader. This is the ELF model that I'm familiar with, especially since calls to code more than 2GB away generally need to be indirect anyway.

Qt5 tries that. Requires further hacks as the main binary must be compiled as fully position independent code to not run into fun latter. Fun with copy relocations is only part of it.

I'm not sure I understand, but this patch isn't introducing copy relocations, to be clear.

The loader doesn't see GOTPCREL anymore. It also requires a linker that disassembles instructions, because it can't distinguish between a normal pointer load and a call, to be able to optimize it.

Well, yes. The user needs to know that they have an x86-encoding-aware linker, or using this flag is probably going to slow their code down. From my perspective, this is a performance tuning flag, so that's reasonable.

In D39079#905468, @rnk wrote:

It also increases the pressure on the branch predictor, so it is not really black and white.

I don't understand this objection. I'm assuming that the PLT stub is an indirect jump through the PLTGOT, not a hotpatched stub that jumps directly to the definition chosen by the loader. This is the ELF model that I'm familiar with, especially since calls to code more than 2GB away generally need to be indirect anyway.

Yes, this is correct. A PLT stub for x86_64 looks like this:

jmpq   *0x2ada(%rip)        # 403000 <_GLOBAL_OFFSET_TABLE_+0x18>
pushq  $0x0
jmpq   400510 <_init+0x30>

It has three instructions and the last two are only useful if lazy binding is done. With early binding, the last two instructions is dead code. What this patch does is to take that first instruction and put it at the point where the call is made to the PLT, that's it. Really, with early binding, the PLT stub is a completely redundant piece of code. I can't see how you argue with this.

Qt5 tries that. Requires further hacks as the main binary must be compiled as fully position independent code to not run into fun latter. Fun with copy relocations is only part of it.

I'm not sure I understand, but this patch isn't introducing copy relocations, to be clear.

The loader doesn't see GOTPCREL anymore. It also requires a linker that disassembles instructions, because it can't distinguish between a normal pointer load and a call, to be able to optimize it.

Well, yes. The user needs to know that they have an x86-encoding-aware linker, or using this flag is probably going to slow their code down. From my perspective, this is a performance tuning flag, so that's reasonable.

In D39079#905468, @rnk wrote:

It also increases the pressure on the branch predictor, so it is not really black and white.

I don't understand this objection. I'm assuming that the PLT stub is an indirect jump through the PLTGOT,
not a hotpatched stub that jumps directly to the definition chosen by the loader. This is the ELF model
that I'm familiar with, especially since calls to code more than 2GB away generally need to be indirect anyway.

Correct, so all local calls to the same function go via the same location and share the predication of the indirect jump.

Qt5 tries that. Requires further hacks as the main binary must be compiled as fully position independent code to not run into fun latter. Fun with copy relocations is only part of it.

I'm not sure I understand, but this patch isn't introducing copy relocations, to be clear.

That was in reference to using it for clang.

In D39079#905468, @rnk wrote:

It also increases the pressure on the branch predictor, so it is not really black and white.

I don't understand this objection. I'm assuming that the PLT stub is an indirect jump through the PLTGOT,
not a hotpatched stub that jumps directly to the definition chosen by the loader. This is the ELF model
that I'm familiar with, especially since calls to code more than 2GB away generally need to be indirect anyway.

Correct, so all local calls to the same function go via the same location and share the predication of the indirect jump.

Weigh this against a .plt that is far away from the actual calls, and is causing itlb pressure and is clearly improving itlb behavior by eliminating it on our larger workloads. This is an optimization feature and if this does not improve performance in your case, do not enable it.

Qt5 tries that. Requires further hacks as the main binary must be compiled as fully position independent code to not run into fun latter. Fun with copy relocations is only part of it.

I'm not sure I understand, but this patch isn't introducing copy relocations, to be clear.

That was in reference to using it for clang.

rnk added a reviewer: ruiu.Oct 25 2017, 12:38 PM
rnk removed a reviewer: ruiu.
rnk added a subscriber: ruiu.

Ping. How do we take this forward?

rnk added a comment.Oct 31 2017, 11:42 AM

I'd also like to highlight that LLVM has many users with different needs. Sometimes we need to "disagree and commit". Flags are one of the primary ways that we have to do that.

This revision was automatically updated to reflect the committed changes.