This is an archive of the discontinued LLVM Phabricator instance.

[ELF, draft] - Combine GOTPLT and GOT slots.
AbandonedPublic

Authored by grimar on Aug 31 2017, 7:01 AM.

Details

Reviewers
ruiu
rafael
Summary

This is the PR27972 and PR32938.
I am posting it to show problems I faced when tried
to support this optimization and have some discussion
about it. Main question is do we want to support this optimization ?
And how to deal with 2, 3 paragraphs from below if yes,

x86_64 ABI (section B.1) tolds that when there are both
GOT and PLT references to the same symbol, normally linker
creates GOTPLT entry and GOT entry. Two dynamic relocations:
JUMP_SLOT and GLOB_DAT serves to handle things. That is
what LLD already implemented.

As optimization, linker may skip creating GOTPLT entry and create
special PLT entry that will use GOT instead. That allows to use single
GLOB_DAT dynamic relocation. Also since PLT entry is special,
it can be 8 bytes only and ABI suggests to use separate section for that.

Patch do next things:

  1. It introduces .plt.got section. Name is consistent with bfd.

Section keeps special 8 PLT bytes entries with jump instruction which
uses address from GOT as destination. It was possible to use regular
.plt section (16 bytes for x86_64), but it would be suboptimal and probably not clean
from code POV.

  1. When scanning relocations, new logic does not create got.plt entry if it is known

that symbol has got entry already. But not vise-versa. So currently it will optimize following code
correctly:

movq  foo@GOTPCREL(%rip), %rax
callq foo@PLT

but not:

callq foo@PLT
movq  foo@GOTPCREL(%rip), %rax

I thought about how to implement both.
I think it is possible if we delay creating plt entries until all relocations are
scanned. Then we will know if symbol uses got and so can avoid creation of .got.plt for it.
It should not be hard, but to keep patch cleaner, smaller and simpler I did not do that in draft.
Not sure what is better way to do that.

  1. ABI says that optimization must be avoided if pointer equality is needed.

Looks it is possible to support if we scan relocations for checking that somehow.
That is what bfd do I believe. Not sure what is correct/best way to check that equality is needed ?
It may bring additional complication. Or should we disable this relaxation by default ?

Diff Detail

Event Timeline

grimar created this revision.Aug 31 2017, 7:01 AM
ruiu edited edge metadata.Aug 31 2017, 2:08 PM

When you attempt to implement a documented ABI, please include a reference to the ABI. It seems you are implementing this: https://github.com/hjl-tools/x86-psABI/blob/68edce4f22070cc83ebc4a5df4b74222300dd24d/linker-optimization.tex

So, IIUC, this optimization aims to reduce the number of relocations for an interruptible function from 2 to 1. I'm skeptical if that makes some noticeable difference in performance because .got.plt isn't usually that big, and processing relocations is usually fast. IIRC, in lld (modulo the difference between static and dynamic linkers), we can process millions of relocations in a few seconds, so that's not slow at least.

We don't have to implement all optimizations that the ABI defines. We instead want to implement some of them that have non-marginal performance improvement. You presumably could see a difference in performance in a benchmark for the sake of benchmark, but my feeling is that I don't want to implement it unless someone comes to us with some numbers such as "we want to use this optimization because it makes our application starts up XX% faster" or something like that.

So, what is the motivation of implementing it? If it is because the ABI defines it, I don't think I'm convinced that we want it.

grimar added a comment.Sep 1 2017, 2:40 AM
In D37333#858170, @ruiu wrote:

When you attempt to implement a documented ABI, please include a reference to the ABI. It seems you are implementing this: https://github.com/hjl-tools/x86-psABI/blob/68edce4f22070cc83ebc4a5df4b74222300dd24d/linker-optimization.tex

So, IIUC, this optimization aims to reduce the number of relocations for an interruptible function from 2 to 1. I'm skeptical if that makes some noticeable difference in performance because .got.plt isn't usually that big, and processing relocations is usually fast. IIRC, in lld (modulo the difference between static and dynamic linkers), we can process millions of relocations in a few seconds, so that's not slow at least.

We don't have to implement all optimizations that the ABI defines. We instead want to implement some of them that have non-marginal performance improvement. You presumably could see a difference in performance in a benchmark for the sake of benchmark, but my feeling is that I don't want to implement it unless someone comes to us with some numbers such as "we want to use this optimization because it makes our application starts up XX% faster" or something like that.

So, what is the motivation of implementing it? If it is because the ABI defines it, I don't think I'm convinced that we want it.

I am agree with you.
We had 2 open issues about this. In one of them Rafael also mentioned that it is not clear how much it is useful,
and I just wanted to check by myself how much hard to support it. If it would be few lines, we could implement it for feature completeness,
but when I faced problems mentioned in description of patch decided to tell about them first,
because I also suppose it is not very useful optimization and given the amount of changes it requires I would not
implement it.

So I suggest to close this bug with explicit "Will not fix" statement and look at possible objections from users then.

grimar abandoned this revision.Sep 11 2017, 6:16 AM