This is an archive of the discontinued LLVM Phabricator instance.

[ELF][MIPS] Reduce number of redundant entries in the local part of MIPS GOT
ClosedPublic

Authored by atanasyan on Mar 22 2016, 5:59 AM.

Details

Summary

Local symbol which requires GOT entry initialized by "page" address. This address is high 16 bits of sum of the symbol value and the relocation addend. In the relocation scanning phase final values of symbols are unknown so to reduce number of allocated GOT entries do the following trick. Save all output sections referenced by GOT relocations during the relocation scanning phase. Then later in the GotSection::finalize method calculate number of "pages" required to cover all saved output sections and allocate appropriate number of GOT entries. We assume the worst case - each 64kb page of the output section has at least one GOT relocation against it.

Diff Detail

Repository
rL LLVM

Event Timeline

atanasyan updated this revision to Diff 51277.Mar 22 2016, 5:59 AM
atanasyan retitled this revision from to [ELF][MIPS] Reduce number of redundant entries in the local part of MIPS GOT.
atanasyan updated this object.
atanasyan added reviewers: ruiu, rafael.
atanasyan set the repository for this revision to rL LLVM.
atanasyan added a project: lld.
atanasyan added a subscriber: llvm-commits.
rafael edited edge metadata.Mar 22 2016, 8:21 AM
rafael added a subscriber: rafael.

That is still more than needed, no?

A single symbol pointing to a 8k page would create 2 got entries but
only needs one. Is that correct?

Cheers,
Rafael

That is still more than needed, no?

Yes. If symbols point only to the thirst 64kb page of large output section we allocate "page" entries covers the section completely.

A single symbol pointing to a 8k page would create 2 got entries but
only needs one. Is that correct?

I do not consider local symbols individually at all. If a single symbol pointing to a 8k page we get one redundant entry. It is not good but acceptable. If we have more than one symbol pointing to the adjacent addresses, exact number of required GOT entries depends on the final values of these addresses. In the best case they all belong to the same final 64kb page and have the same high 16 bits. In the worst case we will need two page addresses. But to get optimal result we will have to calculate number of required GOT entries after all output section will be finalized and aligned.

BTW, what result do you get with gold/bfd?

Cheers,
Rafael

I wonder if with SymbolBodies being allocated to local symbols if we
can compute the exact answer.

Would this work?:

  • Represent the .got section with just a vector of symbol bodies.
  • Make sure .got is placed after all sections that a got entry can point to.
  • Layout all sections that go before got.
  • We can now compute the values of the symbols.

There are two problems in this approach (though it allows to get ideal result):

  • We need to create a local symbol body for each unique combination of local symbol referenced by GOT relocation and addend. Number of such combinations might be very large. And we need to read relocation addend in relocation scanning phase.
  • For some end-users like embedded developers it might be essential to be able to place sections in specific order.

BTW, what result do you get with gold/bfd?

Almost all test cases from LLVM Test suite show similar results. For example (size of .got section):
7zip-benchmark: BFD: 0x17a4 LLD: 0x177c
tramp3d-v4: BFD: 0x11a4 LLD: 0x11b8

There are couple exceptions mafft/pairlocalalign and NPB-serial/is/is. These executables have large .bss section and the patch produces the GOT bigger than necessary. But anyway it is better than current trunk implementation.

atanasyan updated this revision to Diff 51392.Mar 23 2016, 2:41 AM
atanasyan edited edge metadata.

Rebased the patch.

There are two problems in this approach (though it allows to get ideal result):

  • We need to create a local symbol body for each unique combination of local symbol referenced by GOT relocation and addend. Number of such combinations might be very large. And we need to read relocation addend in relocation scanning phase.

Well, it is one SymbolBody,Addend pair per relocation.

  • For some end-users like embedded developers it might be essential to be able to place sections in specific order.

BTW, what result do you get with gold/bfd?

Almost all test cases from LLVM Test suite show similar results. For example (size of .got section):
7zip-benchmark: BFD: 0x17a4 LLD: 0x177c
tramp3d-v4: BFD: 0x11a4 LLD: 0x11b8

There are couple exceptions mafft/pairlocalalign and NPB-serial/is/is. These executables have large .bss section and the patch produces the GOT bigger than necessary. But anyway it is better than current trunk implementation.

I guess the question is more: do you know in which cases they produce
a better result?

Sending a few more comments via Phab.

Cheers,
Rafael

OK, I think this is probably fine.

If we want to compute a better result in the future and still support got pointing forward in the file I think we can do:

  • Delay the scan just a bit so we know the position of each symbol in its output section.
  • Compute the offset in the output section that each got entry will have
  • Use that to compute an upper bound on the number of entries.

Can you just upload a new version with the inline comments fixed?

ELF/OutputSections.cpp
209 ↗(On Diff #51392)

The description is not exactly true.

You are computing an upper bound, not the exact number. Given that, you are not assuming that the "relocations are spread" uniformly. It is just that the worst case is for every page in the section to have a relocation pointing at it.

ELF/Writer.cpp
1096 ↗(On Diff #51392)

Do you have a testcase for this? If not, could you leave it for a follow up patch?

Almost all test cases from LLVM Test suite show similar results. For example (size of .got section):
7zip-benchmark: BFD: 0x17a4 LLD: 0x177c
tramp3d-v4: BFD: 0x11a4 LLD: 0x11b8

There are couple exceptions mafft/pairlocalalign and NPB-serial/is/is. These executables have large .bss section and the patch produces the GOT bigger than necessary. But anyway it is better than current trunk implementation.

I guess the question is more: do you know in which cases they produce
a better result?

Suppose that you have a large output section. All relocations against this section target only a small range of addresses in this section. In that case bfd and gold generates only a few GOT entries. This patch in contrast generates GOT entry for each 64kb of the output section.

atanasyan updated this revision to Diff 51883.Mar 29 2016, 2:55 AM
atanasyan updated this object.
  • Fixed the comment describes how we count the number of required GOT entries.
  • Removed changes from Writer.cpp. I reviewed this code once again and realized that the changes can be rolled back. The only section (except .dynamic, .got.plt etc) which can modify its sh_size in the finalize method is the MergeOutputSection. But such sections always go before the .got section in the OutputSections container because we support only read-only merge sections and read-only sections go before writable ones. So it is does not necessary to postpone finalizing of the .got section. It would be nice to have an assert to check that fact but I cannot find how to do that without significant modification of the code.
rafael accepted this revision.Mar 29 2016, 6:26 AM
rafael edited edge metadata.

LGTM. Thanks.

This revision is now accepted and ready to land.Mar 29 2016, 6:26 AM

Thanks for review.

This revision was automatically updated to reflect the committed changes.