This is an archive of the discontinued LLVM Phabricator instance.

Mitigate relocation overflow [part 1 of 2]
ClosedPublic

Authored by shenhan on Apr 18 2018, 3:22 PM.

Details

Summary

This CL is to mitigate R_X86_64_PC32 relocation overflow problems for huge binaries that has near 2G allocated sections.

By examining those binaries, I found these 2 issues contributes to the problem:
1). huge ".dynsym" and ".dynstr" stands in the way between .rodata and .text
2). _init_array_start/end are placed at 0 if no ".init_array" presents, this causes .text relocation against them become more prone to overflow.

This CL addresses 1st problem (the 2nd will be addressed in another CL.) by assigning a smaller sortrank to .dynsym and .dynstr thus they no longer stand in between

Note: this CL affects multiple tests, I include in this CL only 2 of many changes to the tests. And I'll provide the rest once this main part of this CL gets approved.

Diff Detail

Repository
rLLD LLVM Linker

Event Timeline

shenhan created this revision.Apr 18 2018, 3:22 PM
ruiu added inline comments.Apr 18 2018, 3:51 PM
ELF/Writer.cpp
721–722

Please expand the comment to explain that we want to put progbits sections closer so that relocations as an effort to prevent relocation overflow between these sections.

722

What is this isRelroSection condition for?

1694

Please use clang-format.

I think that Config->ImageBase is a better default value than an address of .text section because a section whose name is .text may not exist.

shenhan updated this revision to Diff 143114.Apr 19 2018, 10:06 AM
shenhan marked an inline comment as done.
shenhan added inline comments.
ELF/Writer.cpp
722

This is used to skip reordering for relro sections, because these to be put into RelRo segment must be contiguous, otherwise lld complains.

1694

Done clang-format.

In some cases, Config->ImageBase is too small, like I see 0x1000. And 0x1000 is still too far away from some .text sections.

ruiu added inline comments.Apr 19 2018, 10:23 AM
ELF/Writer.cpp
722

How did it work before this patch? I thought that all relro sections are progbits sections, so setting PF_PROGBITS to everybody doesn't change the existing behavior. It is confusing if you don't set RF_PROGBITS to everybody unconditionally.

1693–1697

Please find .text section outside of this lambda to call findSection only once.

p needs to be P by the LLVM coding style. But I prefer Text.

OutputSection* p -> OutputSection *P by the LLVM coding style.

Please look around to make sure your new code matches with other code in the same file in style.

shenhan updated this revision to Diff 143133.Apr 19 2018, 11:16 AM
shenhan marked an inline comment as done.
shenhan added inline comments.
ELF/Writer.cpp
722

Most of relro have PROGBITS, but ".dynamic" & ".tbss" don't. So the condition test is to keep those 2 sections contiguous with other relro sections.

ruiu added inline comments.Apr 19 2018, 4:13 PM
ELF/Writer.cpp
722

Doesn't that mean that the priority you gave to RF_NON_TLS_BSS_RO is not appropriate? If you give it lower priority than the RELRO priority (i.e. RF_NON_TLS_BSS_RO), it should automatically work, no?

grimar added a subscriber: grimar.Apr 23 2018, 3:26 AM
shenhan updated this revision to Diff 144180.Apr 26 2018, 12:49 PM
shenhan edited the summary of this revision. (Show Details)
shenhan added a reviewer: pcc.
shenhan updated this revision to Diff 144205.Apr 26 2018, 2:52 PM
shenhan edited the summary of this revision. (Show Details)
shenhan removed reviewers: espindola, pcc.
shenhan updated this revision to Diff 144208.Apr 26 2018, 2:58 PM
ruiu added inline comments.Apr 26 2018, 2:59 PM
ELF/Writer.cpp
1690

Please separate this part as a separate patch, as this patch contains two unrelated changes.

shenhan updated this revision to Diff 144224.Apr 26 2018, 3:37 PM
shenhan retitled this revision from Mitigate relocation overflow to Mitigate relocation overflow [part 1 of 2].
shenhan edited the summary of this revision. (Show Details)
shenhan marked an inline comment as done.
shenhan updated this revision to Diff 144229.Apr 26 2018, 3:39 PM

@espindola would you mind take a look?

Can you add a test case showing the difference in layout after this patch applied?

ELF/Writer.cpp
720

"ALLOC" -> SHF_ALLOC or "ALLOC" -> allocatable

725

Please be consistent. You mix ALLOC and "ALLOC". I would just use SHF_ALLOC everywhere.

shenhan updated this revision to Diff 144362.Apr 27 2018, 9:45 AM
shenhan marked an inline comment as done and an inline comment as not done.
shenhan marked an inline comment as done.

Thanks. Done with a test case.

Any further thoughts? If not, I'll proceed to fixing all the other test failures caused by section re-arrangement.

grimar added inline comments.May 1 2018, 11:14 AM
ELF/Writer.cpp
744

So, this places SHT_DYNSYM and SHT_SYMTAB sections first
in "a" sections group just because they do not get RF_NOT_TLS rank
(like other "a" sections) because of this early return, right?

Then it seems that behavior is based on an accident.
I would suggest adding new RF_* rank probably to make it explicit.

test/ELF/dynsec-at-beginning.s
17

I am not sure having "aw" section is useful here.
I would instead expect to see foo as "a" section instead. To show
how non-synthetic SHF _ALLOC sections are placed with the new code.
Because currently, you show the layout of "a" synthetic sections
(.dynsym, .dynstr, .hash) only.

shenhan updated this revision to Diff 144769.May 1 2018, 1:04 PM
shenhan marked 2 inline comments as done.
grimar added a comment.May 1 2018, 2:00 PM

This change looks good for me (with a better name for rank).
Currently, as you know, there are 57 failing test cases.
Please fix them and update the patch.

ELF/Writer.cpp
705

RF_DYN_SYM does not seem to be a good name because
you use it for both SHT_DYNSYM and SHT_STRTAB.

I do not have an idea about the good name now. Maybe RF_ALLOC_FIRST?

shenhan updated this revision to Diff 144979.May 2 2018, 9:36 PM

Fixed all affected test cases.

shenhan marked an inline comment as done.May 2 2018, 9:37 PM
grimar added a comment.May 3 2018, 5:27 AM

Few comments for now + please upload the patch with the context (just read the https://llvm.org/docs/Phabricator.html).

test/ELF/basic-ppc.s
15

Remove trailing whitespace, please.

60

(1) is new and excessive. The same issues below.

test/ELF/dynsec-at-beginning.s
13

.text is AX please keep the full attributes name.

ruiu accepted this revision.May 3 2018, 8:37 AM

LGTM

This revision is now accepted and ready to land.May 3 2018, 8:37 AM
grimar requested changes to this revision.May 3 2018, 8:46 AM

Please update the patch with my comments addressed, I would like to review the test cases carefully.

This revision now requires changes to proceed.May 3 2018, 8:46 AM
shenhan updated this revision to Diff 145041.May 3 2018, 10:19 AM
shenhan marked 2 inline comments as done.

Refined test cases changes and provide more context.

grimar added a comment.May 4 2018, 1:30 AM

Ok, I reviewed the changed test cases and unfortunately, now I have many concerns about this patch.
Issues I found are in comments, I think they are critical and needs fix/investigation first of all.

But I also think that given the issues observed and the fact that it makes
output different with the default output from gnu linkers,
I probably want to suggest to hide this new functionality under some -z <option> flag.

Will be happy to see the opinions of other people.

test/ELF/arm-exidx-shared.s
43–44

This calculation comment needs an update since values changed.

test/ELF/dynsec-at-beginning.s
14

Please address this comment (.text is AX, not just A)

test/ELF/linkerscript/no-space.s
4

I do not think this is correct, sorry. This change broke the intention of the test case I believe.

You did that because LLD starts triggering an error:

error: section .dynsym file range overlaps with foo
>>> .dynsym range is [0x1000, 0x1017]
>>> foo range is [0x1000, 0x1007]

What happens because orphan section .dynsym is placed before foo. And then during
assigning of VA, Dot is explicitly set to 0 for foo, making it intersect with .dynsym.

I think the expected behavior is to not place these orphan sections before foo, since
the position of foo is explicitly requested by linker script. And sure I do not think
we should emit an error in that case.

This case needs some investigation.
I'll be happy to hear other people opinion on that too.

test/ELF/linkerscript/out-of-order.s
3

The same here. I do not think you should "fix" this script. It was valid by itself. It should just work.

With the original linker script, this patch makes LLD to produce invalid output (actually no output at all as FileSize is broken).

shenhan added inline comments.May 4 2018, 10:13 AM
test/ELF/linkerscript/no-space.s
4

Thanks for pointing this out, yes, I agree we need to have consensus on this (and the "out-of-order.s" test case below).

Here is my thoughts:

  1. The test writer assume the fact that "foo" must be the first section in its segment (which is a group of output sections), otherwise, its address intersects with the first output section in the segment, because DOT is set to 0 for "foo" by linkerscript.
  1. There is no fact in the test case supports this assumption.
  1. So this test case might cause linker error whenever output sections change relative order in the segment.

So my suggestion:
Honor the assumption, so instead of change the DOT for foo, change it for .dynsym (the first output section of segment):

  • # RUN: echo "SECTIONS {foo 0 : {*(foo*)} }" > %t.script

+ # RUN: echo "SECTIONS {.dynsym 0 : {*(.dynsym)} }" > %t.script
This reserves the origin test intention.

.OR.
Always layout output sections with explicit address assignment first, then layout those without. So for this particular case, the origin output section order is:

.dynsym .dynstr foo

We layout it like below:

foo .dynsym .dynstr

.OR.
Report error on rewinding "." to a conflict address.

[This also applies to out-of-order.s test case below.]

I'm happy to go any direction.

Comments?

grimar added inline comments.May 4 2018, 11:13 AM
test/ELF/linkerscript/no-space.s
4

.OR.
Always layout output sections with explicit address assignment first

I believe it is the correct way. A script should specify the layout, any orphans
should be placed after the explicitly specified sections. That seems logically correct
and consistent with gnu linkers behavior (I think).

I would really want to hear from other people before proceeding this place probably.

shenhan added inline comments.May 4 2018, 1:40 PM
test/ELF/linkerscript/no-space.s
4

Thanks, George.

@ruiu Can you comment?

grimar added inline comments.May 7 2018, 8:40 AM
test/ELF/linkerscript/no-space.s
4

Can you update the test cases you had to change so that these sections be not the orphans (will be listed explicitly)?
Let's see if we can try to go with that change then and watch how it works in practice.

shenhan updated this revision to Diff 145501.May 7 2018, 10:51 AM

Revised linkerscript/out-of-order.s and linkerscript/no-space.s to honor layout constraint (also added notes). Also addressed previous 2 minor comments.

shenhan marked 2 inline comments as done.May 7 2018, 10:52 AM
grimar added a comment.May 8 2018, 3:21 AM

Thanks for an update!

I tested check-lld both on windows and ubuntu and
there are still 6 test cases that are failing for me.
(I think that was true for previous iteration too).

50> Failing Tests (6):
50> lld :: ELF/aarch64-copy.s
50> lld :: ELF/arm-copy.s
50> lld :: ELF/map-file.s
50> lld :: ELF/relocation.s
50> lld :: ELF/version-script-extern.s
50> lld :: ELF/x86-64-retpoline-znow-linkerscript.s

My suggestions about how we should adjust the test cases are also inline.
Let's discuss if anything I am suggesting does not look OK for some reason to you.

test/ELF/linkerscript/no-space.s
8

synsym -> dynsym

SECTIONS {foo : {*(foo*)} .dynsym : { *(.dynsym) } .dynstr : { *(.dynstr) } }

10

.dynstr is still an orphan.

Let's just list them explicitly here:
SECTIONS {foo 0 : {*(foo*)} .dynsym : { *(.dynsym) } .dynstr : { *(.dynstr) } }

test/ELF/linkerscript/out-of-order.s
3

I do not understand why you changed .data to .dynamic actually.
What your patch changed is the order of "a" sections. That breaks the current scripts
which previously would put dynsym/dynstr correctly, but now - don't.
And I think we should adjust only that part, no need to update anything else probably.

Lets do this:
SECTIONS { .data 0x4000 : { *(.data) } .dynsym 0x2000 : { *(.dynsym ) } .dynstr : { *(.dynstr ) } }

So what changed is that we explicitly start the second segment from .dynsym instead of .text,
so that '.dynsym'/.dynstr are not an orphans now, and that allows to place it correctly. Layout is then:

1 .data         00000008 0000000000004000 DATA
2 .dynamic      00000060 0000000000004008
3 .dynsym       00000018 0000000000002000
4 .dynstr       00000001 0000000000002018
5 .text         00000008 000000000000201c TEXT DATA
shenhan updated this revision to Diff 145718.May 8 2018, 10:15 AM
shenhan marked 3 inline comments as done.

Updated according to review feedback. Also synced to upstream HEAD and re-run all tests. All passed now.

shenhan updated this revision to Diff 145722.May 8 2018, 10:18 AM

Thanks for the review.

grimar accepted this revision.May 10 2018, 8:24 AM

I do not know what is wrong, but the following tests are still failing for me.

30> Failing Tests (5):
30> lld :: ELF/aarch64-copy.s
30> lld :: ELF/arm-copy.s
30> lld :: ELF/note-noalloc.s
30> lld :: ELF/relocation.s
30> lld :: ELF/version-script-extern.s

note-noalloc.s is a new one, you have to update it, but others are old and I see you already updated them in the patch,
they seem to be trivial, so not sure what is wrong here, please check your configuration because I tested on 2 and result was the same.

The rest of changes looks fine to me mostly I think (I'll probably adjust few minor things after your commit).

I do not want to be a stopper for this one anymore, so please feel free to commit after fixing above tests.
And let's see how this change works in practice.

This revision is now accepted and ready to land.May 10 2018, 8:24 AM
shenhan updated this revision to Diff 146216.May 10 2018, 1:51 PM

Synced to newest and re-ran tests, make sure everything passes.

Some clarification why my test result is different from George's.

I noticed that my build tree is configured with option "-DLLVM_OPTIMIZED_TABLEGEN=On", this seemingly innocuous option changed the behavior of at least 4 cases. After I dropped these 4 options, I saw the 4 new failures, and a new CL has been submitted to fix these 4. https://reviews.llvm.org/D46730

Some clarification why my test result is different from George's.

I noticed that my build tree is configured with option "-DLLVM_OPTIMIZED_TABLEGEN=On", this seemingly innocuous option changed the behavior of at least 4 cases. After I dropped these 4 options, I saw the 4 new failures, and a new CL has been submitted to fix these 4. https://reviews.llvm.org/D46730

FWIW, I am using -DLLVM_OPTIMIZED_TABLEGEN=1 for windows build too.

Sorry, but I had to revert this in r332085 "[ELF] - Revert of: r332038, r332054, r332060, r332061, r332062, r332063"

Bots were still broken. We need to find a reason for the failures.
Please, next time check the bots after commit. That is undesirable to have them broken for a long time.

Failing bots I observed were:
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-windows10pro-fast/builds/17042
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-ubuntu-fast/builds/29845

Now they are OK.

Sorry, but I had to revert this in r332085 "[ELF] - Revert of: r332038, r332054, r332060, r332061, r332062, r332063"

Thanks for reverting, I was also scratching my head over this.

Bots were still broken. We need to find a reason for the failures.
Please, next time check the bots after commit. That is undesirable to have them broken for a long time.

Failing bots I observed were:
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-windows10pro-fast/builds/17042
http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-ubuntu-fast/builds/29845

A couple more ones:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-bootstrap/builds/5428
http://lab.llvm.org:8011/builders/clang-with-lto-ubuntu/builds/8464
http://lab.llvm.org:8011/builders/clang-lld-x86_64-2stage/builds/5405
http://lab.llvm.org:8011/builders/clang-cmake-aarch64-lld/builds/4560

btw, before reverting I tried check-lld locally. I had test failures too, though I had 4 when bots had 2. My guess is that something is wrong with the sorting of output sections.
Maybe stable/unstable sorting is a key to resolving the issue, I had no chance to look closer yet, I am busy with a different task right now. I'll try to take a look when finish things I am working on.

Sorry for breaking the builders. Should double check again after my fix for the broken 4 cases.

I'll look into the unstableness of output section sorting.

shenhan updated this revision to Diff 146743.May 14 2018, 9:03 PM
shenhan edited the summary of this revision. (Show Details)

Merge D46764 (Fix test cases that check addresses that are not always invariable) into this.

Also address Peter Smith's comments.

grimar accepted this revision.May 15 2018, 2:32 AM

All tests pass for me now. LGTM, thanks!

This revision was automatically updated to reflect the committed changes.
NickHung added inline comments.
test/ELF/linkerscript/addr-zero.test
11

This change breaks the value of foo.
The expected value of foo is 0x0.

grimar added inline comments.Aug 15 2018, 6:15 AM
test/ELF/linkerscript/addr-zero.test
11

I think the script should be just changed to

foo = ADDR(.dynsym) - ABSOLUTE(ADDR(.dynsym));

where .dynsym is the first section now.

The problem with this script seems to be that it tries to evaluate ABSOLUTE value for .text
too early. When we assign to foo we do not yet know the absolute address of the .text.

Before this patch, .text was the first section at VA = 0, so this just worked.
Such script would not produce zero value for foo with any other
section other than .text I believe.

I am not sure (and did not find) what was the initial intention to support script like this.
Perhaps there was some other issue which produced some broken value instead of zero.