
[LLD] Place .nv_fatbin section at the beginning of the executable.
Needs Review · Public

Authored by tra on May 25 2018, 3:47 PM.

Details

Reviewers
ruiu
espindola
Summary

This avoids relocation overflows in large CUDA binaries.

Event Timeline

tra created this revision. May 25 2018, 3:47 PM

Does this just not work well with other linkers, do other linkers also have special handling, or does it work better elsewhere for some other reason?

ruiu added a comment. May 25 2018, 4:03 PM

Though this should work as a local patch, I think this is too specific to the NVidia-supplied binary blob that gets linked to your program. Is there any way to generalize this or change your binary?

pcc added a subscriber: pcc. May 25 2018, 4:06 PM

Here's one generalization idea: move .rodata to the end of the r/o segment. That should let us drop the special cases for .dynsym and .dynstr as well.

ruiu added subscribers: shenhan, tra. May 25 2018, 4:16 PM

+Han Shen <shenhan@google.com>

Han, did you try that idea before?

+Han Shen <shenhan@google.com>

Han, did you try that idea before?

No, I didn't try placing .rodata at the end of the r/o segment. I think it might work as well (I'd need to verify it on 15+ huge Google binaries). Is there a significant advantage over what is currently done?

tra added a comment. May 25 2018, 4:49 PM

Does this just not work well with other linkers, do other linkers also have special handling, or does it work better elsewhere for some other reason?

With ld.bfd I could use an ldscript with "INSERT BEFORE" in it to tell the linker where to place .nv_fatbin. Alas, we're not using ld.bfd.
ld.gold does not support INSERT BEFORE at all and currently just silently corrupts the executable when relocs overflow.
ld.lld sort of supports a custom ld script with INSERT BEFORE, but the problem is that it only overrides an explicitly provided full ldscript and does not work without one.
I've attempted to feed ld.lld linker scripts from ld.bfd, but that didn't get me far: I've failed to create an ldscript good enough to produce a working app. Either LLD is not happy about section placement, or the executable does not work (usually failing in the dynamic linker or early libc init phases).

So, tweaking LLD is the only practical option I see ATM.

Though this should work as a local patch, I think this is too specific to the NVidia-supplied binary blob that gets linked to your program. Is there any way to generalize this or change your binary?

This is not just for NVidia-supplied libs. This is also the section where clang puts generated GPU code.
I'm not sure what we can generalize here or how we can change the binary.

In D47396#1113019, @pcc wrote:

Here's one generalization idea: move .rodata to the end of the r/o segment. That should let us drop the special cases for .dynsym and .dynstr as well.

It only works if there are no other read-only sections that need to be accessed from .text.
For instance, various exception handling bits may end up being placed before .nv_fatbin, and we'd still get reloc overflows.
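To make the overflow concern concrete, here is a rough sketch of the range check behind a 32-bit PC-relative relocation, applied to two section orders. All section sizes are made up for illustration, and the set of sections assumed to be linked to .text by such relocations (`linked_to_text`) is a hypothetical stand-in, not something taken from a real binary.

```python
# Illustrative only: sizes are invented and the layout ignores alignment.
GIB = 1 << 30
MIB = 1 << 20

def layout(sections, base=0):
    """Assign consecutive start addresses to (name, size) pairs, in order."""
    addrs = {}
    addr = base
    for name, size in sections:
        addrs[name] = addr
        addr += size
    return addrs

def fits_pc32(place, target):
    """True if target - place is representable as a signed 32-bit value."""
    return -(1 << 31) <= target - place < (1 << 31)

# Hypothetical sizes: one huge .nv_fatbin and a few small sections.
sizes = {".nv_fatbin": 3 * GIB, ".eh_frame": MIB, ".rodata": MIB, ".text": MIB}

# Assumption: only these sections are tied to .text by 32-bit PC-relative relocs.
linked_to_text = [".eh_frame", ".rodata"]

def overflowing(order):
    """Return the linked sections that end up out of 32-bit range of .text."""
    addrs = layout([(name, sizes[name]) for name in order])
    return [n for n in linked_to_text if not fits_pc32(addrs[".text"], addrs[n])]

fatbin_first = overflowing([".nv_fatbin", ".eh_frame", ".rodata", ".text"])
fatbin_middle = overflowing([".eh_frame", ".nv_fatbin", ".rodata", ".text"])
print(fatbin_first)
print(fatbin_middle)
```

With these invented numbers, putting .nv_fatbin first keeps every linked section within range, while leaving a small section in front of it pushes that section more than 2 GiB away from .text.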

grimar added a subscriber: grimar. May 28 2018, 2:09 AM
ruiu added a comment. May 30 2018, 11:04 AM

While discussing this with Artem, I got this idea: sorting by size. If we sort sections by size so that larger sections are placed before smaller ones, we could solve the issue. This workaround is based on the observation that, if you have a very large section, it is unlikely that you refer to that section by a relative relocation, because doing so could easily cause relocation overflow. So, even though it's a heuristic, I don't think it's a bad idea.

What do you think?
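A minimal sketch of the size-sorting heuristic, not LLD's actual code: the `Section` tuple, the helper name, and the sizes are all hypothetical, and the sketch assumes the sorted sections share one read-only segment that is followed by .text.

```python
from collections import namedtuple

# Hypothetical stand-in for an output section; not an LLD type.
Section = namedtuple("Section", "name size")

def sort_by_size_desc(sections):
    """Order sections so that larger ones come first.

    sorted() is stable even with reverse=True, so sections of equal
    size keep their original relative order.
    """
    return sorted(sections, key=lambda s: s.size, reverse=True)

# Invented sizes for illustration.
ro_sections = [
    Section(".rodata", 4 << 20),
    Section(".nv_fatbin", 3 << 30),
    Section(".eh_frame", 1 << 20),
]

ordered_names = [s.name for s in sort_by_size_desc(ro_sections)]
print(ordered_names)
```

Under these assumptions the huge section lands at the start of the segment, so the small sections sit next to whatever follows it.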

pcc added a comment. May 30 2018, 11:34 AM

That seems like it would work. If it doesn't, I guess another idea would be to test for the property directly: does an output section have limited-range relocations referring to or from it? If so, move it to the end of the segment. This seems feasible because we scan relocations before ordering output sections.
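The direct test could look roughly like the sketch below. It is only an illustration of the idea: the relocation records are modeled as hypothetical `(source_section, target_section, is_limited_range)` tuples, the function name is invented, and the example relocations are made up rather than taken from a real link.

```python
def move_range_limited_to_end(order, relocs):
    """Stable-partition `order`: sections touched by a limited-range
    relocation (as source or target) go to the end of the segment."""
    limited = set()
    for src, dst, is_limited in relocs:
        if is_limited:
            limited.update((src, dst))
    unconstrained = [s for s in order if s not in limited]
    constrained = [s for s in order if s in limited]
    return unconstrained + constrained

# Hypothetical read-only segment order and relocation records.
ro_order = [".eh_frame", ".nv_fatbin", ".rodata"]
relocs = [
    (".text", ".rodata", True),      # assumed 32-bit PC-relative
    (".eh_frame", ".text", True),    # assumed 32-bit PC-relative
    (".data", ".nv_fatbin", False),  # assumed 64-bit absolute
]

new_order = move_range_limited_to_end(ro_order, relocs)
print(new_order)
```

Unlike sorting by size, this does not rely on the large section happening to have no limited-range references; it checks the property that actually matters, at the cost of needing the relocation scan results before ordering.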