This avoids relocation overflows in large CUDA binaries.
Does this just not work well with other linkers, do other linkers also have special handling, or does it work better elsewhere for some other reason?
Though this should work as a local patch, I think this is too specific to the NVIDIA-supplied binary blob that gets linked into your program. Is there any way to generalize this or change your binary?
Here's one generalization idea: move .rodata to the end of the r/o segment. That should let us drop the special cases for .dynsym and .dynstr as well.
No, I didn't try placing .rodata at the end of the r/o segment. I think it might work as well (I'd need to verify it on 15+ huge Google binaries). Is there a significant advantage over what is currently done?
With ld.bfd I could use ldscript with "INSERT BEFORE" in it to tell it where to place .nv_fatbin. Alas, we're not using it.
ld.gold does not support INSERT BEFORE at all and currently just silently corrupts the executable when relocs overflow.
ld.lld sort of supports a custom linker script with INSERT BEFORE, but the problem is that it only modifies an explicitly provided full ldscript and does not work without one.
I've attempted to feed ld.lld linker scripts from ld.bfd, but that didn't get me far. I've failed to create an ldscript good enough to produce a working app. Either LLD is not happy about section placement, or the executable does not work (usually failing in the dynamic linker or early libc init phases).
So, tweaking LLD is the only practical option I see ATM.
This is not just for NVIDIA-supplied libs. This is also the section where clang puts generated GPU code.
I'm not sure what we can generalize here or how we can change the binary.
It only works if there are no other read-only sections that need to be accessed from .text.
For instance, various exception-handling bits may end up being placed before .nv_fatbin, and we'd still get reloc overflows.
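To illustrate the overflow concern, here is a minimal sketch of the range check, assuming a signed 32-bit PC-relative relocation (for example R_X86_64_PC32) and ignoring the addend. The helper name `fitsInPcRel32` and the addresses in the note below are hypothetical and are not taken from LLD.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not LLD code: returns true if a PC-relative
// reference from address `place` to address `target` fits in a signed
// 32-bit displacement. The relocation addend is ignored for simplicity.
bool fitsInPcRel32(uint64_t place, uint64_t target) {
  // Unsigned subtraction wraps; reinterpreting the result as signed
  // yields the signed distance between the two addresses.
  int64_t disp = static_cast<int64_t>(target - place);
  return disp >= std::numeric_limits<int32_t>::min() &&
         disp <= std::numeric_limits<int32_t>::max();
}
```

With .text at a hypothetical address of 0x90000000, a read-only section at 0x1000 is more than 2 GiB away and fails the check, while one at 0x8F000000 passes. Any read-only section laid out before a multi-gigabyte .nv_fatbin ends up in the first situation.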
While discussing this with Artem, I got this idea: sorting by size. If we sort sections so that larger sections are placed before smaller ones, we could solve the issue. This workaround is based on the observation that a very large section is unlikely to be referenced by a relative relocation, because doing so could easily cause a relocation overflow. So, even though it's a heuristic, I don't think it's a bad idea.
What do you think?
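A rough sketch of the size-based ordering, for discussion only. The `SizedSection` struct and `sortLargestFirst` function are hypothetical stand-ins, not LLD's actual data structures, and the sizes in the note below are made up.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an output section; not LLD's real type.
struct SizedSection {
  std::string name;
  uint64_t size;
};

// Order sections so that larger ones come first. stable_sort keeps the
// existing relative order of sections that have the same size.
void sortLargestFirst(std::vector<SizedSection>& secs) {
  std::stable_sort(secs.begin(), secs.end(),
                   [](const SizedSection& a, const SizedSection& b) {
                     return a.size > b.size;
                   });
}
```

Given, say, .rodata (4 KiB), .nv_fatbin (3 GiB) and .eh_frame (512 bytes), this puts .nv_fatbin first and leaves the small sections next to each other at the end of the list.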
That seems like it would work. If it doesn't, I guess another idea would be to test for the property directly: does an output section have limited-range relocations referring to or from it? If so, move it to the end of the segment. This seems feasible because we scan relocations before ordering output sections.
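A minimal sketch of that direct test, assuming the relocation scan has already recorded a per-section flag. The `ScannedSection` struct, its `hasLimitedRangeRelocs` field, and `moveRangeLimitedToEnd` are hypothetical names used for illustration, not existing LLD code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for an output section after relocation scanning.
struct ScannedSection {
  std::string name;
  // Assumed to be set during the scan if any limited-range relocation
  // refers to this section or originates from it.
  bool hasLimitedRangeRelocs;
};

// Move sections involved in limited-range relocations to the end of the
// list, preserving the relative order within each group.
void moveRangeLimitedToEnd(std::vector<ScannedSection>& secs) {
  std::stable_partition(secs.begin(), secs.end(),
                        [](const ScannedSection& s) {
                          return !s.hasLimitedRangeRelocs;
                        });
}
```

Unlike the size heuristic, this does not depend on how big a section happens to be: a section with no limited-range relocations moves ahead of the flagged ones regardless of its size.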