This is an archive of the discontinued LLVM Phabricator instance.

Differential D47396

[LLD] Place .nv_fatbin section at the beginning of the executable.
Needs Review · Public

Authored by tra on May 25 2018, 3:47 PM.

Details

Reviewers

ruiu
espindola

Summary

This avoids relocation overflows in large CUDA binaries.

Diff Detail

Build Status

Buildable 18606
Build 18606: arc lint + arc unit

Event Timeline

tra created this revision. · May 25 2018, 3:47 PM

Herald added a reviewer: espindola. · May 25 2018, 3:47 PM

Herald added subscribers: bixia, arichardson, jlebar and 2 others.

Does this just not work well with other linkers, do other linkers also have special handling, or does it work better elsewhere for some other reason?

Though this should work as a local patch, I think this is too specific to the NVidia-supplied binary blob that gets linked to your program. Is there any way to generalize this or change your binary?

Here's one generalization idea: move .rodata to the end of the r/o segment. That should let us drop the special cases for .dynsym and .dynstr as well.

+Han Shen <shenhan@google.com>

Han, did you try that idea before?

In D47396#1113043, @ruiu wrote:

+Han Shen <shenhan@google.com>

Han, did you try that idea before?

No, I didn't try placing .rodata at the end of the r/o segment. I think it might work as well (I would need to verify it on 15+ huge Google binaries). Is there a clear advantage over what is currently done?

In D47396#1113016, @hfinkel wrote:

Does this just not work well with other linkers, do other linkers also have special handling, or does it work better elsewhere for some other reason?

With ld.bfd I could use ldscript with "INSERT BEFORE" in it to tell it where to place .nv_fatbin. Alas, we're not using it.
ld.gold does not support INSERT BEFORE at all and currently just silently corrupts the executable when relocs overflow.
ld.lld sort of supports a custom linker script with INSERT BEFORE, but it only modifies an explicitly provided full linker script and does not work without one.
I've attempted to feed ld.lld the linker scripts from ld.bfd, but that didn't get me far: I failed to create a linker script good enough to produce a working app. Either LLD is not happy about section placement, or the executable does not work (usually failing in the dynamic linker or early libc init phases).

So, tweaking LLD is the only practical option I see ATM.

In D47396#1113018, @ruiu wrote:

Though this should work as a local patch, I think this is too specific to the NVidia-supplied binary blob that gets linked to your program. Is there any way to generalize this or change your binary?

This is not just for nvidia-supplied libs. This is also the section where clang puts generated GPU code as well.
I'm not sure what we can generalize here or how we can change the binary.

In D47396#1113019, @pcc wrote:

Here's one generalization idea: move .rodata to the end of the r/o segment. That should let us drop the special cases for .dynsym and .dynstr as well.

It only works if there are no other read-only sections that need to be accessed from .text.
For instance, various exception handling bits may end up being placed before .nv_fatbin, and we'd still get reloc overflows.
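
The overflow discussed in this thread can be illustrated with a small range check. The sketch below is not lld code; `fitsPcRel32` and its parameter names are hypothetical. It assumes an x86-64 style 32-bit PC-relative relocation (for example R_X86_64_PC32), whose stored value S + A - P must fit in a signed 32-bit field.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of lld. Computes the value a 32-bit
// PC-relative relocation would store (S + A - P) and reports whether it
// fits in a signed 32-bit field.
//   symbolAddr: address of the referenced symbol (S)
//   addend:     relocation addend (A)
//   placeAddr:  address of the field being patched (P)
inline bool fitsPcRel32(uint64_t symbolAddr, int64_t addend,
                        uint64_t placeAddr) {
  // Unsigned subtraction wraps; the cast reinterprets it as a signed distance.
  int64_t value = static_cast<int64_t>(symbolAddr - placeAddr) + addend;
  return value >= std::numeric_limits<int32_t>::min() &&
         value <= std::numeric_limits<int32_t>::max();
}
```

Under these assumptions, a reference from code at 0x1000 to data that sits 3 GiB further along fails the check, e.g. `assert(!fitsPcRel32(0x1000 + 0xC0000000ULL, -4, 0x1000));`, while the same reference to nearby data passes. A multi-gigabyte section placed between the two ends of such a reference is what pushes the distance out of range.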

grimar added a subscriber: grimar. · May 28 2018, 2:09 AM

While discussing this with Artem, I got this idea: sorting by size. If we sort sections by size so that larger sections are placed before smaller ones, we could solve the issue. This workaround is based on the observation that a very large section is unlikely to be referenced by a relative relocation, because doing so could easily cause a relocation overflow. So, even though it's a heuristic, I don't think it's a bad idea.

What do you think?
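
The size-based heuristic proposed above could look roughly like the following. This is a minimal sketch and not lld's implementation: `SizedSection` and `sortLargestFirst` are hypothetical names, and a real linker would apply such an ordering only among sections that are otherwise ranked equally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an output section; lld's OutputSection
// carries far more state than this.
struct SizedSection {
  std::string name;
  uint64_t size;
};

// Orders sections largest-first. std::stable_sort keeps the original
// relative order of sections that have the same size.
inline void sortLargestFirst(std::vector<SizedSection> &sections) {
  std::stable_sort(sections.begin(), sections.end(),
                   [](const SizedSection &a, const SizedSection &b) {
                     return a.size > b.size;
                   });
}
```

Given `.rodata` (100 bytes), `.nv_fatbin` (5000 bytes) and `.eh_frame` (100 bytes) in that order, the result is `.nv_fatbin`, `.rodata`, `.eh_frame`: the large blob moves to the front and the two small sections keep their relative order.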

That seems like it would work. If it doesn't, I guess another idea would be to test for the property directly: does an output section have limited-range relocations referring to or from it? If so, move it to the end of the segment. This seems feasible because we scan relocations before ordering output sections.
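
The alternative of testing the property directly might be sketched as below. It assumes the relocation scan has already recorded, per section, whether any limited-range relocation refers to or from it; the `ScannedSection` type, its `hasLimitedRangeRelocs` flag and `moveRangeLimitedToEnd` are all hypothetical and do not correspond to lld's data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an output section after relocation scanning.
struct ScannedSection {
  std::string name;
  // Assumed to be set by an earlier scan if any limited-range relocation
  // refers to or from this section.
  bool hasLimitedRangeRelocs;
};

// Moves sections involved in limited-range relocations to the end of the
// segment, preserving relative order within each of the two groups.
inline void moveRangeLimitedToEnd(std::vector<ScannedSection> &sections) {
  std::stable_partition(sections.begin(), sections.end(),
                        [](const ScannedSection &s) {
                          return !s.hasLimitedRangeRelocs;
                        });
}
```

With `.rodata` and `.eh_frame` flagged and `.nv_fatbin` and `.dynstr` unflagged, the flagged sections end up last, which in a read-only segment followed by .text places them closest to the code that references them.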

MaskRay mentioned this in D74339: Make .rodata* and .eh_frame* the last of all PROGBITS sections. · Feb 10 2020, 12:28 PM

Revision Contents

Path                                     Size
lld/ELF/Writer.cpp                       7 lines
lld/test/ELF/nv_fatbin-at-beginning.s    19 lines

Diff 148680

lld/ELF/Writer.cpp

 static unsigned getSectionRank(const OutputSection *Sec) {
 [...]
   // huge .dynsym and .dynstr sections placed between ro-data and text
   // sections cause relocation overflow. Note: .dynstr has SHT_STRTAB
   // type and SHF_ALLOC attribute, whereas sections that only have
   // SHT_STRTAB but without SHF_ALLOC is placed at the end. All "Sec"
   // reaching here has SHF_ALLOC bit set.
   if (Sec->Type == SHT_DYNSYM || Sec->Type == SHT_STRTAB)
     return Rank | RF_ALLOC_FIRST;

+  // GPU binaries may grow quite large which may lead to relocation overflows if
+  // the section is placed between .text and regular data segments. Placing the
+  // section at the beginning of SHF_ALLOC, similarly to .dynstr/.dynsym above
+  // mitigates the problem.
+  if (Sec->Name == ".nv_fatbin")
+    return Rank |= RF_ALLOC_FIRST;
+
   // Sort sections based on their access permission in the following
   // order: R, RX, RWX, RW. This order is based on the following
   // considerations:
   // * Read-only sections come first such that they go in the
   //   PT_LOAD covering the program headers at the start of the file.
   // * Read-only, executable sections come next, unless the
   //   -no-rosegment option is used.
   // * Writable, executable sections follow such that .plt on
 [...]

lld/test/ELF/nv_fatbin-at-beginning.s

This file was added.

# REQUIRES: x86
# RUN: llvm-mc -filetype=obj -triple=x86_64-pc-linux %s -o %t

# RUN: ld.lld --hash-style=gnu -o %t1 %t -shared
# RUN: llvm-readobj -elf-output-style=GNU -s %t1 | FileCheck %s

# .nv_fatbin has the same priority as .dynsym/.dynstr and is expected to be
# located before .rodata and .text
# CHECK-DAG: .nv_fatbin {{.*}} A
# CHECK-DAG: .dynsym {{.*}} A
# CHECK-DAG: .dynstr {{.*}} A
# CHECK: .rodata {{.*}} A
# CHECK: .text {{.*}} AX

.section .rodata, "a"
.byte 1

.section .nv_fatbin, "a"
.byte 0