This is an archive of the discontinued LLVM Phabricator instance.

[lld/mac] Leave more room for thunks in thunk placement code
ClosedPublic

Authored by thakis on Aug 30 2021, 11:39 AM.


Details

Reviewers

gkm
int3

Group Reviewers

Restricted Project

Commits

rG86c8f395ae7a: [lld/mac] Leave more room for thunks in thunk placement code

Summary

Fixes PR51578 in practice.

Currently there's only enough room for a single thunk, which isn't enough for
real-life code. The error only occurs when many branch instructions sit very
close to each other (0 or 1 instructions apart) and the function at the
finalization barrier is small.

There's a FIXME on what to do if we hit this case, but that suggestion sounds
complicated to me (see end of PR51578 comment 5 for why).

Instead, just leave more room for thunks. Chromium's unit_tests links fine with
room for 3 thunks. Leave room for 100, which should fix this for most cases in
practice.

There's little cost to leaving lots of room: this slop value only determines
when we finalize sections, and we insert thunks for forward jumps into
unfinalized sections. So leaving more room means we'll need a few more thunks,
but the thunk jump range is 128 MiB while a single thunk is just 12 bytes.
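
To put that cost in numbers, here is a back-of-the-envelope sketch. The constant and function names are illustrative and not taken from lld; the 12-byte thunk size and the roughly 128 MiB range are the arm64 figures quoted above.

```cpp
#include <cstdint>

// Figures quoted in the summary for arm64; the names are illustrative only.
constexpr uint64_t kThunkSize = 12;              // bytes per thunk
constexpr uint64_t kBranchRange = 128ull << 20;  // roughly 128 MiB forward range

// Bytes withheld from finalization for a slop of `thunks` thunks.
constexpr uint64_t slopBytes(uint64_t thunks) { return thunks * kThunkSize; }

static_assert(slopBytes(1) == 12, "old behavior: room for one thunk");
static_assert(slopBytes(100) == 1200, "new behavior: room for 100 thunks");
// Even a slop of 1000 thunks stays below 0.01% of the branch range.
static_assert(slopBytes(1000) * 10000 < kBranchRange, "slop is negligible");
```

Under these assumptions the slop is a vanishing fraction of the branch range, which is consistent with the nearly unchanged thunk counts reported for Chromium.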

For Chromium's unit_tests:
With a slop of 3: thunk calls = 355418, thunks = 10903
With a slop of 100: thunk calls = 355426, thunks = 10904

Chances are 100 is enough for all use cases we'll hit in practice, but even
bumping it to 1000 would probably be fine.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

thakis created this revision. Aug 30 2021, 11:39 AM

Herald added a reviewer: gkm. Aug 30 2021, 11:39 AM

Herald added a project: Restricted Project.

thakis requested review of this revision. Aug 30 2021, 11:39 AM

Harbormaster completed remote builds in B121787: Diff 369507. Aug 30 2021, 11:40 AM

rebase?

Harbormaster completed remote builds in B121791: Diff 369511. Aug 30 2021, 12:18 PM

lgtm. I suppose it's a bit too awkward to unit test?

lld/MachO/ConcatOutputSection.cpp
Line 242 (int3): I'm confused about why the number of bytes for a call instruction is relevant here, since the call instruction will always be present regardless of whether the thunk is inserted.

This revision is now accepted and ready to land. Aug 30 2021, 12:32 PM

I thought it'd be awkward to test, but it actually wasn't. Added a test.

Thanks for the review!

lld/MachO/ConcatOutputSection.cpp
Line 242 (thakis): What matters is the distance to the instruction after the call instruction. If there are at least 12 bytes between call instructions, all's well. Since we're at a call instruction, there are guaranteed to be at least 4 bytes until the next call instruction. So we lose 8 bytes for every two directly consecutive calls.
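
The arithmetic in that reply can be sketched as follows. This is a simplified model and not lld's code; `marginLost` and both constants are hypothetical names, with the arm64 sizes from the comment assumed.

```cpp
#include <cstdint>

// Simplified model of the reply above; not lld's code. Arm64 sizes assumed.
constexpr int64_t kCallSize = 4;    // one branch instruction
constexpr int64_t kThunkSize = 12;  // one thunk

// Margin consumed by inserting one thunk when the next call site starts
// `gap` bytes after the current one (gap is at least kCallSize).
constexpr int64_t marginLost(int64_t gap) {
  return gap >= kThunkSize ? 0 : kThunkSize - gap;
}

static_assert(marginLost(12) == 0, "calls 12 bytes apart cost nothing");
static_assert(marginLost(8) == 4, "one instruction in between costs 4 bytes");
static_assert(marginLost(kCallSize) == 8, "back-to-back calls cost 8 bytes");
```

In this model only call sites closer together than one thunk eat into the margin, which matches the "0 or 1 instructions apart" condition in the summary.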

Closed by commit rG86c8f395ae7a: [lld/mac] Leave more room for thunks in thunk placement code (authored by thakis). Aug 30 2021, 7:09 PM

This revision was automatically updated to reflect the committed changes.

thakis added a commit: rG86c8f395ae7a: [lld/mac] Leave more room for thunks in thunk placement code.

Herald added a project: Restricted Project. Aug 30 2021, 7:09 PM

thevinster mentioned this in D116705: [lld-macho] Increase slops to prevent thunk out of range. Jan 5 2022, 9:54 PM

Revision Contents

Path	Size
lld/MachO/ConcatOutputSection.cpp	30 lines
lld/test/MachO/arm64-thunk-starvation.s	57 lines

Diff 369600

lld/MachO/ConcatOutputSection.cpp

[216 lines not shown] void ConcatOutputSection::finalize() {
   uint64_t backwardBranchRange = target->backwardBranchRange;
   uint64_t stubsInRangeVA = TargetInfo::outOfRangeVA;
   size_t thunkSize = target->thunkSize;
   size_t relocCount = 0;
   size_t callSiteCount = 0;
   size_t thunkCallCount = 0;
   size_t thunkCount = 0;

+  // Walk all sections in order. Finalize all sections that are less than
+  // forwardBranchRange in front of it.
+  // isecVA is the address of the current section.
+  // isecAddr is the start address of the first non-finalized section.
+
   // inputs[finalIdx] is for finalization (address-assignment)
   size_t finalIdx = 0;
   // Kick-off by ensuring that the first input section has an address
   for (size_t callIdx = 0, endIdx = inputs.size(); callIdx < endIdx;
        ++callIdx) {
     if (finalIdx == callIdx)
       finalizeOne(inputs[finalIdx++]);
     ConcatInputSection *isec = inputs[callIdx];
     assert(isec->isFinal);
     uint64_t isecVA = isec->getVA();
-    // Assign addresses up-to the forward branch-range limit
+    // Assign addresses up-to the forward branch-range limit.
+    // Every call instruction needs a small number of bytes (on Arm64: 4),
+    // and each inserted thunk needs a slightly larger number of bytes
+    // (on Arm64: 12). If a section starts with a branch instruction and
+    // contains several branch instructions in succession, then the distance
+    // from the current position to the position where the thunks are inserted
+    // grows. So leave room for a bunch of thunks.
+    unsigned slop = 100 * thunkSize;
     while (finalIdx < endIdx && isecAddr + inputs[finalIdx]->getSize() <
-                                    isecVA + forwardBranchRange - thunkSize)
+                                    isecVA + forwardBranchRange - slop)
       finalizeOne(inputs[finalIdx++]);

     if (isec->callSiteCount == 0)
       continue;

     if (finalIdx == endIdx && stubsInRangeVA == TargetInfo::outOfRangeVA) {
       // When we have finalized all input sections, __stubs (destined
       // to follow __text) comes within range of forward branches and
       // we can estimate the threshold address after which we can
       // reach any stub with a forward branch. Note that although it
       // sits in the middle of a loop, this code executes only once.
       // It is in the loop because we need to call it at the proper
       // time: the earliest call site from which the end of __text
[39 lines not shown] for (Reloc &r : reverse(relocs)) {
           uint64_t thunkVA = thunkInfo.isec->getVA();
           if (lowVA <= thunkVA && thunkVA <= highVA) {
             r.referent = thunkInfo.sym;
             continue;
           }
         }
         // ... otherwise, create a new thunk.
         if (isecAddr > highVA) {
-          // When there is small-to-no margin between highVA and
-          // isecAddr and the distance between subsequent call sites is
-          // smaller than thunkSize, then a new thunk can go out of
-          // range. Fix by unfinalizing inputs[finalIdx] to reduce the
-          // distance between callVA and highVA, then shift some thunks
-          // to occupy address-space formerly occupied by the
-          // unfinalized inputs[finalIdx].
+          // There were too many consecutive branch instructions for `slop`
+          // above. If you hit this: For the current algorithm, just bumping up
+          // slop above and trying again is probably simplest. (See also PR51578
+          // comment 5).
           fatal(Twine(__FUNCTION__) + ": FIXME: thunk range overrun");
         }
         thunkInfo.isec =
             make<ConcatInputSection>(isec->getSegName(), isec->getName());
         thunkInfo.isec->parent = this;

         // This code runs after dead code removal. Need to set the `live` bit
         // on the thunk isec so that asserts that check that only live sections
[83 lines not shown]
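
A minimal standalone sketch of how the slop affects the finalization loop in the diff above. `countFinalized` is a hypothetical stand-in, not part of lld, and it assumes that finalizing a section simply advances `isecAddr` by that section's size, ignoring alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the while loop in ConcatOutputSection::finalize().
// Counts how many sections (given by size) are assigned addresses while they
// still end below isecVA + forwardBranchRange - slop.
// Assumption: finalizing a section advances isecAddr by its size.
size_t countFinalized(const std::vector<uint64_t> &sizes, uint64_t isecAddr,
                      uint64_t isecVA, uint64_t forwardBranchRange,
                      uint64_t slop) {
  size_t finalIdx = 0;
  while (finalIdx < sizes.size() &&
         isecAddr + sizes[finalIdx] < isecVA + forwardBranchRange - slop) {
    isecAddr += sizes[finalIdx];  // stands in for finalizeOne()
    ++finalIdx;
  }
  return finalIdx;
}
```

With three 100-byte sections, a toy range of 250 bytes and both addresses at 0, a slop of 0 finalizes two sections and a slop of 60 finalizes one. A larger slop therefore stops finalization earlier, leaving more address space ahead of the current call site for thunks.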

lld/test/MachO/arm64-thunk-starvation.s

This file was added.

# REQUIRES: aarch64
# RUN: llvm-mc -filetype=obj -triple=arm64-apple-darwin %s -o %t.o
# RUN: %lld -arch arm64 -lSystem -o %t.out %t.o

## Regression test for PR51578.

.subsections_via_symbols

.globl _f1, _f2, _f3, _f4, _f5, _f6
.p2align 2
_f1: b _fn1
_f2: b _fn2
_f3: b _fn3
_f4: b _fn4
_f5: b _fn5
_f6: b _fn6
## 6 * 4 = 24 bytes for branches
## Currently leaves 12 bytes for one thunk, so 36 bytes.
## Uses < instead of <=, so 40 bytes.

.global _spacer1, _spacer2
## 0x8000000 is 128 MiB, 4 bytes more than the forward branch limit,
## distributed over two functions since our thunk insertion algorithm
## can't deal with a single function that's 128 MiB.
## We leave just enough room so that the old thunking algorithm finalized
## both spacers when processing _f1 (24 bytes for the 4 bytes code for each
## of the 6 _f functions, 12 bytes for one thunk, 4 bytes because the forward
## branch range is 128 MiB - 4 bytes, and another 4 bytes because the algorithm
## uses `<` instead of `<=`, for a total of 44 bytes slop.) Of the slop, 20
## bytes are actually room for thunks.
## _fn1-_fn6 aren't finalized because then there wouldn't be room for a thunk.
## But when a thunk is inserted to jump from _f1 to _fn1, that needs 12 bytes
## but _f2 is only 4 bytes later, so after _f1 there are only
## 20-(12-4) = 12 bytes left, after _f2 only 12-(12-4) = 4 bytes, and after
## _f3 there's no more room for thunks and we can't make progress.
## The fix is to leave room for many more thunks.
## The same construction as this test case can defeat that too with enough
## consecutive jumps, but in practice there aren't hundreds of consecutive
## jump instructions.

_spacer1:
.space 0x4000000
_spacer2:
.space 0x4000000 - 44

.globl _fn1, _fn2, _fn3, _fn4, _fn5, _fn6
.p2align 2
_fn1: ret
_fn2: ret
_fn3: ret
_fn4: ret
_fn5: ret
_fn6: ret

.globl _main
_main:
ret
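
The countdown described in the test's comment can be reproduced with a small model. This is a simplification of that comment and not lld's algorithm; `roomAfterEachCall` and its parameters are hypothetical names, defaulting to the arm64 sizes.

```cpp
#include <vector>

// Simplified model of the comment in arm64-thunk-starvation.s, not lld's
// algorithm: each directly consecutive branch that needs a thunk eats
// (thunkSize - callSize) bytes of the room left for thunks.
std::vector<int> roomAfterEachCall(int room, int calls, int thunkSize = 12,
                                   int callSize = 4) {
  std::vector<int> remaining;
  for (int i = 0; i < calls; ++i) {
    room -= thunkSize - callSize;  // 8 bytes per call with the defaults
    remaining.push_back(room);
  }
  return remaining;
}
```

Starting from the 20 bytes of thunk room in the test, `roomAfterEachCall(20, 3)` yields 12, 4 and -4, matching the comment: after `_f3` the room is exhausted.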