This is an archive of the discontinued LLVM Phabricator instance.

[LLD][ELF][AArch64] Add AArch64 short range thunk support
ClosedPublic

Authored by peter.smith on Apr 19 2023, 3:22 AM.

Details

Summary

The AArch64 branch immediate instruction has a 128MiB range. This makes it suitable for use as a short range thunk in the same way as short thunks are implemented in Arm and PPC. This patch adds support for short range thunks to AArch64.

Adding short range thunk support should mean that OutputSections can grow to nearly 256 MiB in size without needing long-range indirect branches.
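
As a rough illustration of the range being discussed: the B/BL immediate is a signed 26-bit word offset, so a single branch covers byte offsets from -128 MiB to +128 MiB - 4. The helper name below is hypothetical and is not taken from the patch; it is only a sketch of the range check.

```cpp
#include <cstdint>

// Byte range of the AArch64 B/BL immediate: a signed 26-bit word offset.
constexpr int64_t kBranchMin = -(int64_t{1} << 27);     // -128 MiB
constexpr int64_t kBranchMax = (int64_t{1} << 27) - 4;  // +128 MiB - 4

// Hypothetical helper: true if a single B placed at `src` can reach `dst`.
constexpr bool inBranch26Range(uint64_t src, uint64_t dst) {
  int64_t offset = static_cast<int64_t>(dst - src);
  return offset >= kBranchMin && offset <= kBranchMax && offset % 4 == 0;
}
```

A short thunk is just such a branch placed in a thunk section, so the check applies from the thunk's address rather than from the original call site.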

Event Timeline

peter.smith created this revision. Apr 19 2023, 3:22 AM
peter.smith requested review of this revision. Apr 19 2023, 3:22 AM
MaskRay added a comment. Edited Apr 19 2023, 3:14 PM

Thanks for the patch. AArch64ADRPThunk and AArch64ABSLongThunk duplicate writeTo and getMayUseShortThunk now. Shall we define a base class for the two classes to share code?

This makes it suitable for use a short range thunk in the same way as short thunks are implemented in Arm and PPC.

Is Arm ambiguous here? AArch32 and PPC64?


Is there any analysis how frequently this short thunk mode is going to trigger?

We need a thunk for b far or bl far. By using a b instruction, we can reach from +-128MiB at the thunk section location (instead of the original call site).
This does not guarantee a 256MiB range without indirect branches, as we don't necessarily place the thunk 128MiB from the call site.
This is in part because ThunkCreator::getThunk picks the first available thunk, not the best one.

Let's use aarch64-call26-thunk.s as an example. If I change Inputs/abs.s to use big = 0x8210120 (shorter addresses don't need a thunk), I'll get a short range thunk. If I use big = 0x8210124 or higher, I'll get a long range thunk.

I can check how to try improving ThunkCreator::getThunk. We will likely need a more powerful API than target->inBranchRange to get the distance, not just a binary result.
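
A minimal sketch of what a distance-aware selection could look like, under stated assumptions: branchDistance, fitsBranch and pickThunkSection are hypothetical names, thunk sections are reduced to bare addresses, and "best" is taken to mean "reachable from the caller and able to reach the destination with a single b". This is not the lld API or a proposed implementation.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

constexpr int64_t kBranchRange = int64_t{1} << 27;  // 128 MiB

// Signed byte distance from src to dst (a value, not just a yes/no answer).
inline int64_t branchDistance(uint64_t src, uint64_t dst) {
  return static_cast<int64_t>(dst - src);
}

inline bool fitsBranch(int64_t d) {
  return d >= -kBranchRange && d < kBranchRange;
}

// Hypothetical selection: among thunk section addresses reachable from the
// caller, prefer one from which the destination is within short-branch range.
// Falls back to the first reachable section (today's first-fit behaviour).
inline std::optional<uint64_t>
pickThunkSection(uint64_t caller, uint64_t dest,
                 const std::vector<uint64_t> &sections) {
  std::optional<uint64_t> firstReachable;
  for (uint64_t s : sections) {
    if (!fitsBranch(branchDistance(caller, s)))
      continue;
    if (fitsBranch(branchDistance(s, dest)))
      return s;  // A short thunk placed here would work.
    if (!firstReachable)
      firstReachable = s;
  }
  return firstReachable;
}
```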

Thanks for the patch. AArch64ADRPThunk and AArch64ABSLongThunk duplicate writeTo and getMayUseShortThunk now. Shall we define a base class for the two classes to share code?

That is possible, as the short branch code will be the same in both cases. With only two classes the duplication may end up simpler than a base class, though with three or more it wouldn't. I can certainly change that.

This makes it suitable for use a short range thunk in the same way as short thunks are implemented in Arm and PPC.

Is Arm ambiguous here? AArch32 and PPC64?

I meant that this uses the same strategy, a single branch instruction, as the ARM/Thumb (AArch32) thunks and, I think, the PPC64 ones.


Is there any analysis how frequently this short thunk mode is going to trigger?

The vast majority of user-space AArch64 programs need no range-extension thunks at all, as the executable segment is contiguous and smaller than 128 MiB. We have seen a small number of programs in the 128 MiB to 256 MiB range, such as a fully instrumented Chromium build; some Haskell programs also get this large naturally. For programs in this size range with a contiguous text segment I'd expect short thunks to replace the larger ones.

I would not expect it to trigger for linker scripts that separate the code into separate disjoint OutputSections.

We need a thunk for b far or bl far. By using a b instruction, we can reach from +-128MiB at the thunk section location (instead of the original call site).
This does not guarantee a 256MiB range without indirect branches, as we don't necessarily place the thunk 128MiB from the call site.
This is in part because ThunkCreator::getThunk picks the first available thunk, not the best one.

Let's use aarch64-call26-thunk.s as an example. If I change Inputs/abs.s to use big = 0x8210120 (shorter addresses don't need a thunk), I'll get a short range thunk. If I use big = 0x8210124 or higher, I'll get a long range thunk.

The way the code is written today is not optimal for short thunks, but it can work reasonably well for large contiguous OutputSections like .text. The initial pools are spaced at roughly branch-range intervals, so in a 256 MiB .text OutputSection there will be two pools, one roughly central. Callers at the start of the .text section to a destination near the end can nearly double their range, although the closer a caller gets to a pool the lower the benefits of range extension.
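
To put rough numbers on that, here is a small arithmetic sketch. The function names and the pool address used in the usage note are illustrative assumptions, not values taken from lld.

```cpp
#include <cstdint>

constexpr uint64_t MiB = uint64_t{1} << 20;
// Largest forward byte offset a single B can encode.
constexpr uint64_t kMaxForward = 128 * MiB - 4;

// Farthest forward address a caller can reach with a direct branch.
constexpr uint64_t directReach(uint64_t caller) { return caller + kMaxForward; }

// Farthest forward address reachable through a short thunk in `pool`.
// If the pool is behind the caller or out of its range, nothing is gained.
constexpr uint64_t reachViaPool(uint64_t caller, uint64_t pool) {
  if (pool < caller || pool - caller > kMaxForward)
    return directReach(caller);
  return pool + kMaxForward;
}

// Extra forward reach in bytes provided by the pool: the caller-to-pool gap.
constexpr uint64_t gain(uint64_t caller, uint64_t pool) {
  return reachViaPool(caller, pool) - directReach(caller);
}
```

With a pool assumed at 127 MiB, a caller at address 0 gains 127 MiB of reach, while a caller at 120 MiB gains only 7 MiB.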

Arm's proprietary linker armlink has a much more complicated thunk assignment algorithm. Essentially it works out, from the callers' addresses and the destination address, the valid address range for thunk insertion and uses the mid-point of that range. This has its drawbacks: while the address ranges are close to continuous, the valid insertion points between sections are not, so we can end up with the only valid insertion points being in the middle of a section, which requires special-case code. All possible, but quite a lot more complexity. It also makes the binary layout a lot messier and harder to predict, as thunks are scattered around the image.
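
The mid-point idea can be sketched as an interval intersection. This is only an illustration of the description above, not armlink's actual algorithm; thunkMidpoint is a hypothetical name, and the sketch ignores section boundaries, which is exactly the complication mentioned.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

constexpr int64_t kMin = -(int64_t{1} << 27);     // -128 MiB
constexpr int64_t kMax = (int64_t{1} << 27) - 4;  // +128 MiB - 4

// Hypothetical helper: intersect the windows in which a short thunk is
// reachable from every caller and can itself reach `dest`, then return the
// 4-byte-aligned mid-point, or nullopt if the windows do not overlap.
inline std::optional<uint64_t>
thunkMidpoint(const std::vector<uint64_t> &callers, uint64_t dest) {
  // Window of thunk addresses from which `dest` is in branch range.
  int64_t lo = static_cast<int64_t>(dest) - kMax;
  int64_t hi = static_cast<int64_t>(dest) - kMin;
  for (uint64_t c : callers) {
    lo = std::max(lo, static_cast<int64_t>(c) + kMin);
    hi = std::min(hi, static_cast<int64_t>(c) + kMax);
  }
  lo = std::max<int64_t>(lo, 0);  // Addresses cannot be negative.
  if (lo > hi)
    return std::nullopt;
  int64_t mid = lo + (hi - lo) / 2;
  return static_cast<uint64_t>(mid & ~int64_t{3});
}
```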

I'll send an update with a base class. My summary is that I think this will help programs in the 128 MiB to 256 MiB range with a single .text OutputSection, most likely only a small handful of programs to date. Short thunks are much more useful on Arm (Thumb), as many programs are larger than 16 MiB.

Use base class to share code for Short Branch functionality.

tschuett added inline comments.
lld/ELF/Thunks.cpp
75

Either virtual or override, but never both. In this case you want override.
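
For readers unfamiliar with the convention, a tiny standalone example follows. The class names are made up for illustration and are not the ones in Thunks.cpp.

```cpp
#include <cstdint>

struct BaseThunkExample {
  virtual ~BaseThunkExample() = default;
  virtual uint32_t size() = 0;  // `virtual` introduces the virtual function.
};

struct ShortThunkExample : BaseThunkExample {
  // `override` alone already implies virtual and lets the compiler check the
  // signature, so writing `virtual uint32_t size() override` is redundant.
  uint32_t size() override { return 4; }
};
```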

Remove spurious virtual from declaration.

lld/ELF/Thunks.cpp
75

Thanks for spotting and apologies for missing that. Will fix.

MaskRay accepted this revision. Apr 21 2023, 3:20 PM

Thanks for the update. With the base class, the number of lines has decreased by 4 even with the long comment "An AArch64 thunk may be either short or long ...", so the base class looks worthwhile!
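
The shape of such a base class can be sketched in a standalone form. The getMayUseShortThunk, writeTo and writeLong names follow the discussion in this review, but the class names, constructor, setAddresses and address members are simplified assumptions for illustration and do not reproduce the code in Thunks.cpp.

```cpp
#include <cstdint>

// Standalone sketch; addresses are passed in directly instead of coming
// from lld's symbol and layout machinery.
class AArch64ThunkSketch {
public:
  AArch64ThunkSketch(uint64_t thunkVA, uint64_t destVA)
      : thunkVA(thunkVA), destVA(destVA) {}
  virtual ~AArch64ThunkSketch() = default;

  // Assumed helper: lets a caller model addresses changing between passes.
  void setAddresses(uint64_t newThunkVA, uint64_t newDestVA) {
    thunkVA = newThunkVA;
    destVA = newDestVA;
  }

  // Sticky: once the long form has been needed, the thunk stays long, so
  // its size cannot oscillate from pass to pass.
  bool getMayUseShortThunk() {
    if (!mayUseShortThunk)
      return false;
    int64_t offset = static_cast<int64_t>(destVA - thunkVA);
    mayUseShortThunk =
        offset >= -(int64_t{1} << 27) && offset < (int64_t{1} << 27);
    return mayUseShortThunk;
  }

  // Shared short-branch path; subclasses only supply the long sequence.
  void writeTo(uint8_t *buf) {
    if (!getMayUseShortThunk()) {
      writeLong(buf);
      return;
    }
    uint32_t imm26 =
        static_cast<uint32_t>((destVA - thunkVA) >> 2) & 0x03ffffffu;
    write32le(buf, 0x14000000u | imm26);  // b <dest>
  }

protected:
  static void write32le(uint8_t *buf, uint32_t v) {
    for (int i = 0; i < 4; ++i)
      buf[i] = static_cast<uint8_t>(v >> (8 * i));
  }
  uint64_t thunkVA;
  uint64_t destVA;

private:
  virtual void writeLong(uint8_t *buf) = 0;
  bool mayUseShortThunk = true;
};

// One possible long form: load an absolute 64-bit address and branch to it.
class AbsLongThunkSketch final : public AArch64ThunkSketch {
public:
  using AArch64ThunkSketch::AArch64ThunkSketch;

private:
  void writeLong(uint8_t *buf) override {
    write32le(buf, 0x58000050u);      // ldr x16, #8
    write32le(buf + 4, 0xd61f0200u);  // br x16
    for (int i = 0; i < 8; ++i)       // .xword dest
      buf[8 + i] = static_cast<uint8_t>(destVA >> (8 * i));
  }
};
```

The sticky flag is what keeps a thunk long once it has been written long, even if a later pass would bring the destination back into short-branch range.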

lld/ELF/Thunks.cpp
57–58

It seems more conventional to place constructors before member functions.

lld/test/ELF/aarch64-long-thunk-converge.s
7

Perhaps remove %t/a after the link. It is very large (420MiB).

9

The paragraph uses several spellings for a long thunk: long-thunk, long-range thunk, and long range thunk. Canonicalize them?

12

Our pass variable starts with 0.

/ In pass 0, bl foo requires a long-range thunk to reach foo. The thunk for bar increases the address of foo so that it can be reached by bl foo with a single b instruction.
/ In pass 1, we expect the long-range thunk for bl foo to remain long.

17
This revision is now accepted and ready to land. Apr 21 2023, 3:20 PM
peter.smith marked 5 inline comments as done. Apr 24 2023, 3:25 AM

Thanks for the review. I've uploaded a new diff with the suggestions.

Review comments prior to commit.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Apr 24 2023, 5:49 AM