This is an archive of the discontinued LLVM Phabricator instance.

[PPC64] implement Thunk Section Spacing
AbandonedPublic

Authored by adalava on May 6 2019, 2:02 PM.

Details

Summary

This implements PPC64 Thunk Section Spacing to be in PPC64_REL14 range suggestion by sfertile while discussing [1] corruption of ".init" section seen during tests building FreeBSD for PowerPC64-ELFv2 ABI.

[1] https://bugs.llvm.org/show_bug.cgi?id=40740

Event Timeline

adalava created this revision.May 6 2019, 2:02 PM
Herald added a project: Restricted Project. · View Herald Transcript
adalava edited the summary of this revision. (Show Details)May 6 2019, 2:14 PM
adalava edited reviewers, added: ruiu, sfertile; removed: espindola.
adalava added a subscriber: luporl.
ruiu added inline comments.May 6 2019, 10:32 PM
lld/ELF/Arch/PPC64.cpp
1056

Could you extend the comment a bit to make it a complete sentence explaining what this value means. I'd say we create initial thunks at regular spacing so that most relocations reach the thunks. On PPC64, REL14 (whose offset is 0x8000) is the most strict relocation.

@atanasyan Should MIPS override getThunkSectionSpacing as well? Then we can make the function pure virtual: virtual uint32_t getThunkSectionSpacing() const = 0;

lld/ELF/Arch/PPC64.cpp
1057

This probably should be slightly smaller than 0x8000.

@peter.smith probably has idea how the ARM/AArch64 constants were derived.

uint32_t AArch64::getThunkSectionSpacing() const {
  // See comment in Arch/ARM.cpp for a more detailed explanation of
  // getThunkSectionSpacing(). For AArch64 the only branches we are permitted to
  // Thunk have a range of +/- 128 MiB
  return (128 * 1024 * 1024) - 0x30000;
}
lld/test/ELF/ppc64-branch-thunkspacing.s
4

-o %t.o

6

No 2>&1. It is only used to check warnings/errors.

7

You can omit -triple=powerpc64-unknown-freebsd13.0. llvm-objdump auto detects the triple.

19

mixed tab and spaces? Probably move the bl instruction to another .init section:

.section .init,"ax",@progbits,unique,2
.align 2
  bl too_far1

to mimic the crti.o crtn.o .init

MaskRay removed a subscriber: MaskRay.

Thanks for posting this @adalava.

lld/ELF/Arch/PPC64.cpp
1057

The conditional branch instructions only encode a 14-bit immediate which gets shifted left 2 places to form a 4-byte aligned signed 16-bit offset. This gives us a range of [0x8000, 0x7ffc]. Since we have slightly less 'reach' with positive offsets we should start with a spacing of 0x7ffc instead of 0x8000. Then I guess we want to make it slightly smaller so that a branch can reach more then just the first/last instruction in the thunk section.

@atanasyan Should MIPS override getThunkSectionSpacing as well? Then we can make the function pure virtual: virtual uint32_t getThunkSectionSpacing() const = 0;

I believe the Mips LA25 Thunks are not range extension thunks, they are PI to non-PI and they are a special case where they have to precede their Target Section. We do this with getTargetInputSection() so I think it doesn't matter for Mips.

lld/ELF/Arch/PPC64.cpp
1057

Sorry for the delay in responding. The ThunkSectionSpacing is derived from the branch range - a slop-space (0x30000) bytes to account for the space of thunks that could be generated in between the branch and destination. This is quite pessimistic but given the range is 128 Megabytes, this isn't much of a problem. What I suggest is to think how many Thunks would get created in any [0x8000, 0x7ffc] span and then subtract the size in bytes from that.

It is worth noting that in terms of range, the implementation should fail safe, i.e. it will place any Thunk that can't read one of the pre-created ThunkSections prior in a newly created ThunkSection either before or after the Target.

I note that this caused a problem as there were two InputSections that should not be split by inserting a Thunk in between them. I think the only fool proof solution to that is to make the Thunk implementation aware of the restriction. Arm's proprietary linker has "inline thunks" for v4t that have control flow that falls-through into the following section, in that linker we had to make sure that no thunk got inserted in between the "inline thunk" and the following section.

adalava updated this revision to Diff 198480.May 7 2019, 8:29 AM
adalava edited the summary of this revision. (Show Details)
  • add space reservation, add comments and improve test
adalava updated this revision to Diff 198481.May 7 2019, 8:31 AM

fix patch

adalava marked 7 inline comments as done.May 7 2019, 8:36 AM

Thank you guys. I updated the patch,;

@sfertile , I posted the patch while you're adding your comment. I guess last version covers your concern, but please give another look.

MaskRay added a comment.EditedMay 8 2019, 12:06 AM

I was trying to find a good value used in
uint32_t PPC64::getThunkSectionSpacing() const { return 0x8000 - 0x1000; } before I noticed this caused an extreme slowdown in thunk creation.

We have a large program whose text segment is of 1.3GiB. Its OutputSection .text consists of 1948775 InputSections. It takes 20 seconds to link but with this patch it doesn't stop in 10 minutes.

The reason is that:

// O(|ThunkVec|). Not the critical path, but there is a bottleneck, too. Many ThunkVec's have hundreds/thousands of elements
            std::tie(T, IsNew) = getThunk(*Rel.Sym, Rel.Type, Src); 

            if (IsNew) { // called 112353 times
              // Find or create a ThunkSection for the new Thunk
              ThunkSection *TS;
              if (auto *TIS = T->getTargetInputSection())
                TS = getISThunkSec(TIS);
              else
                TS = getISDThunkSec(OS, IS, ISD, Rel.Type, Src); // it takes O(|ISD->ThunkSections|) time, `ISD->ThunkSections` is large (25980) when getThunkSectionSpacing is enabled.
              TS->addThunk(T);
              Thunks[T->getThunkTargetSym()] = T;
            }
lld/test/ELF/ppc64-branch-thunkspacing.s
19

This comment is not done.

You should create several .init sections to emulate .init used on FreeBSD: it is built from bits spread across several object files.

I was trying to find a good value used in
uint32_t PPC64::getThunkSectionSpacing() const { return 0x8000 - 0x1000; } before I noticed this caused an extreme slowdown in thunk creation.

We have a large program whose text segment is of 1.3GiB. Its OutputSection .text consists of 1948775 InputSections. It takes 20 seconds to link but with this patch it doesn't stop in 10 minutes.

The reason is that:

// O(|ThunkVec|). Not the critical path, but there is a bottleneck, too. Many ThunkVec's have hundreds/thousands of elements
            std::tie(T, IsNew) = getThunk(*Rel.Sym, Rel.Type, Src); 

            if (IsNew) { // called 112353 times
              // Find or create a ThunkSection for the new Thunk
              ThunkSection *TS;
              if (auto *TIS = T->getTargetInputSection())
                TS = getISThunkSec(TIS);
              else
                TS = getISDThunkSec(OS, IS, ISD, Rel.Type, Src); // it takes O(|ISD->ThunkSections|) time, `ISD->ThunkSections` is large (25980) when getThunkSectionSpacing is enabled.
              TS->addThunk(T);
              Thunks[T->getThunkTargetSym()] = T;
            }

Looking at the comment:

//      - b[l,a]  PPC64_REL24 range [33,554,432...33,554,428]
//      - bc[l,a] PPC64_REL14 range [-32,768...32764]
// We take the most strict range and intentionally use a lower
// size than the maximum branch range so the end of the ThunkSection
// is more likely to be within range of the branch instruction that 
// is furthest away.

I strongly recommend taking the longer range (32 Mb) for the ThunkSection spacing. If PPC is anything like Arm, I would be very surprised to see a lot of inter-section conditional branches, and with a range that low the chances of finding a reuse spot within 32k seems extremely low. In Arm we use the longest range, most common Thumb2 branch range of 16Mb to place the Thunk Sections and rely on the tidy up part in // No suitable ThunkSection exists. for the shorter-range 16Mb sections. I think that strategy could be used for PPC and would have roughly 40 ThunkSections in a 1.3Gb text section.

If there are many 32k conditional branches in the program then I think the original strategy of not using the ThunkSectionSpacing is the right way to go, there just won't be enough reuse opportunity to make it worthwhile.

MaskRay added a comment.EditedMay 8 2019, 2:08 AM

I was trying to find a good value used in
uint32_t PPC64::getThunkSectionSpacing() const { return 0x8000 - 0x1000; } before I noticed this caused an extreme slowdown in thunk creation.

We have a large program whose text segment is of 1.3GiB. Its OutputSection .text consists of 1948775 InputSections. It takes 20 seconds to link but with this patch it doesn't stop in 10 minutes.

The reason is that:

// O(|ThunkVec|). Not the critical path, but there is a bottleneck, too. Many ThunkVec's have hundreds/thousands of elements
            std::tie(T, IsNew) = getThunk(*Rel.Sym, Rel.Type, Src); 

            if (IsNew) { // called 112353 times
              // Find or create a ThunkSection for the new Thunk
              ThunkSection *TS;
              if (auto *TIS = T->getTargetInputSection())
                TS = getISThunkSec(TIS);
              else
                TS = getISDThunkSec(OS, IS, ISD, Rel.Type, Src); // it takes O(|ISD->ThunkSections|) time, `ISD->ThunkSections` is large (25980) when getThunkSectionSpacing is enabled.
              TS->addThunk(T);
              Thunks[T->getThunkTargetSym()] = T;
            }

Looking at the comment:

//      - b[l,a]  PPC64_REL24 range [33,554,432...33,554,428]
//      - bc[l,a] PPC64_REL14 range [-32,768...32764]
// We take the most strict range and intentionally use a lower
// size than the maximum branch range so the end of the ThunkSection
// is more likely to be within range of the branch instruction that 
// is furthest away.

I strongly recommend taking the longer range (32 Mb) for the ThunkSection spacing. If PPC is anything like Arm, I would be very surprised to see a lot of inter-section conditional branches, and with a range that low the chances of finding a reuse spot within 32k seems extremely low. In Arm we use the longest range, most common Thumb2 branch range of 16Mb to place the Thunk Sections and rely on the tidy up part in // No suitable ThunkSection exists. for the shorter-range 16Mb sections. I think that strategy could be used for PPC and would have roughly 40 ThunkSections in a 1.3Gb text section.

If there are many 32k conditional branches in the program then I think the original strategy of not using the ThunkSectionSpacing is the right way to go, there just won't be enough reuse opportunity to make it worthwhile.

I just wanted to make the same comment :) I think 0x8000 is too small. One of the bottlenecks is D61666, but it doesn't fix the problem.

The second bottleneck is at:

for (Thunk *T : *ThunkVec)
  if (T->isCompatibleWith(Type) && ////// many R_PPC64_REL24 branches can't be satisfied by the initially created thunks 
      Target->inBranchRange(Type, Src, T->getThunkTargetSym()->getVA()))
    return std::make_pair(T, false);
uint32_t PPC64::getThunkSectionSpacing() const {
  return 0x2000000; // good choice, but it should be decreased a bit
}

Then, I think D61629 is still useful. I'll revive it.

If PPC is anything like Arm, I would be very surprised to see a lot of inter-section conditional branches,

Yeah, in my experience theses are rare compared to the normal unconditional branch.

uint32_t PPC64::getThunkSectionSpacing() const {
  return 0x2000000; // good choice, but it should be decreased a bit
}

Since thunks sections will still get added closer if the pre-allocated ones can't satisfy a branch, and the number of pre-allocated sections drastically affects our link time maybe go with this (or maybe even slightly larger?).

We have a large program whose text segment is of 1.3GiB. Its OutputSection .text consists of 1948775 InputSections. It takes 20 seconds to link but with this patch it doesn't stop in 10 minutes.

I think you are in a unique situation where you have some *really good* test input for investigating how pre-allocating the thunks affects link time, do any of these larger programs have open-source equivalents? Most of the stuff I used for testing the thunk creation was either shared libraries or executable based on llvm/clang (for long-branch thunks). I would expect the shared libraries to use most of the thunk sections we create, but would love to be able to look at some larger executables to see how often the sections are used there. We might be better off not setting this value on PPC like Peter suggests.

lld/test/ELF/ppc64-branch-thunkspacing.s
19

Should we have a simple test simply checking the thunk section spacing for this patch and then have a follow on patch where we fix the thunk insertion to not insert into the middle of .init and .fini sections and test the multiple .init sections there?

adalava updated this revision to Diff 198679.May 8 2019, 8:59 AM

increase to PPC64_REL24 range

adalava marked 2 inline comments as done.May 8 2019, 9:00 AM
adalava updated this revision to Diff 198688.May 8 2019, 9:44 AM

update testcase to use readelf instead of objdump, so it doesn't depend on D61647 anymore

adalava added inline comments.May 8 2019, 10:15 AM
lld/test/ELF/ppc64-branch-thunkspacing.s
19

Sure. I'm not sure on how to check that spacing was applied. I know empirically the offsets are different, so I did this test.
Any guidance on how to improve this test is welcome.

@adalava We care about the link time. I need to do some testing to ensure this doesn't cause regression. I've created https://reviews.llvm.org/D61720, fixed the tests and added ppc64-long-branch-init.s that I think captures the idea of the fix. Hope you don't mind my commandeering the change :)

lld/test/ELF/ppc64-branch-thunkspacing.s
19

I think you can extract the .init section of crti.o crtbegin.o crtend.o crtn.o, and so something like:

.section .init,"ax",@progbits,unique,1
...

.section .init,"ax",@progbits,unique,2
...

.section .init,"ax",@progbits,unique,3
...

.section .init,"ax",@progbits,unique,4
...
adalava abandoned this revision.May 9 2019, 9:22 AM

@adalava We care about the link time. I need to do some testing to ensure this doesn't cause regression. I've created https://reviews.llvm.org/D61720, fixed the tests and added ppc64-long-branch-init.s that I think captures the idea of the fix. Hope you don't mind my commandeering the change :)

Np MaskRay, thank you for improving the patch. Abandoning this review in favor of D61720.