Page MenuHomePhabricator

[WIP][RISCV][ELF] Linker relaxation support
Needs ReviewPublic

Authored by jrtc27 on Apr 7 2020, 4:52 PM.

Details

Reviewers
espindola
MaskRay
Summary

This is a proof-of-concept implementation for R_RISCV_ALIGN that I've had lying
around for a while. I just rebased it, but it has not been build tested, let
alone run. It was previously able to link a working FreeBSD kernel, though, and
compared bit-for-bit identical to an -mno-relax build. I'm posting this here as
it came up on IRC and I figured it would be best to share it and avoid
duplicating work (and perhaps also promoting a bit more discussion about how to
implement this correctly in LLD; the mutableData/makeMutableDataCopy is a bit
of a gross hack that could potentially go wrong if consumers of data hang on to
now-stale pointers, and it would be nice if we could instead just
mprotect/mremap/etc the mapping and let the OS CoW as needed whilst retaining
the same address, although given I work on CHERI and CHERI has minor issues
with the mprotect/mremap interface _increasing_ permissions, perhaps I should
not be pushing for that).

Diff Detail

Event Timeline

jrtc27 created this revision.Apr 7 2020, 4:52 PM
MaskRay added a comment.EditedApr 7 2020, 6:07 PM

Interestingly, I was thinking about the same thing on Saturday! I wanted to
add optimizeBasicBlockJumps() in a proper place (D68065; basic block
sections). The current place
is wrong for a thunk target (e.g. AArch64). I did not write code because I am
still unsure how to properly do linker relaxations.

Title: It should mention R_RISCV_ALIGN.

I believe the current framework can only handle the BFD counterpart of
relax_pass==2 (_bfd_riscv_relax_align). Other instruction rewriting may require
intertwined scanRelocations and finalizeAddressDependentContent. The relocation
scanning pass may have to be split, but I don't know whether the boundary is.

It has occurred to me that splitting the relocation scanning pass may be good for

The above is basically limitation of a linker with one-pass relocation scanning. To have a complete support we may have to bite the bullet. More bookkeeping may
be needed. InputSectionBase::relocations will not be sufficient.
finalizeSynthetic() may be called more than once.

An incomplete list of passes we do after scanRelocations:

forEachRelSec(scanRelocations<ELFT>);
add symbols to in.symTab and partitions[*].dynSymTab
removeUnusedSyntheticSections()
sortSections()
finalizeSynthetic(in.*)
fixSectionAlignments()
finalizeAddressDependentContext // 
finalizeSynthetic(in.symTab)
finalizeSynthetic(in.ppc64LongBranchTarget) // conceptually it should be done after thunks are finalized

We need to move scanRelocations() as late as possible and move some passes (including finalizeSynthetic()) into finalizeAddressDependentContent(). These passes need to refactored to work if called more than once.

The finalizeAddressDependentContent() should be changed to several rounds of iterations (relax_pass). The last round handles R_RISCV_ALIGN.

A bit off-topic. For RISC-V's (ab)use of linker relaxations, my feeling is still complex. It is indeed a very convenient approach toward a good balance of code size/speed/convenience, but if we want to achieve more, some post-link time optimization frameworks may be more suitable. I don't really have enough experience with link time optimization but my understanding is that we currently use the term link time optimization (especially in the LLVM context) for optimizations performed on the LLVM IR level. Those low-level machine representations are not categorized as LTO.

lld/ELF/Arch/RISCV.cpp
551

llvm::erase_if

lld/ELF/Writer.cpp
1674

This loop should be merged with the previous for loop.

1679

Check SHF_EXECINSTR

+@grimar @psmith @ruiu

I know Peter is super busy and has very little time expendable on LLVM... Still, I am eager to hear whether we should move from one-pass relocation scanning to multi-pass relocation scanning. Maybe simple evidence from other linkers may give me more confidence to think more in this area.

grimar added a comment.Apr 8 2020, 1:57 AM

Still, I am eager to hear whether we should move from one-pass relocation scanning to multi-pass relocation scanning.

FTR, relocation scanning is an expensive path and in the past the general direction was to avoid doing it multiple times. I.e. my concern is perfomance.

psmith added a comment.Apr 8 2020, 3:49 AM

Will have to have a think about this in more detail over the Easter Weekend.

I do have some experience with relocations being scanned multiple times. As George mentions, performance, particularly for very large programs with millions of relocations is a concern, especially for a linker that is most attractive to its user base for high performance. The performance impact can be mitigated by only doing what you need when scanning the relocations early. For example in a debug build the number of relocations vastly outweighs the number of non-debug relocations yet we are unlikely to need to scan them early. The downside of this approach is that we have a large non-local dependency between the relocation scans which can make the implementation fragile. For example it is easy to forget that something needs to be done in scan pass 1, only to see some problem come up with scan pass 2 that depended on it. Overall we'll only know by measuring on several large programs.

Will have to have a think about this in more detail over the Easter Weekend.

I do have some experience with relocations being scanned multiple times. As George mentions, performance, particularly for very large programs with millions of relocations is a concern, especially for a linker that is most attractive to its user base for high performance. The performance impact can be mitigated by only doing what you need when scanning the relocations early. For example in a debug build the number of relocations vastly outweighs the number of non-debug relocations yet we are unlikely to need to scan them early. The downside of this approach is that we have a large non-local dependency between the relocation scans which can make the implementation fragile. For example it is easy to forget that something needs to be done in scan pass 1, only to see some problem come up with scan pass 2 that depended on it. Overall we'll only know by measuring on several large programs.

Apologies not got a lot more to add at the moment.

the mutableData/makeMutableDataCopy is a bit of a gross hack that could potentially go wrong if consumers of data hang on to now-stale pointers,

If we were single threaded I think the risk of holding on to stale pointers is low and probably could be managed with warnings in comments and code-review. I think our most likely cause of problem would be during multithreaded parts of the program. if Thread 1 gets an address to the read-only data, just after but before Thread 1 has finished, Thread 2 makes a mutableCopy, with possibilities of inconsistencies. I've not got any great suggestions on how to fix this. One way might be to make the interface a bit more explicit, for example any section containing relaxations, even if they aren't relaxed get copied.

The current place is wrong for a thunk target (e.g. AArch64). I did not write code because I am still unsure how to properly do linker relaxations.

One thing that concerns me is convergence. In finalizeAddressDependentContent() we have had to be careful to avoid the edge case where simultaneously adding and subtracting content end up with oscillations and non-convergence. I think that we've managed that so by locking the size of some sections like the Thumb relocations. My guess is that architectures will choose Thunks or Relaxations but can't easily support both at the same time, with Thunks adding content and Relaxations shrinking it. I do hope that RISCV doesn't end up having to write the equivalent of the Erratum patches, I guess there would be an implementation that could use relaxations.

We need to move scanRelocations() as late as possible and move some passes (including finalizeSynthetic()) into finalizeAddressDependentContent(). These passes need to refactored to work if called more than once.

After investigating https://bugs.llvm.org/show_bug.cgi?id=44824 .ARM.exidx needs this right now as a linker script can have non-monotonically increasing VMA despite having a monotonically increasing sectionIndex. This leads to an incorrect order in the .ARM.exidx table. I'll start work on pr44824 next week.

Hello, thanks for uploading this, it s great to see more work in the same area. I have also been working on relocation relaxation in LLD. I have been focussing on relaxing R_RISCV_CALL relocations for now. My approach is not too dissimilar from yours. I plan to upload a similar patch in the future.

lld/ELF/Arch/RISCV.cpp
37

This needs to be re-added.

jrtc27 marked an inline comment as done.Apr 12 2020, 3:44 PM
jrtc27 added inline comments.
lld/ELF/Arch/RISCV.cpp
37

Oops, yes, obviously wrong. As I said, this was just a quick-and-dirty rebase on top of several months of upstream changes and I haven't even tried building it, but was uploaded for others to take if they so desire (especially the deleteRanges as doing that efficiently is nasty code to write and debug; bfd just takes the simple-but-inefficient quadratic approach...).