Page MenuHomePhabricator

[WIP][LLD][RISCV] Linker Relaxation
Needs ReviewPublic

Authored by gkm on Apr 20 2021, 2:54 AM.

Details

Summary

This patch fully implements linker relaxation for RISC-V including relaxation for R_RISCV_CALL, R_RISCV_HI20/LO12, R_RISCV_PCREL_HI20/LO12 and handling for R_RISCV_ALIGN. Just for reference/link, there were some previous efforts/discussion to implementation linker relaxation in D77694 and D79105.

As linker relaxation is highly specific to RISC-V at the moment, most of the work are done in the Target::finalizeContents() function and isolated from generic code. For now I'm avoiding trying to come up with a common relaxation framework for multiple targets as their needs may be greatly different.

The relaxation process is split into several passes:

  • For each executable input section, search its relocation vector for R_RISCV_RELAX and determine how to relax the previous relocation.
  • If the relocation is R_RISCV_CALL (auipc+jalr pair), try to relax to jal or c.jal if the jump target is in range. This assumes that the PC-relative offset can only become smaller during the relaxation process.
  • If the relocation is R_RISCV_HI20/LO12 (absolute addressing) and the target symbol can be addressed from __global_pointer$ (defaulted to .sdata+0x800), delete the lui and rewrite the lo part to use gp as source register.
  • If the relocation is R_RISCV_PCREL_HI20/LO12, this requires two-pass to relax as PCREL_LO12 links to its PCREL_HI20 for address calculation and they may appear in arbitrary order in an input section. To preserve the look up from PCREL_LO12 the first pass relaxes/deletes`PCREL_LO12` and the second pass handles PCREL_HI20. The implementation is simpler than that in bfd because lld doesn't allow addends in PCREL_LO12 (https://github.com/riscv/riscv-elf-psabi-doc/issues/184), and so both parts can be relaxed independently.
  • The range of bytes are deleted after processing relaxation for a whole input section. This requires adjusting section content, symbol addresses/sizes and relocation offsets, which is handled in InputSectionBase::deleteRanges. Compared to other approaches we use a algorithm that is not quadratic by sorting symbols, relocations and bytes to be deleted by offset.
  • After relaxation, handle alignment as symbols addresses are now fixed modulo section alignment. This is always enabled regardless of the --relax option as this is required for correctness.

There are still some issues to solve:

  • --emit-relocs will be broken on relaxed section as currently it just copies the corresponding .rela section verbatim from input. It needs to be fixed to build relocation entries from the section's relocation vector.
  • The patch adds two additional RelExpr (R_RISCV_GPREL and R_RELAX_HINT) which unfortunately makes it unable to fit into a 64-bit mask, so it is now changed back to sequential checking. I think it should be somehow split into target-independent (R_ABS, R_PC, ...) exprs and target-dependent (R_RISCV_..., R_MIPS_...) exprs which can overlap in numeric range, but it is probably out of the scope of this patch.

Diff Detail

Event Timeline

PkmX created this revision.Apr 20 2021, 2:54 AM
PkmX requested review of this revision.Apr 20 2021, 2:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2021, 2:54 AM

Why write your own deleteRanges? I took great care in writing mine to ensure there were no hidden quadratic complexities, but your simpler version has introduced them. Just take mine, unless there is a problem with it nobody's mentioned. Same goes for mutableData being a less efficient version of mine.

I also don't like making this overly target-specific. We should take the time to get the interface right, not just punt on it and shove it all into target-specific code.

lld/ELF/Arch/RISCV.cpp
643

Can we not just keep two iterators around? That feels nicer to me than careful +1s everywhere (e.g. it's not immediately obvious this won't crash on the first element until you look at the start point of the loop).

663–664

This condition seems unnecessary

712

When would the addend ever be 0? That seems like invalid input.

759–760

Why a limit of 5? It is guaranteed to converge, I don't see why there needs to be a limit, and it'll probably take just 1 or 2 almost all of the time.

766

Not for relocatable

lld/ELF/InputSection.cpp
180–181

This looks quadratic

201–202

As does this

212–213

And this

lld/ELF/InputSection.h
167

Make the relevant fields mutable and mark this const, like data()

178

std::map seems like a high-overhead way to store what is just a sorted list of intervals.

256

Why do we need a whole new pointer when rawData is also being updated?

lld/ELF/Relocations.cpp
129

(a) this is unrelated to the diff
(b) this completely removes the whole point of oneof

PkmX added inline comments.Apr 20 2021, 8:36 PM
lld/ELF/Arch/RISCV.cpp
663–664

It's necessary as processRelaxations needs to access two iterators which would be invalid if the container is empty. I will move this into processRelaxations which is where it should be in the next patch.

759–760

Will be removed in the next patch.

766

Will do in the next patch.

lld/ELF/InputSection.cpp
180–181

This isn't quadratic.

We are really looping over the symbols in this section while simultaneously accumulating the number of bytes deleted. Since symbols and delete ranges are sorted by offset, if we know X bytes are deleted before offset A, and B > A, we only have to add up delete ranges between A and B to know the total number of bytes deleted before B. In essence every symbol and delete range is only visited once in this loop.

lld/ELF/InputSection.h
178

I agree. std::vector<std::pair<uint64_t, uint64_t>> should be sufficient although it needs to be sorted for deleteRanges as the vector may not sorted with two-pass handling for PCREL_HI20.

256

Will change it to use a copiedData bool as in your patch.

lld/ELF/Relocations.cpp
129

As noted in the summary adding more RelExpr makes it unable to fit in a 64-bit mask, so this is temporary change to make this patch work for now.

There were already 64 RelExprs before this patch so there is no room for growing and I don't think it is workable in the long run. The only solution I can think of is split them into target specific RelExpr with some common exprs like R_ABS being shared, so as long as the target doesn't define more than ~20 RelExpr it should be fine. However this probably should belong to another patch.

PkmX added inline comments.Apr 21 2021, 1:01 AM
lld/ELF/Arch/RISCV.cpp
712

Yes, valid addends should only be 2^n-2 for n >= 2. With that said I don't think lld is really checking this anywhere, so malicious input could cause lld to crash.

lld/ELF/Relocations.cpp
129

Alternatively I think we can use two 64-bit masks, so this should work up to 128 RelExprs.

PkmX updated this revision to Diff 339181.EditedApr 21 2021, 4:38 AM
PkmX marked an inline comment as done and an inline comment as not done.
PkmX edited the summary of this revision. (Show Details)

This addresses some of the comments by @jrtc27:

  • Check for empty relocation vector is moved into processRelaxations
  • Remove the iteration limit for relaxation
  • Don't perform alignment handling for relocatable link
  • No more mutData in InputSectionBase, now it uses copiedData to check if a mutable copy has been made.
  • DeleteRanges is now a std::vector of offset and size pairs.
  • oneof now tests for two 64-bit masks depending on which subset (<64 or >= 64) the expr is in.
arichardson added inline comments.Apr 21 2021, 4:46 AM
lld/ELF/Relocations.cpp
132–135

This change should be a separate review. I would very much like support for > 64 RelExpr values to land upstream since we also had to make that change that for our CHERI LLD.

Why write your own deleteRanges? I took great care in writing mine to ensure there were no hidden quadratic complexities, but your simpler version has introduced them. Just take mine, unless there is a problem with it nobody's mentioned. Same goes for mutableData being a less efficient version of mine.

I also don't like making this overly target-specific. We should take the time to get the interface right, not just punt on it and shove it all into target-specific code.

+1 for a low time complexity implementation. The quadratic time complexity of the binutils approach may run into scalability problems.

I haven't closely looked into this yet, but it seems that R_RISCV_GPREL_* support can be contributed separately.

PkmX added a comment.Apr 26 2021, 7:58 PM

Why write your own deleteRanges? I took great care in writing mine to ensure there were no hidden quadratic complexities, but your simpler version has introduced them. Just take mine, unless there is a problem with it nobody's mentioned. Same goes for mutableData being a less efficient version of mine.

I also don't like making this overly target-specific. We should take the time to get the interface right, not just punt on it and shove it all into target-specific code.

+1 for a low time complexity implementation. The quadratic time complexity of the binutils approach may run into scalability problems.

I haven't closely looked into this yet, but it seems that R_RISCV_GPREL_* support can be contributed separately.

Do you mean the whole gp-relaxation support? As only implementing relocations for R_RISCV_GPREL seems pointless as they are only generated and consumed by the linker.

lld/ELF/Relocations.cpp
132–135

That's okay for me. I will split this out into a new patch.

xgupta added a subscriber: xgupta.Jul 31 2021, 6:44 AM

Ping, What is the plan for this revisoin @PkmX?

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 10:18 AM

I rebuilt LLVM with this patch and used it to build a non-public RISC-V binary, compiled with -Os + LTO. I found that it reduced code size of a ~450 kB application by over 28 kB (6.2%). That application is a mixture of Rust and C code; the savings from rustc generated code and Clang generated code were similar. There are also accompanying performance improvements thanks to the removed unnecessary instructions.

It would be really nice to see this patch merged given the magnitude of the improvements.

gkm added a subscriber: gkm.Mar 28 2022, 11:01 AM

Thanks for the patch. It would be great to finally have linker relaxations for RISC-V in LLD.

This patch fully implements linker relaxation for RISC-V

Well, it doesn't implement things like:

  lui a0, %hi(x)
  addi a0, a0, %lo(x)
->
  addi a0, zero, %lo(x)

For &lo that fits in the 12 bit addi immediate. BFD implements that.

Likewise for auipc+addi with %pcrel_*. It's only implemented for gp / __global_pointer$, not for x0.

The patched LLD also doesn't seem to support relaxations for an explicitly defined __global_pointer$. Whether it's supposed to or not, I think that case should be included in the tests.

luismarques added inline comments.Apr 12 2022, 7:11 AM
lld/ELF/Relocations.cpp
132–135

we also had to make that change that for our CHERI LLD.

That's okay for me. I will split this out into a new patch.

I ran into the same problem in another context. It would be good to have some guidance about what's the best way forward to address this for the long term (e.g. two 64-bit masks?). @ruiu?

@PkmX Are you still planning to submit your separate patch for this mask issue? And, more broadly, to update this LLD RISC-V relaxation patch?

This was discussed in the RISC-V sync-up call. As suggested by Alex, I'm sharing in writing here some of the things I mentioned or were discussed in the call:

  • I found a linking failure issue with this patch. I hadn't yet reported that here because I am still in the process of reducing and analyzing the problematic case.
  • We touched on the topic of relaxation performance and I shared this graph of my benchmarking results of LLD with this patch compared with BFD:
  • @gkm reported that he was also analyzing/working on this. If anybody else is planning to work on this it would be good to coordinate with him and the rest of the community, to avoid duplicate work.
lld/ELF/Arch/RISCV.cpp
502

This is also called for R_RISCV_CALL_PLT

I have also planned to pick up the work. I think binutils is trying to improve their time complexity, and we should just aim for the best time complexity initially.

I have also planned to pick up the work. I think binutils is trying to improve their time complexity, and we should just aim for the best time complexity initially.

Is this patch lacking in some way in terms of performance (absolute or asymptotic)?

I have also planned to pick up the work. I think binutils is trying to improve their time complexity, and we should just aim for the best time complexity initially.

Is this patch lacking in some way in terms of performance (absolute or asymptotic)?

Asymptotic. See https://reviews.llvm.org/D100835#2707537
It has been long time since I looked at this patch, I'll need to re-read :)

I have also planned to pick up the work. I think binutils is trying to improve their time complexity, and we should just aim for the best time complexity initially.

Is this patch lacking in some way in terms of performance (absolute or asymptotic)?

Asymptotic. See https://reviews.llvm.org/D100835#2707537
It has been long time since I looked at this patch, I'll need to re-read :)

I think that’s been addressed?

  • I found a linking failure issue with this patch. I hadn't yet reported that here because I am still in the process of reducing and analyzing the problematic case.

It was a big pain to do the initial steps of the reduction (I had to start with the object files) but, eventually, I managed to do it and from there I reduced it further. Here's a simpler version of the problematic case:

$ cat test.s
.global a
a:
tail a
.rept 2008
.byte 0
.endr

$ clang --target=riscv64 -c test.s
$ ld.lld -shared test.o
ld.lld: error: test.o:(.text+0x0): relocation R_RISCV_RVC_JUMP out of range: 1028 is not in [-1024, 1023]; references a
>>> defined in test.o

It was a big pain to do the initial steps of the reduction (I had to start with the object files) but, eventually, I managed to do it and from there I reduced it further. Here's a simpler version of the problematic case:

BTW, the issue seems to be that the call is going through the PLT but the relaxation offset check assumes that it's jumping directly to the symbol so it incorrectly concludes that it's within bounds to relax to a compressed jump. (It seems like we could avoid going through the PLT anyway. If so, it would be good to optimize that, as that's a preexisting behavior).

a could be preempted, so no, it has to go via the PLT for that specific example.

Incidentally, this issue of relaxation correctness is why I want the optimisation relaxations separated out from the alignment support that’s required; I can see the optimisations otherwise holding up supporting but not optimising -mrelax code.

Incidentally, this issue of relaxation correctness is why I want the optimisation relaxations separated out from the alignment support that’s required;

That sounds sensible.

gkm commandeered this revision.Apr 21 2022, 3:33 PM
gkm added a reviewer: PkmX.

Original author @PkmX has been inactive here for a year, and did not respond to email.

gkm edited reviewers, added: luismarques; removed: PkmX.Apr 21 2022, 3:34 PM
gkm updated this revision to Diff 424315.EditedApr 21 2022, 3:35 PM

Rebase to get 1 year of LLVM changes

reames added a subscriber: reames.Mon, Apr 25, 9:29 AM