This is a feature that MS link.exe lacks; it currently errors out on such relocations, just like lld did before.
This allows linking clang.exe for ARM - practically, any image over 16 MB will likely run into the issue.
Differential D51089
[LLD] [COFF] Add support for creating range extension thunks for ARM mstorsjo on Aug 22 2018, 2:14 AM. Authored by
Details
This is a feature that MS link.exe lacks; it currently errors out on such relocations, just like lld did before. This allows linking clang.exe for ARM - practically, any image over 16 MB will likely run into the issue.
Diff Detail Event TimelineComment Actions Just FWIW, this change plus D51032 seems to be enough to build a working (at least for small trivial examples) clang+lld for ARM/Windows (AArch64 seems to work just fine as is). Comment Actions I've left a couple of comments where I think that there may be some unintentional inefficiencies, but as far as I can tell it looks like it will be correct. I suggest adding a lot more test cases as you go for things like thunk reuse.
Comment Actions Thanks for taking a look! Yup, some more tests definitely would be good. Do you have any suggestions on how to test things like this efficiently without creating >16 MB binaries? The existing tests (that used to check for errors) just use absolute symbols as targets to make it fail. I see that the ELF tests either do padding with .space or huge alignment with .balign - I guess something like that would work here as well. And I see that the existing ELF tests also produce huge binaries (although they hopefully are stored sparsely).
Comment Actions Sadly I had to resort to using .space to create large binaries. Creating the binaries is usually quick, disassembling them is unfortunately not. I tended to used gnu objdump first as that skips 0 by default to find the address ranges I needed, then used --start-address and --stop-address in llvm-objdump to pull out the bits that I need. In ELF we can use Linker Scripts for quite a few of the cases; although Linker Scripts make some corner cases possible that wouldn't exist otherwise. You may be able to do most of your tests with the conditional branch (+- 1 Mb), I didn't do that with ELF as the ThunkSections were placed at 16 Mb intervals as the vast majority of the relocations we would encounter were +- 16Mb.
Comment Actions Thanks for the feedback! I'll fold this into the next iteration of the patch; I have a bunch of other improvements planned.
Comment Actions Updated taking @ruiu and @peter.smith's feedback into account. I still haven't added more tests though, so that's still a clear todo, but reposting for more potential feedback meanwhile. I updated the code to keep the thunk maps between passes, which leads to much fewer additions in later passes (originally I could occasionally hit up to 8 passes before things were done, now it gets done in 3 passes), and fixed code to avoid chaining thunks in case the originally chosen thunk went out of range. Comment Actions Added a pretty complete testcase testing most aspects of the algorithm that is implemented, added a testcase for when unable to fix range issues with thunks.
Comment Actions Changed the thunk implementation to a shorter, non-interworking, form as suggested by @efriedma.
Comment Actions Split out readRelocTargets() to a separate method, which is called in parallel. This reduced the performance drop quite a bit, now the difference is much smaller. In this test round, the fastest run for the original version was 395 ms, after this patch the fastest run was 401 ms. So in this case, the slowdown is around 2% which hopefully is more acceptable. Comment Actions @ruiu - Do you have time to proceed with this one? Is the performance regression, which now is smaller thanks to your suggeation, acceptable? Or should I try other alternatives which make messier code with different codepaths for architectures that don't need thunks? The thunk algorithm here should be good to go now (no longer RFC level), in case @peter.smith wants to have another look (it's just a few minor improvements over the original one, which was more or less ok'd). Comment Actions The algorithm looks good to me. For ELF I preferred to give the Thunk a name related to the destination as it makes it a bit easier to follow disassembled binaries, but it is not essential. Comment Actions Sorry for the belated response. I was thinking of this patch for a while. Every time I saw the code of thunk range extension, I wonder if we really need this multi-pass algorithm which add thunks iterative on each pass. I believe in almost all cases, the algorithm finishes on the first iteration, if we allow a very small margin when determining "reachability". As long as a margin is small, size increase by allowing a margin should be negligible. For pathetic executables for which we need to generate tons of thunks (which enlarges distance between callers and callees and thus need multiple passes with the current algorithm), we can simply discard everything that we made in the previous iteration instead of keeping them, double the margin, and then try again from scratch. In practice, I believe that fallback doesn't happen too frequently. What do you think of the algorithm? If it works, I prefer that algorithm because discarding everything and redo with a larger margin is simpler than keeping thunks created in previous passes.
Comment Actions It can work; I worked on a proprietary linker for embedded systems that used that algorithm, It worked well enough 99% of the time. It could fail with nasty corner cases though, for example increasing the margin means more calls go out of range which leads to more thunks etc. Having said that I suspect that won't be a problem for COFF as most of the failures were a combination of a strange linker script and number of Thunks (Thumb branch range used to be 4 Megabytes, so there could be Thousands of thunks in a large project). Comment Actions I actually thought about that before, but either didn't think it through properly or didn't feel it was necessary - but by adding that in this current design I can make it succeed after the first pass already (previously it required two passes adding thunks on my testcase).
The approach you describe feels a bit fragile. Unless you're really sure the margin tradeoff is right and it will be done on the first pass in real-world cases, it'll degrade pretty badly. I was going to try to give this an honest and objective try, but I don't feel it will be much simpler. For your suggestion, we would need to have two kinds of loops over the chunks - one loop which checks whether thunks are needed with margin, adding them as necessary, and a second loop which checks if all relocations now are in range (not trying to add any thunks, but just aborting the loop, then resetting everything back to the original state and starting over. While the current code (which also is very similar to the corresponding ELF thunk code) does verification at the same time as runs a new pass trying to add more thunks if needed. If no more were needed, the algorithm was done.
Comment Actions Optimized the algorithm further by checking ranges with a margin in the first pass, making it succeed after the first run in my testcase. Shortened a variable name and added whitespace as @ruiu suggested. The testcase isn't updated after the last adjustments though - it's pretty much work to hand craft a testcase which triggers as many of the cornercases of the algorithm, I'll update it once we settle on the algorithm to choose. Since using a margin for adding thunks, I think it will be extremely tedious to actually create a testcase which would trigger more than one pass. Comment Actions In order to get sensible test coverage, I could make the margin (used only in the first pass) configurable (somehow), and run the multipass test with a very small margin. Even with the other approach suggested by @ruiu, testing of the case when one pass isn't enough would require a huge, pathological test. Comment Actions If I understand it, to get a test case we'd need to have a branch that is just in range (including margin) such that no thunk is generated, but adding sufficient thunks causes that branch to go out of range. The brute force way to do it would be to generate greater than (margin/thunk-size) thunks but even with macros that would be a large tedious test to write. One possibility in ELF that I don't know would transition into COFF is to have some sections with high alignment so that inserting a thunk could displace one of these sections off an alignment boundary and hence add much more size than just the Thunk. Comment Actions Interesting. I think that the implementation you have here will converge faster as it makes it more likely that pass 1 has all the thunks, at the expense of potentially generating more thunks than is strictly required. However if the goal is simplicity I think that you'll need to do everything in one pass, and accept that there will be corner cases that might not link if the margin isn't sufficient. For ELF and arbitrary linker scripts I thought the chance of failure too high, for COFF the chance of failure may be low enough. I think the most likely edge case in COFF will be the presence of sections with high alignment requirements as inserting thunks could cause a lot of bytes of alignment padding to be added. Comment Actions My apologies I clicked on the wrong link in the email, I should have been looking at D52156. Please ignore the past comment as it won't make much sense. The comment about writing the test might still make sense. |
Maybe something like RelocTargets is better? Symbols looks like it represents symbol table contents.
Please add a comment to explain why we want to cache relocation targets in this table (i.e. we need to modify relocation targets when a relocation is redirected to other symbol due to thunk insertion.)