This is an archive of the discontinued LLVM Phabricator instance.

[LLD] [COFF] Add support for creating range extension thunks for ARM
AbandonedPublic

Authored by mstorsjo on Aug 22 2018, 2:14 AM.

Details

Summary

This is a feature that MS link.exe lacks; it currently errors out on such relocations, just like lld did before.

This allows linking clang.exe for ARM - practically, any image over 16 MB will likely run into the issue.

Diff Detail

Event Timeline

mstorsjo created this revision.Aug 22 2018, 2:14 AM

Just FWIW, this change plus D51032 seems to be enough to build a working (at least for small trivial examples) clang+lld for ARM/Windows (AArch64 seems to work just fine as is).

I've left a couple of comments where I think that there may be some unintentional inefficiencies, but as far as I can tell it looks like it will be correct. I suggest adding a lot more test cases as you go for things like thunk reuse.

COFF/Writer.cpp
372

Assuming getRVA() is the virtual address of the symbol, is Target->getRVA() stable between passes? Presumably if thunks are inserted then assignAddresses() may cause some symbols to change address? I'm not too familiar with the COFF code base so I could be missing something here.

If I'm right the reuse between passes may not work as well as it could do.

424

In theory if you are iterating a fixed number (Chunks.size()) of the Chunks vector then inserting thunks into the Chunks vector in the same loop will mean that Chunks near the end may not be scanned for Thunks. Given that the algorithm will only terminate when 0 thunks are inserted you'll eventually scan all of them but it may cost you more passes than you would need if you inserted all Thunks in one go. I think you'll be unlikely to hit 10 passes without a contrived test case though.

I've left a couple of comments where I think that there may be some unintentional inefficiencies, but as far as I can tell it looks like it will be correct. I suggest adding a lot more test cases as you go for things like thunk reuse.

Thanks for taking a look!

Yup, some more tests definitely would be good. Do you have any suggestions on how to test things like this efficiently without creating >16 MB binaries? The existing tests (that used to check for errors) just use absolute symbols as targets to make it fail. I see that the ELF tests either do padding with .space or huge alignment with .balign - I guess something like that would work here as well. And I see that the existing ELF tests also produce huge binaries (although they hopefully are stored sparsely).

COFF/Writer.cpp
372

No, the RVAs aren't stable between passes, but we don't keep the map between passes either; it's a local variable in createThunks below.

I guess it would be useful to allow this to find a different thunk from the previous pass (that would also reduce the amount of changes in later passes, reducing the number of passes required before it converges), but that's not implemented (yet).

424

Hmm, I'm not quite sure I understand what you mean here. You mean that since I'm adding more elements to the Chunks vector, I'd miss the last few ones that were pushed forward? The limit on the outer loop, on line 379, explicitly checks for Chunks.size(), so it will loop until the very end of the vector, even if Chunks grows meanwhile.

Sadly I had to resort to using .space to create large binaries. Creating the binaries is usually quick, disassembling them is unfortunately not. I tended to used gnu objdump first as that skips 0 by default to find the address ranges I needed, then used --start-address and --stop-address in llvm-objdump to pull out the bits that I need.

In ELF we can use Linker Scripts for quite a few of the cases; although Linker Scripts make some corner cases possible that wouldn't exist otherwise. You may be able to do most of your tests with the conditional branch (+- 1 Mb), I didn't do that with ELF as the ThunkSections were placed at 16 Mb intervals as the vast majority of the relocations we would encounter were +- 16Mb.

COFF/Writer.cpp
424

Apologies, I had it my mind that Chunks.size() would only be calculated once per pass.

mstorsjo added inline comments.Aug 24 2018, 12:01 PM
COFF/Chunks.cpp
54

Does anyone have an opinion on the mechanism of overriding what symbol an individual reloc points to? Here I provide a full vector of symbols (which can't be initialized directly but after all symbols actually exist) - an alternative would be e.g. a DenseMap to only provide the individual symbols that are overridden. Or something else?

ruiu added inline comments.Aug 26 2018, 9:52 PM
COFF/Chunks.cpp
51–52

Can this happen?

53–54

Since this vector can be very large, it is perhaps better to call reserve().

54

I think this vector should be fine because this vector will be used very heavily and a vector lookup is extremely fast.

430–431

It is more straightforward to write this loop in as a plain old for loop instead of a range-based for loop with Counter.

COFF/Chunks.h
221

Maybe something like RelocTargets is better? Symbols looks like it represents symbol table contents.

Please add a comment to explain why we want to cache relocation targets in this table (i.e. we need to modify relocation targets when a relocation is redirected to other symbol due to thunk insertion.)

COFF/ICF.cpp
150 ↗(On Diff #161900)

I'd think this comment is not easy to understand if you don't know about this comment is in the context of ARM thunk support. Maybe we can just omit it?

Does ICF run after ARM thunk creation?

178 ↗(On Diff #161900)

Ditto.

COFF/Writer.cpp
355

nit: move this assert at the beginning of this function.

376

I don't know if your above comment is true, but if the one this loop is looking for is likely at the end of the vector, I'd search in the reverse order. I.e.

for (Defined *Sym : llvm::reverse(TargetThunks))
378

Can this just be return {Sym, false}?

379

nit: omit {}

383

I don't think you need to pass this ThunkCounter around; I'd define this as a static local variable in this function.

386

Error messages should start with a lowercase letter.

Shouldn't this be an assert? If this can be triggered by a valid user input, this error message should contain more info as to what is the problem.

388

Maybe return {D, true}

392

std::map is an ordered map and usually much slower than DenseMap, so please use DenseMap.

401

nit: please insert a blank line before a comment.

422

Nesting is too deep. Please consider splitting to multiple functions.

445

Adding -> adding

Thanks for the feedback! I'll fold this into the next iteration of the patch; I have a bunch of other improvements planned.

COFF/Chunks.cpp
51–52

I don't think so - perhaps I should make it an assert. I had to insert a call to finalizeContents() in relocateDebugChunk() in PDB.cpp though, so I wanted to make sure.

53–54

Indeed, I'll make it use reserve in the next iteration.

430–431

Ok, will do.

COFF/ICF.cpp
150 ↗(On Diff #161900)

Yeah, it's probably best to omit the comment. ICF runs before the thunk creation.

The original reason for the comment was that I wanted to replace all uses of coff_relocation::SymbolTableIndex with the Symbols vector (outside the case when initializing the vector), to make things consistent, but it wasn't practical in all cases. In reality it mostly is necessary in SectionChunk::writeTo() and SectionChunk::getBaserels().

COFF/Writer.cpp
383

I later figured out I didn't need to name the thunk symbol at all; I just create the DefinedSynthetic directly without a name, and don't add it to Symtab - just like we do with object file local symbols.

386

I managed to remove this error altogether by not adding the thunk symbols to the symbol table.

mstorsjo added inline comments.Aug 27 2018, 12:25 AM
COFF/Writer.cpp
376

That's a nice idea. I'm changing code to keep the mapping of existing thunks stable across passes, so then the comment doesn't apply quite as much as before. I tested this and it didn't really save any measurable runtime, but it might be worthwhile anyway, as long as we iterate through the whole vector.

378

Indeed, will simplify the syntax of these.

mstorsjo added inline comments.Aug 27 2018, 4:11 AM
COFF/Chunks.cpp
51–52

Actually, yes, it does happen. finalizeContents gets called by assignAddresses, which gets called repeatedly when relayouting after adding thunks. (After realizing this, I had to make sure MergeChunk::finalizeContents works properly for this case as well.)

The alternative would be to add another callback to Chunk, which we'd call just once, when all symbols are available.

COFF/Writer.cpp
376

In practice with the case I'm testing, TargetThunks will only have 0 or 1 members, so it doesn't really matter much, but in general I guess it could be useful. (My testcase produces a 46 MB clang.exe, so it's only roughly twice as large as the branch range which is 16 MB.)

422

It's a bit hard to split this part to a separate method since it touches almost every single local variable from this method, but I can easily change the if (!isInRange()) into an if (isInRange()) continue; to reduce the nesting a little.

mstorsjo updated this revision to Diff 162652.Aug 27 2018, 4:32 AM

Updated taking @ruiu and @peter.smith's feedback into account. I still haven't added more tests though, so that's still a clear todo, but reposting for more potential feedback meanwhile.

I updated the code to keep the thunk maps between passes, which leads to much fewer additions in later passes (originally I could occasionally hit up to 8 passes before things were done, now it gets done in 3 passes), and fixed code to avoid chaining thunks in case the originally chosen thunk went out of range.

mstorsjo updated this revision to Diff 162943.Aug 28 2018, 1:40 PM
mstorsjo retitled this revision from [LLD] [COFF] [RFC] Add support for creating range extension thunks for ARM to [LLD] [COFF] Add support for creating range extension thunks for ARM.
mstorsjo edited the summary of this revision. (Show Details)

Added a pretty complete testcase testing most aspects of the algorithm that is implemented, added a testcase for when unable to fix range issues with thunks.

mstorsjo updated this revision to Diff 162948.Aug 28 2018, 1:51 PM

Missed the RelocTargets part of the diff in the previous update.

efriedma added inline comments.
COFF/Chunks.cpp
646

Can you use "add pc, ip" instead? That's not an interworking branch, but I think we can assume the target is Thumb mode here.

mstorsjo added inline comments.Aug 28 2018, 2:20 PM
COFF/Chunks.cpp
646

Oh, indeed, that'll make it as small as the other ones, while being PIC. Will update.

mstorsjo updated this revision to Diff 162956.Aug 28 2018, 2:21 PM

Changed the thunk implementation to a shorter, non-interworking, form as suggested by @efriedma.

ruiu added inline comments.Aug 28 2018, 11:28 PM
COFF/Chunks.cpp
50

We generally don't micro-optimize code, but for relocations, we do, because the number of relocations can be an order of tens of millions for large programs. Spending one more microsecond for each relocation adds up to one second if your program has one million relocations. This function is a bit concerning in that regard. Could you measure the performance impact?

Also, it looks odd that you do this in finalizeContents, as it doesn't correspond to finalizing contents. Perhaps this function should be given a new name.

COFF/Chunks.h
356–359

We generally don't define trivial accessors like this; instead, just define a member as a public one.

mstorsjo added inline comments.Aug 29 2018, 2:11 AM
COFF/Chunks.cpp
50

I tried measuring it, and I think it's making things slower, but it's mostly within measurement noise.

My testcase was linking a 66 MB clang.exe (for x86_64). Before this change, the fastest link was in 480 ms, after the change the fastest link was 520 ms. But in both cases the runtimes occasionally go up to over 600 ms. So it's not huge but I think it's consistently measurable.

Do you have any other suggestions on how to achieve this without affecting performance? Only use the RelocTargets vector if Machine==ARMNT? Or make it a DenseMap for overridden targets, which is empty for all other cases than when we have added thunks?

Yes, it's a bit odd with finalizeContents, maybe a new method initRelocTargets which just gets called once before we start doing assignAddresses?

COFF/Chunks.h
356–359

Ok, will include that change into the next update.

ruiu added inline comments.Aug 29 2018, 2:19 AM
COFF/Chunks.cpp
50

Of 520ms, it'd be interesting to know how much time this function is spending. gprof might help, but I'm not sure if it works on Windows. 480ms to 520ms isn't I think a marginal difference; it's 10% slowdown if the measurement is accurate.

One idea to make it faster (and could potentially be faster than it is now) is to parallelize it. I believe you can make it a separate function, say readRelocTargets, and call on all input sections in parallel. It should be safe to do because filling this new vector doesn't affect other threads.

mstorsjo added inline comments.Aug 29 2018, 2:58 AM
COFF/Chunks.cpp
50

I don't run things on Windows myself, I'm mostly working with cross compilation here.

I'll try making this a separate parallel pass and see what difference it makes!

mstorsjo updated this revision to Diff 163043.Aug 29 2018, 4:00 AM

Split out readRelocTargets() to a separate method, which is called in parallel. This reduced the performance drop quite a bit, now the difference is much smaller. In this test round, the fastest run for the original version was 395 ms, after this patch the fastest run was 401 ms. So in this case, the slowdown is around 2% which hopefully is more acceptable.

Split out readRelocTargets() to a separate method, which is called in parallel. This reduced the performance drop quite a bit, now the difference is much smaller. In this test round, the fastest run for the original version was 395 ms, after this patch the fastest run was 401 ms. So in this case, the slowdown is around 2% which hopefully is more acceptable.

@ruiu - any further comments? Is this form more acceptable?

Split out readRelocTargets() to a separate method, which is called in parallel. This reduced the performance drop quite a bit, now the difference is much smaller. In this test round, the fastest run for the original version was 395 ms, after this patch the fastest run was 401 ms. So in this case, the slowdown is around 2% which hopefully is more acceptable.

@ruiu - any further comments? Is this form more acceptable?

Ping @ruiu

Split out readRelocTargets() to a separate method, which is called in parallel. This reduced the performance drop quite a bit, now the difference is much smaller. In this test round, the fastest run for the original version was 395 ms, after this patch the fastest run was 401 ms. So in this case, the slowdown is around 2% which hopefully is more acceptable.

@ruiu - any further comments? Is this form more acceptable?

Ping @ruiu

Another ping for @ruiu

@ruiu - Do you have time to proceed with this one? Is the performance regression, which now is smaller thanks to your suggeation, acceptable? Or should I try other alternatives which make messier code with different codepaths for architectures that don't need thunks?

The thunk algorithm here should be good to go now (no longer RFC level), in case @peter.smith wants to have another look (it's just a few minor improvements over the original one, which was more or less ok'd).

The algorithm looks good to me. For ELF I preferred to give the Thunk a name related to the destination as it makes it a bit easier to follow disassembled binaries, but it is not essential.

ruiu added a comment.Sep 12 2018, 7:57 AM

Sorry for the belated response. I was thinking of this patch for a while.

Every time I saw the code of thunk range extension, I wonder if we really need this multi-pass algorithm which add thunks iterative on each pass. I believe in almost all cases, the algorithm finishes on the first iteration, if we allow a very small margin when determining "reachability". As long as a margin is small, size increase by allowing a margin should be negligible.

For pathetic executables for which we need to generate tons of thunks (which enlarges distance between callers and callees and thus need multiple passes with the current algorithm), we can simply discard everything that we made in the previous iteration instead of keeping them, double the margin, and then try again from scratch. In practice, I believe that fallback doesn't happen too frequently.

What do you think of the algorithm? If it works, I prefer that algorithm because discarding everything and redo with a larger margin is simpler than keeping thunks created in previous passes.

COFF/Chunks.cpp
642

This variable name seems a bit too long to my taste; I'd name ArmThunk or something like that, and that should be fine as long as this is a file-scope variable.

807–814

This part is not guarded by Finalized -- is that intended?

COFF/PDB.cpp
793

If you change this line to

cast<SectionChunk>(DebugChunk)->readRelocTargets();

then can you make readRelocTargets a non-virtual member function that belongs to SectionChunk?

COFF/Writer.cpp
381

Please insert a blank line before a multi-line comment.

Sorry for the belated response. I was thinking of this patch for a while.

Every time I saw the code of thunk range extension, I wonder if we really need this multi-pass algorithm which add thunks iterative on each pass. I believe in almost all cases, the algorithm finishes on the first iteration, if we allow a very small margin when determining "reachability". As long as a margin is small, size increase by allowing a margin should be negligible.

For pathetic executables for which we need to generate tons of thunks (which enlarges distance between callers and callees and thus need multiple passes with the current algorithm), we can simply discard everything that we made in the previous iteration instead of keeping them, double the margin, and then try again from scratch. In practice, I believe that fallback doesn't happen too frequently.

What do you think of the algorithm? If it works, I prefer that algorithm because discarding everything and redo with a larger margin is simpler than keeping thunks created in previous passes.

It can work; I worked on a proprietary linker for embedded systems that used that algorithm, It worked well enough 99% of the time. It could fail with nasty corner cases though, for example increasing the margin means more calls go out of range which leads to more thunks etc. Having said that I suspect that won't be a problem for COFF as most of the failures were a combination of a strange linker script and number of Thunks (Thumb branch range used to be 4 Megabytes, so there could be Thousands of thunks in a large project).

Sorry for the belated response. I was thinking of this patch for a while.

Every time I saw the code of thunk range extension, I wonder if we really need this multi-pass algorithm which add thunks iterative on each pass. I believe in almost all cases, the algorithm finishes on the first iteration, if we allow a very small margin when determining "reachability". As long as a margin is small, size increase by allowing a margin should be negligible.

I actually thought about that before, but either didn't think it through properly or didn't feel it was necessary - but by adding that in this current design I can make it succeed after the first pass already (previously it required two passes adding thunks on my testcase).

For pathetic executables for which we need to generate tons of thunks (which enlarges distance between callers and callees and thus need multiple passes with the current algorithm), we can simply discard everything that we made in the previous iteration instead of keeping them, double the margin, and then try again from scratch. In practice, I believe that fallback doesn't happen too frequently.

What do you think of the algorithm? If it works, I prefer that algorithm because discarding everything and redo with a larger margin is simpler than keeping thunks created in previous passes.

The approach you describe feels a bit fragile. Unless you're really sure the margin tradeoff is right and it will be done on the first pass in real-world cases, it'll degrade pretty badly.

I was going to try to give this an honest and objective try, but I don't feel it will be much simpler. For your suggestion, we would need to have two kinds of loops over the chunks - one loop which checks whether thunks are needed with margin, adding them as necessary, and a second loop which checks if all relocations now are in range (not trying to add any thunks, but just aborting the loop, then resetting everything back to the original state and starting over. While the current code (which also is very similar to the corresponding ELF thunk code) does verification at the same time as runs a new pass trying to add more thunks if needed. If no more were needed, the algorithm was done.

COFF/Chunks.cpp
642

Sure, I'll shorten it.

807–814

Yes - this is called from assignAddresses on each relayout, to propagate the current location in the layout.

COFF/PDB.cpp
793

That also requires changes like if (SectionChunk *SC = dyn_cast_or_null<SectionChunk>(C)) in readRelocTargets() in Writer.cpp. Or we move calling that to somewhere else, but then it's probably not quite as easy to parallelize.

mstorsjo updated this revision to Diff 165136.Sep 12 2018, 12:58 PM

Optimized the algorithm further by checking ranges with a margin in the first pass, making it succeed after the first run in my testcase. Shortened a variable name and added whitespace as @ruiu suggested.

The testcase isn't updated after the last adjustments though - it's pretty much work to hand craft a testcase which triggers as many of the cornercases of the algorithm, I'll update it once we settle on the algorithm to choose.

Since using a margin for adding thunks, I think it will be extremely tedious to actually create a testcase which would trigger more than one pass.

Since using a margin for adding thunks, I think it will be extremely tedious to actually create a testcase which would trigger more than one pass.

In order to get sensible test coverage, I could make the margin (used only in the first pass) configurable (somehow), and run the multipass test with a very small margin.

Even with the other approach suggested by @ruiu, testing of the case when one pass isn't enough would require a huge, pathological test.

If I understand it, to get a test case we'd need to have a branch that is just in range (including margin) such that no thunk is generated, but adding sufficient thunks causes that branch to go out of range. The brute force way to do it would be to generate greater than (margin/thunk-size) thunks but even with macros that would be a large tedious test to write. One possibility in ELF that I don't know would transition into COFF is to have some sections with high alignment so that inserting a thunk could displace one of these sections off an alignment boundary and hence add much more size than just the Thunk.

Interesting. I think that the implementation you have here will converge faster as it makes it more likely that pass 1 has all the thunks, at the expense of potentially generating more thunks than is strictly required. However if the goal is simplicity I think that you'll need to do everything in one pass, and accept that there will be corner cases that might not link if the margin isn't sufficient. For ELF and arbitrary linker scripts I thought the chance of failure too high, for COFF the chance of failure may be low enough. I think the most likely edge case in COFF will be the presence of sections with high alignment requirements as inserting thunks could cause a lot of bytes of alignment padding to be added.

Interesting. I think that the implementation you have here will converge faster as it makes it more likely that pass 1 has all the thunks, at the expense of potentially generating more thunks than is strictly required. However if the goal is simplicity I think that you'll need to do everything in one pass, and accept that there will be corner cases that might not link if the margin isn't sufficient. For ELF and arbitrary linker scripts I thought the chance of failure too high, for COFF the chance of failure may be low enough. I think the most likely edge case in COFF will be the presence of sections with high alignment requirements as inserting thunks could cause a lot of bytes of alignment padding to be added.

My apologies I clicked on the wrong link in the email, I should have been looking at D52156. Please ignore the past comment as it won't make much sense. The comment about writing the test might still make sense.

mstorsjo abandoned this revision.Sep 25 2018, 4:05 AM

Went with the alternative algorithm in D52156 instead.

COFF/Writer.cpp