Today the linker always places hot contributions in the middle of cold ones when targeting RISC machines, to minimize the chance of generating branch thunks for hot code calling into cold code. This is not necessary if fewer than two input sections contain anything executable, for example when the user specifies an ordering of read-only data (rather than function) symbols. It is also not necessary if the total size of the output section is small enough that no branch thunk would ever be required, which is common for mobile apps. For example, among all the native shared libraries in the Facebook Instagram app for Android, 80% have a text section smaller than 64 KB, and the largest text section seen is less than 8 MB, well below the distance a BRANCH26 relocation can reach. With these two tweaks we could potentially save one page occupied by hot code or data.
Repository: rG LLVM Github Monorepo

Event Timeline
I assume the purpose of the test was to check specifically for the old behavior, so we should still have a part of the test that checks for that. Maybe we could use a linker script or a large .space directive to simulate a large text section? https://reviews.llvm.org/source/llvm-github/browse/main/lld/test/ELF/arm-thunk-edgecase.s is an example of using a linker script for such a purpose.
Come to think of it, I'm not sure if the size check will handle sections placed by linker script correctly, but I'm not sure how the old logic was handling that either.
lld/ELF/Writer.cpp:1336
This comment should probably be updated a bit to reflect the new conditional.
You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).
This patch tries to adjust the D44969 heuristics a bit.
This is not necessary if fewer than two input sections contain anything executable, for example when the user specifies an ordering of read-only data (rather than function) symbols. It is also not necessary if the total size of the output section is small enough that no branch thunk would ever be required, which is common for mobile apps.
I agree that the mentioned cases do not need the reordering. But if it doesn't matter either way, why change the behavior with more code?
Such a motivation is fine. Perhaps @srhines can test this for other applications.
lld/ELF/Writer.cpp:1325
This could potentially save one page that hot code occupies when a branch thunk is definitely not required.
How does it save one page?
Four size=0xC00000 input sections are too large (48 MiB). If this is just to exceed ARM::getThunkSectionSpacing (16 MiB), it is too much.
Please check split-file. I think we need tests covering both when totalSize >= target->getThunkSectionSpacing() triggers and when it does not.
Please answer why executableInputSections > 1 is needed.
If zero or only one input section contains instructions, the scenario of "hot code calling into cold code" doesn't exist.
Hot code will always be placed together, so where it starts affects how many pages it will occupy. Moving it toward the beginning of the output section increases the chance that one fewer page will be taken.
The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.
Then just let existing code handle it?
Hot code will always be placed together, so where it starts affects how many pages it will occupy. Moving it toward the beginning of the output section increases the chance that one fewer page will be taken.
I am not sure this is true.
For the -z separate-code layout, the PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For the -z noseparate-code layout, I think we can construct a case where placing hot code at the start may use one more page.
OK. Please use split-file. Placing the two cases in one file will be clearer.
If no branch thunk will be created, then there is no need to place hot code behind something else in the output section. And if we are just ordering data symbols, we don't need to do this either.
Hot code will always be placed together, so where it starts affects how many pages it will occupy. Moving it toward the beginning of the output section increases the chance that one fewer page will be taken.
I am not sure this is true.
For the -z separate-code layout, the PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For the -z noseparate-code layout, I think we can construct a case where placing hot code at the start may use one more page.
This is to follow what the ordering implies. For CISC machine targets, the linker does nothing special and just strictly follows the specified order.
OK. Please use split-file. Placing the two cases in one file will be clearer.
Yes, I will check in a revision to use the split-file feature.
This optimization requires profile data, which isn't generally in use for 1p NDK usage (to my knowledge). Maybe @trybka might know if there are PGO/AFDO users in our 1p apps. As for the Android platform, we do use AutoFDO, but it isn't possible for us to collect enough AFDO data to validate this a priori. We'd have to merge this patch into a toolchain that is currently being used for droidfood, and then experiment later to see what impact it had. Based on the way that Android schedules work, it would be months before we have enough droidfooders on next year's release. I think it might be best to see if this has an impact on any Google 1p apps for now.
Another way to look at the logic in this code: if no branch thunk is required, then symbol ordering makes no distinction between targeting a RISC machine and a CISC machine. Since for CISC machines we simply follow what the ordering says, we should do the same for RISC machines as well.
We have a non-zero number of 1p apps that are generating data for PGO, but it is not widely used (which is to say, I would not rely on that for test coverage / soak testing, if that's what you're asking).
The original heuristic (D44969) saves app size by reducing the number of branch thunks required for hot code calling into cold code within the same output section. If we know such branch thunks definitely won't be required in certain cases, then the original reordering (putting hot contributions in the middle of cold contributions) isn't necessary. How would this change regress app size in the two mentioned scenarios, since no thunk will be generated anyway?
Hot code will always be placed together, so where it starts affects how many pages it will occupy. Moving it toward the beginning of the output section increases the chance that one fewer page will be taken.
I am not sure this is true.
For the -z separate-code layout, the PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For the -z noseparate-code layout, I think we can construct a case where placing hot code at the start may use one more page.

This is to follow what the ordering implies. For CISC machine targets, the linker does nothing special and just strictly follows the specified order.
For the -z noseparate-code case, the impact on the number of pages used by hot contributions would be the same in expectation. We don't know where within a page the .text section will start, so it is purely random, with equal probability, whether the hot contributions take one more page, one fewer page, or the same number of pages, between placing hot contributions at the section start (as the linker does for CISC machines) and shuffling some amount of cold contributions before the hot ones.
The current logic exists only to avoid generating branch thunks. So if we know a branch thunk is not needed, why bother doing the extra shuffling work?
And for the -z separate-code case, it is guaranteed to be better (or the same in rare cases).
-z noseparate-code is the default. Since the outcome there is purely random, this patch does not justify its usefulness.
The current logic exists only to avoid generating branch thunks. So if we know a branch thunk is not needed, why bother doing the extra shuffling work?
The patch introduces some non-trivial complexity and therefore needs to justify it. For non-code sections I agree that not shuffling looks better, but that only suggests that checking osec->flags & SHF_EXECINSTR (osec needs to be passed in from the caller of sortISDBySectionOrder) is fine, not executableInputSections > 1.
And for the -z separate-code case, it is guaranteed to be better (or the same in rare cases).
Yes. Without measurements we can't accept an arbitrary change that claims to be better.
One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.
Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.
I'm curious how the previous change (D44969) got accepted without any measurement of the impact on the "no need for a branch thunk but still do the shuffling" scenario. It makes sense when it implies a higher probability of reducing the number of required branch thunks, but it goes against what symbol ordering implies to the user when no branch thunk is needed.
I didn't check the output section's attribute but instead checked each input section because:
- It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, e.g. when read-only data is merged into the text section.
- It could be that the output section is not marked executable, but it does contain instructions, and at runtime the output section might be reprotected to be executable.
So here we "trust" what the compiler/assembler says about each input section's attributes. And since we are enumerating all input sections anyway, counting executableInputSections doesn't add much complication.
This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just keep it simple.
- It could be that the output section is not marked executable, but it does contain instructions, and at runtime the output section might be reprotected to be executable.
No. The output section generally has all the flags from its input sections. See OutputSection::commitSection.
So here we "trust" what the compiler/assembler says about each input section's attributes. And since we are enumerating all input sections anyway, counting executableInputSections doesn't add much complication.
Even if it is a rare corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?
For input rodata in an executable output section, since the for loop doesn't precisely track whether each input section is executable, it is as if all the rodata were executable as well.
Your executableInputSections rule doesn't cover more cases.
An input section's attributes can be tracked precisely, since the linker has no reason to change them once they are read from the input object file. For example, if we have source code like
__attribute__((section(".text.rox"))) const int x = 5;
__attribute__((section(".text.row"))) const int w = 5;
int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the object using a linker script like
SECTIONS { ttt 0x1000 : { tt.o (.text.*) } }
Then in the for loop we will see that two of the input sections are executable and the other two are not.