Page MenuHomePhabricator

[LLD] Two tweaks to symbol ordering scheme
Needs RevisionPublic

Authored by yozhu on Wed, Jun 22, 2:41 PM.

Details

Reviewers
MaskRay
smeenai
Summary

The linker today will always put hot contributions in the middle of cold ones when targeting
RISC machine, so to minimize the chances of generating branch thunks for hot code calling
into cold code. This is not necessary if less than two input sections contain anything that
is executable, for example, when user specifies an ordering of read-only data (instead of
function) symbols. It is also not necessary if total size of output section is small such
that no branch thunk would ever be required, which is common for mobile apps. For example,
among all the native shared libraries in Facebook Instagram App for Android, 80% of them
have text section smaller than 64KB and the largest text section seen is less than 8MB, well
below the distance a BRANCH26 can reach. With these two tweaks we could potentially save
one page occupied by hot code or data.

Diff Detail

Event Timeline

yozhu created this revision.Wed, Jun 22, 2:41 PM
Herald added a project: Restricted Project. · View Herald Transcript
yozhu published this revision for review.Wed, Jun 22, 2:44 PM
Herald added a project: Restricted Project. · View Herald TranscriptWed, Jun 22, 2:44 PM

I assume the purpose of the test was to check specifically for the old behavior, so we should still have a part of the test that checks for that. Maybe we could use a linker script or a large .space directive to simulate a large text section? https://reviews.llvm.org/source/llvm-github/browse/main/lld/test/ELF/arm-thunk-edgecase.s is an example of using a linker script for such a purpose.

Come to think of it, I'm not sure if the size check will handle sections placed by linker script correctly, but I'm not sure how the old logic was handling that either.

lld/ELF/Writer.cpp
1336

This comment should probably be updated a bit to reflect the new conditional.

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

MaskRay added 1 blocking reviewer(s): MaskRay.Wed, Jun 22, 3:12 PM
MaskRay added a subscriber: srhines.

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

lld/ELF/Writer.cpp
1325

For testing, you may check D88037 for some tests using an output section address.

yozhu updated this revision to Diff 439184.Wed, Jun 22, 3:31 PM

Address review feedback

yozhu marked 2 inline comments as done.Wed, Jun 22, 3:32 PM
yozhu added a comment.Wed, Jun 22, 3:34 PM

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

This could potentially save one page that hot code occupies when branch thunk is definitely not required.

yozhu edited the summary of this revision. (Show Details)Wed, Jun 22, 3:51 PM
yozhu updated this revision to Diff 439208.Wed, Jun 22, 4:35 PM

Add a test case to verify the behavior when branch thunk might be needed

yozhu added a comment.Wed, Jun 22, 4:38 PM

@MaskRay @smeenai I have added a test case to cover the prior behavior when branch thunk might be required.

MaskRay requested changes to this revision.Wed, Jun 22, 4:49 PM

How does it save one page?

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.

Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

Please answer why executableInputSections > 1 is needed.

This revision now requires changes to proceed.Wed, Jun 22, 4:49 PM
yozhu added a comment.Wed, Jun 22, 5:04 PM

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

yozhu added a comment.Wed, Jun 22, 5:21 PM

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

Then just let existing code handle it?

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

OK. Please use split-file. Placing the two cases in one file will be clearer.

yozhu added a comment.Wed, Jun 22, 5:43 PM

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

Then just let existing code handle it?

If no branch thunk will be created, then there is no need to place hot code behind something else in the output section. Or if we are just ordering data symbols, we don't need to do this too.

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

OK. Please use split-file. Placing the two cases in one file will be clearer.

Yes will check in a revision to use the split-file feature.

yozhu updated this revision to Diff 439258.Wed, Jun 22, 8:57 PM

Rewrite the test cases using the split-file feature

srhines added a subscriber: kongyi.Thu, Jun 23, 3:25 AM
srhines added a subscriber: trybka.

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

This optimization requires profile data, which isn't generally in use for 1p NDK usage (to my knowledge). Maybe @trybka might know if there are PGO/AFDO users in our 1p apps. As for the Android platform, we do use AutoFDO, but it isn't possible for us to collect enough AFDO data to validate this a priori. We'd have to merge this patch into a toolchain that is currently being used for droidfood, and then experiment later to see what impact it had. Based on the way that Android schedules work, it would be months before we have enough droidfooders on next year's release. I think it might be best to see if this has an impact on any Google 1p apps for now.

Another way to look at the logic in this code is that, if no branch thunk is required, then to symbol ordering there is no difference between targeting RISC machine and CISC machine. While for CISC machine we just simply follow what ordering says, we should do the same for RISC machine as well.

We have a non-zero number of 1p apps who are generating data for PGO, but
it is not widely used (which is to say, I would not rely on that for test
coverage / soak testing if that's what you're asking).

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

This optimization requires profile data, which isn't generally in use for 1p NDK usage (to my knowledge). Maybe @trybka might know if there are PGO/AFDO users in our 1p apps. As for the Android platform, we do use AutoFDO, but it isn't possible for us to collect enough AFDO data to validate this a priori. We'd have to merge this patch into a toolchain that is currently being used for droidfood, and then experiment later to see what impact it had. Based on the way that Android schedules work, it would be months before we have enough droidfooders on next year's release. I think it might be best to see if this has an impact on any Google 1p apps for now.

The original heuristics (D44969) saves app size by reducing the number of branch thunks required for hot code calling into cold code within same output section. If we know such branch thunk definitely won't be required in certain cases, then the original reordering -- putting hot contributions in the middle of cold contributions -- won't be necessary. How would this change regress on app size for the two mentioned scenarios since no thunk will be generated anyway?

yozhu added a comment.Thu, Jun 23, 3:55 PM

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

`-z noseparate-code is the default case. Since it is purely random, this patch does not justify its usefulness.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

The patch introduced some non-trivial complexity and therefore it needs to justify it. For non-code sections I agree that not shuffling would look better but that only suggests that osec->flags & SHF_EXECINSTR (osec needs to passed from the caller of sortISDBySectionOrder) is fine, not executableInputSections > 1.

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

Yes. With no measurement we can't accept an arbitrary change which claims to be better.

lanza added a subscriber: lanza.Fri, Jun 24, 4:53 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

yozhu added a comment.Sun, Jun 26, 6:58 PM

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

`-z noseparate-code is the default case. Since it is purely random, this patch does not justify its usefulness.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

The patch introduced some non-trivial complexity and therefore it needs to justify it. For non-code sections I agree that not shuffling would look better but that only suggests that osec->flags & SHF_EXECINSTR (osec needs to passed from the caller of sortISDBySectionOrder) is fine, not executableInputSections > 1.

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

Yes. With no measurement we can't accept an arbitrary change which claims to be better.

Curious how the previous change (D44969) got accepted without any measurement for the impact on the "no need of branch thunk but still do shuffling" scenario. It makes sense when it implies higher probability of reducing number of required branch thunks, but sounds against what symbol ordering implies to user if no branch thunk is needed.

yozhu added a comment.Sun, Jun 26, 7:05 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.
  • It could be that output section is not marked as executable, but it does contain instructions and at runtime the output section might be reprotected to be executable.

So here we "trust" more on what compiler/assembler says on each input section's attribute. And since we are enumerating through all input sections anyway, counting executableInputSections doesn't add too much complication.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

  • It could be that output section is not marked as executable, but it does contain instructions and at runtime the output section might be reprotected to be executable.

No. The output section generally has all flags from input sections. See OutputSection::commitSection

So here we "trust" more on what compiler/assembler says on each input section's attribute. And since we are enumerating through all input sections anyway, counting executableInputSections doesn't add too much complication.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

MaskRay requested changes to this revision.Thu, Jun 30, 12:22 PM
This revision now requires changes to proceed.Thu, Jun 30, 12:22 PM
yozhu added a comment.Thu, Jun 30, 3:18 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.