Download Raw Diff

Details

Reviewers

MaskRay
smeenai
alexander-shaposhnikov

Commits

rG2324c2e3c311: [LLD] Two tweaks to symbol ordering scheme

Summary

When --symbol-ordering-file is specified, the linker today will always put
hot contributions in the middle of cold ones when targeting RISC machine, so
to minimize the chances that branch thunks need be generated for hot code
calling into cold code. This is not necessary when user specifies an ordering
of read-only data (vs. function) symbols, or when output section is small such
that no branch thunk would ever be required. The latter is common for mobile
apps. For example, among all the native ARM64 libraries in Facebook Instagram
App for Android, 80% of them have text section smaller than 64KB and the
largest text section seen is less than 8MB, well below the distance that a
BRANCH26 can reach.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

yozhu created this revision.Jun 22 2022, 2:41 PM

Herald added a reviewer: MaskRay. · View Herald TranscriptJun 22 2022, 2:41 PM

Herald added a project: Restricted Project. · View Herald Transcript

Herald added subscribers: arichardson, emaste. · View Herald Transcript

yozhu added a reviewer: smeenai.Jun 22 2022, 2:43 PM

Herald added a subscriber: StephenFan. · View Herald TranscriptJun 22 2022, 2:43 PM

yozhu published this revision for review.Jun 22 2022, 2:44 PM

Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2022, 2:44 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

I assume the purpose of the test was to check specifically for the old behavior, so we should still have a part of the test that checks for that. Maybe we could use a linker script or a large .space directive to simulate a large text section? https://reviews.llvm.org/source/llvm-github/browse/main/lld/test/ELF/arm-thunk-edgecase.s is an example of using a linker script for such a purpose.

Come to think of it, I'm not sure if the size check will handle sections placed by linker script correctly, but I'm not sure how the old logic was handling that either.

lld/ELF/Writer.cpp
1335	This comment should probably be updated a bit to reflect the new conditional.

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Harbormaster completed remote builds in B171426: Diff 439162.Jun 22 2022, 3:04 PM

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

In D128382#3603081, @smeenai wrote:

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

lld/ELF/Writer.cpp
1325	https://llvm.org/docs/CodingStandards.html#prefer-preincrement

For testing, you may check D88037 for some tests using an output section address.

Address review feedback

yozhu marked 2 inline comments as done.Jun 22 2022, 3:32 PM

In D128382#3603142, @MaskRay wrote:

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

This could potentially save one page that hot code occupies when branch thunk is definitely not required.

Harbormaster completed remote builds in B171443: Diff 439184.Jun 22 2022, 3:45 PM

yozhu edited the summary of this revision. (Show Details)Jun 22 2022, 3:51 PM

Add a test case to verify the behavior when branch thunk might be needed

@MaskRay @smeenai I have added a test case to cover the prior behavior when branch thunk might be required.

How does it save one page?

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.

Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

Please answer why executableInputSections > 1 is needed.

This revision now requires changes to proceed.Jun 22 2022, 4:49 PM

Harbormaster completed remote builds in B171463: Diff 439208.Jun 22 2022, 4:50 PM

(split-file is covered in https://llvm.org/docs/TestingGuide.html#extra-files)

In D128382#3603513, @MaskRay wrote:

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

In D128382#3603513, @MaskRay wrote:

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

In D128382#3603513, @MaskRay wrote:

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

In D128382#3603531, @yozhu wrote:

In D128382#3603513, @MaskRay wrote:

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

Then just let existing code handle it?

In D128382#3603513, @MaskRay wrote:

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

In D128382#3603560, @yozhu wrote:

In D128382#3603513, @MaskRay wrote:

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

OK. Please use split-file. Placing the two cases in one file will be clearer.

In D128382#3603570, @MaskRay wrote:

In D128382#3603531, @yozhu wrote:

In D128382#3603513, @MaskRay wrote:

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

Then just let existing code handle it?

If no branch thunk will be created, then there is no need to place hot code behind something else in the output section. Or if we are just ordering data symbols, we don't need to do this too.

In D128382#3603513, @MaskRay wrote:

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

In D128382#3603560, @yozhu wrote:

In D128382#3603513, @MaskRay wrote:

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

OK. Please use split-file. Placing the two cases in one file will be clearer.

Yes will check in a revision to use the split-file feature.

Rewrite the test cases using the split-file feature

Harbormaster completed remote builds in B171507: Diff 439258.Jun 22 2022, 9:11 PM

In D128382#3603142, @MaskRay wrote:

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

In D128382#3603081, @smeenai wrote:

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

This optimization requires profile data, which isn't generally in use for 1p NDK usage (to my knowledge). Maybe @trybka might know if there are PGO/AFDO users in our 1p apps. As for the Android platform, we do use AutoFDO, but it isn't possible for us to collect enough AFDO data to validate this a priori. We'd have to merge this patch into a toolchain that is currently being used for droidfood, and then experiment later to see what impact it had. Based on the way that Android schedules work, it would be months before we have enough droidfooders on next year's release. I think it might be best to see if this has an impact on any Google 1p apps for now.

Another way to look at the logic in this code is that, if no branch thunk is required, then to symbol ordering there is no difference between targeting RISC machine and CISC machine. While for CISC machine we just simply follow what ordering says, we should do the same for RISC machine as well.

We have a non-zero number of 1p apps who are generating data for PGO, but
it is not widely used (which is to say, I would not rely on that for test
coverage / soak testing if that's what you're asking).

In D128382#3604310, @srhines wrote:

In D128382#3603142, @MaskRay wrote:

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

In D128382#3603081, @smeenai wrote:

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

This optimization requires profile data, which isn't generally in use for 1p NDK usage (to my knowledge). Maybe @trybka might know if there are PGO/AFDO users in our 1p apps. As for the Android platform, we do use AutoFDO, but it isn't possible for us to collect enough AFDO data to validate this a priori. We'd have to merge this patch into a toolchain that is currently being used for droidfood, and then experiment later to see what impact it had. Based on the way that Android schedules work, it would be months before we have enough droidfooders on next year's release. I think it might be best to see if this has an impact on any Google 1p apps for now.

The original heuristics (D44969) saves app size by reducing the number of branch thunks required for hot code calling into cold code within same output section. If we know such branch thunk definitely won't be required in certain cases, then the original reordering -- putting hot contributions in the middle of cold contributions -- won't be necessary. How would this change regress on app size for the two mentioned scenarios since no thunk will be generated anyway?

In D128382#3603513, @MaskRay wrote:

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

In D128382#3606454, @yozhu wrote:

In D128382#3603513, @MaskRay wrote:

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

`-z noseparate-code is the default case. Since it is purely random, this patch does not justify its usefulness.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

The patch introduced some non-trivial complexity and therefore it needs to justify it. For non-code sections I agree that not shuffling would look better but that only suggests that osec->flags & SHF_EXECINSTR (osec needs to passed from the caller of sortISDBySectionOrder) is fine, not executableInputSections > 1.

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

Yes. With no measurement we can't accept an arbitrary change which claims to be better.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

In D128382#3608798, @MaskRay wrote:

In D128382#3606454, @yozhu wrote:

In D128382#3603513, @MaskRay wrote:

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

`-z noseparate-code is the default case. Since it is purely random, this patch does not justify its usefulness.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

The patch introduced some non-trivial complexity and therefore it needs to justify it. For non-code sections I agree that not shuffling would look better but that only suggests that osec->flags & SHF_EXECINSTR (osec needs to passed from the caller of sortISDBySectionOrder) is fine, not executableInputSections > 1.

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

Yes. With no measurement we can't accept an arbitrary change which claims to be better.

Curious how the previous change (D44969) got accepted without any measurement for the impact on the "no need of branch thunk but still do shuffling" scenario. It makes sense when it implies higher probability of reducing number of required branch thunks, but sounds against what symbol ordering implies to user if no branch thunk is needed.

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.
It could be that output section is not marked as executable, but it does contain instructions and at runtime the output section might be reprotected to be executable.

So here we "trust" more on what compiler/assembler says on each input section's attribute. And since we are enumerating through all input sections anyway, counting executableInputSections doesn't add too much complication.

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

It could be that output section is not marked as executable, but it does contain instructions and at runtime the output section might be reprotected to be executable.

No. The output section generally has all flags from input sections. See OutputSection::commitSection

So here we "trust" more on what compiler/assembler says on each input section's attribute. And since we are enumerating through all input sections anyway, counting executableInputSections doesn't add too much complication.

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

MaskRay requested changes to this revision.Jun 30 2022, 12:22 PM

This revision now requires changes to proceed.Jun 30 2022, 12:22 PM

In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

yozhu requested review of this revision.Jul 7 2022, 3:52 PM

In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()
I don't think it is useful to special case executableInputSections > 1

In D128382#3637822, @MaskRay wrote:
In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.
I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

In D128382#3637828, @yozhu wrote:
In D128382#3637822, @MaskRay wrote:
In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.
I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

In D128382#3639333, @smeenai wrote:
In D128382#3637828, @yozhu wrote:
In D128382#3637822, @MaskRay wrote:
In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.
I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

In D128382#3639367, @MaskRay wrote:
In D128382#3639333, @smeenai wrote:
In D128382#3637828, @yozhu wrote:
In D128382#3637822, @MaskRay wrote:
In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.
I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.
My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.

Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

In D128382#3639411, @yozhu wrote:
In D128382#3639367, @MaskRay wrote:
In D128382#3639333, @smeenai wrote:
In D128382#3637828, @yozhu wrote:
In D128382#3637822, @MaskRay wrote:
In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.
I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.
My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.
Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.

We are using heuristics here. And I don't think there is a heuristic which can cover all valid scenarios.
totalSize >= target->getThunkSectionSpacing() is a useful condition. That is what (@smeenai mentioned) I don't fundamentally object to the idea.
When that is satisfied, if the output section has code sections, it is almost assured that the number of input code sections is greater than 1, so it's not really necessary to test executableInputSections > 1.
Again, there are linker scripts combining rodata and code sections, but they only do this for very small output sections. I use my experience reading many linker scripts here.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

Answered in the previous paragraph.

(from a previous comment) Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

I don't view the problem this way as I just see a patch with unneeded complexity. I judge a patch by its own merit, not where the author comes from. If you check many patches I am reviewing/have reviewed, it's very common for my to "Request Changes" even if the author works at Google.

In D128382#3639411, @yozhu wrote:

In D128382#3639367, @MaskRay wrote:

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

I think it'd be helpful for us to step back a bit. We all want to maintain the overall health and cleanliness of LLVM and LLD while also being able to make changes that help with the use cases we care about. We have different aspects of code we value more highly (e.g. simplicity vs. thoroughness), but all of us ultimately want the project to function as best as it can.

@ruiu always valued code simplicity highly for LLD, and I think @MaskRay has similar sensibilities :) Edge cases are important, of course, but LLD has historically considered it okay to not handle some rare ones until we have a proven need. The current version of the patch reads straightforwardly enough to me, but we can test it vs. a version where we only check if the output section is executable and see if it makes a difference for any of our internal libraries, to confirm if we have a use case for one check over the other.

In D128382#3639492, @MaskRay wrote:
In D128382#3639411, @yozhu wrote:
In D128382#3639367, @MaskRay wrote:
In D128382#3639333, @smeenai wrote:
In D128382#3637828, @yozhu wrote:
In D128382#3637822, @MaskRay wrote:
In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.
I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.
My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.
Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.
We are using heuristics here. And I don't think there is a heuristic which can cover all valid scenarios.
totalSize >= target->getThunkSectionSpacing() is a useful condition. That is what (@smeenai mentioned) I don't fundamentally object to the idea.
When that is satisfied, if the output section has code sections, it is almost assured that the number of input code sections is greater than 1, so it's not really necessary to test executableInputSections > 1.
Again, there are linker scripts combining rodata and code sections, but they only do this for very small output sections. I use my experience reading many linker scripts here.

Heuristics doesn't have to be perfect, I agree with that. However, if with small tweaks we can make the heuristics cover more valid cases, then I think it is worth the effort. ELF linker supports multiple executable output sections, and supports merging differently named input sections into one output section. Why we can't make the linker bullet proof for extra scenarios (even if they seem to be less common) with simple and low overhead tweak?

Your bar of complexity and simplicity seems unreasonably high.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

Answered in the previous paragraph.

(from a previous comment) Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

I don't view the problem this way as I just see a patch with unneeded complexity. I judge a patch by its own merit, not where the author comes from. If you check many patches I am reviewing/have reviewed, it's very common for my to "Request Changes" even if the author works at Google.

In D128382#3639847, @yozhu wrote:
In D128382#3639492, @MaskRay wrote:
In D128382#3639411, @yozhu wrote:
In D128382#3639367, @MaskRay wrote:
In D128382#3639333, @smeenai wrote:
In D128382#3637828, @yozhu wrote:
In D128382#3637822, @MaskRay wrote:
In D128382#3623746, @yozhu wrote:
In D128382#3620661, @MaskRay wrote:

In D128382#3611295, @yozhu wrote:

In D128382#3611207, @MaskRay wrote:

In D128382#3611193, @yozhu wrote:

In D128382#3610276, @MaskRay wrote:

In D128382#3609648, @lanza wrote:

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like
__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }
and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like
SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}
Then in the for loop we will see two input sections are executable and the other two are not.
I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.
My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.
Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.
We are using heuristics here. And I don't think there is a heuristic which can cover all valid scenarios.
totalSize >= target->getThunkSectionSpacing() is a useful condition. That is what (@smeenai mentioned) I don't fundamentally object to the idea.
When that is satisfied, if the output section has code sections, it is almost assured that the number of input code sections is greater than 1, so it's not really necessary to test executableInputSections > 1.
Again, there are linker scripts combining rodata and code sections, but they only do this for very small output sections. I use my experience reading many linker scripts here.
Heuristics doesn't have to be perfect, I agree with that. However, if with small tweaks we can make the heuristics cover more valid cases, then I think it is worth the effort. ELF linker supports multiple executable output sections, and supports merging differently named input sections into one output section. Why we can't make the linker bullet proof for extra scenarios (even if they seem to be less common) with simple and low overhead tweak?

executableInputSections > 1 doesn't cover more valid cases than a simple output section flag test for practical usage.
I encourage that you do some experiments how likely executableInputSections == 1 when the other conditions are satisfied (or executableInputSections == 0 with an executable output section).
I mentioned that with a linker script such a case can be constructed but it is far from realistic. The comment was made way before you raised your linker script example.

Your bar of complexity and simplicity seems unreasonably high.

I realize that such opinions are always subjective. So in the end such a comment never contributes to a constructive discussion.
Please try to focus more on the technical side of things, not make it personal or political.
Stay away from "Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team)." and "Please don't abuse the administrator rights."

Updating D128382: [LLD] Two tweaks to symbol ordering scheme

yozhu edited the summary of this revision. (Show Details)Jul 11 2022, 4:30 PM

Herald added a subscriber: kristof.beyls. · View Herald TranscriptJul 11 2022, 4:30 PM

MaskRay added inline comments.Jul 11 2022, 4:39 PM

lld/ELF/Writer.cpp
1323	Delete `if (executableOutputSection)`
1369	it's more natural to place two `target->getThunkSectionSpacing()` together.
lld/test/ELF/arm-symbol-ordering-file.s
16	We need a test that readonly sections on'y trigger the logic. You can use `llvm-objcopy --set-section-flags .foo=alloc,readonly %t.small.o` to drop SHF_EXECINSTR.

Harbormaster completed remote builds in B174766: Diff 443785.Jul 11 2022, 4:56 PM

yozhu added inline comments.Jul 11 2022, 10:16 PM

lld/ELF/Writer.cpp
1323	If output section is executable, we don't need to count `totalSize`.

yozhu added inline comments.Jul 11 2022, 10:22 PM

lld/ELF/Writer.cpp
1369	It is kind of to do cheaper (equal or not equal to zero) checks first and then do relatively more expensive (comparing two 64-bit integers) check. But I can swap the 2nd and 3rd checks; it likely doesn't matter that much.

MaskRay added inline comments.Jul 11 2022, 10:29 PM

lld/ELF/Writer.cpp
1323	Since `getSize` actually takes not-too-few instructions. Keeping `if (executableOutputSection)` is fine.

Updating D128382: [LLD] Two tweaks to symbol ordering scheme

Herald added a reviewer: alexander-shaposhnikov. · View Herald TranscriptJul 12 2022, 8:59 AM

yozhu marked 2 inline comments as done.Jul 12 2022, 9:00 AM

Harbormaster completed remote builds in B174898: Diff 443976.Jul 12 2022, 9:45 AM

The linker today will always put hot contributions in the middle of cold ones when targeting RISC machine, so to minimize the chances that branch thunks need be generated for hot code calling into cold code.

Mention --symbol-ordering-file in the subject or the commit message, otherwise readers may thought this applies to normal usage.

lld/ELF/Writer.cpp
1363	total size of isd

This revision is now accepted and ready to land.Jul 12 2022, 10:20 AM

yozhu edited the summary of this revision. (Show Details)Jul 12 2022, 10:37 AM

Updating D128382: [LLD] Two tweaks to symbol ordering scheme

yozhu marked an inline comment as done.Jul 12 2022, 10:50 AM

Harbormaster completed remote builds in B174925: Diff 444011.Jul 12 2022, 11:35 AM

Closed by commit rG2324c2e3c311: [LLD] Two tweaks to symbol ordering scheme (authored by yozhu, committed by smeenai). · Explain WhyJul 12 2022, 11:36 AM

This revision was automatically updated to reflect the committed changes.

smeenai added a commit: rG2324c2e3c311: [LLD] Two tweaks to symbol ordering scheme.

This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Two tweaks to symbol ordering scheme
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 444027

lld/ELF/Writer.cpp

lld/test/ELF/arm-symbol-ordering-file.s

This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Two tweaks to symbol ordering schemeClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 444027

lld/ELF/Writer.cpp

lld/test/ELF/arm-symbol-ordering-file.s

[LLD] Two tweaks to symbol ordering scheme
ClosedPublic