This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Two tweaks to symbol ordering scheme
ClosedPublic

Authored by yozhu on Jun 22 2022, 2:41 PM.

Details

Summary

When --symbol-ordering-file is specified, the linker today will always put
hot contributions in the middle of cold ones when targeting RISC machine, so
to minimize the chances that branch thunks need be generated for hot code
calling into cold code. This is not necessary when user specifies an ordering
of read-only data (vs. function) symbols, or when output section is small such
that no branch thunk would ever be required. The latter is common for mobile
apps. For example, among all the native ARM64 libraries in Facebook Instagram
App for Android, 80% of them have text section smaller than 64KB and the
largest text section seen is less than 8MB, well below the distance that a
BRANCH26 can reach.

Diff Detail

Event Timeline

yozhu created this revision.Jun 22 2022, 2:41 PM
Herald added a project: Restricted Project. · View Herald Transcript
yozhu published this revision for review.Jun 22 2022, 2:44 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2022, 2:44 PM

I assume the purpose of the test was to check specifically for the old behavior, so we should still have a part of the test that checks for that. Maybe we could use a linker script or a large .space directive to simulate a large text section? https://reviews.llvm.org/source/llvm-github/browse/main/lld/test/ELF/arm-thunk-edgecase.s is an example of using a linker script for such a purpose.

Come to think of it, I'm not sure if the size check will handle sections placed by linker script correctly, but I'm not sure how the old logic was handling that either.

lld/ELF/Writer.cpp
1335

This comment should probably be updated a bit to reflect the new conditional.

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

MaskRay added 1 blocking reviewer(s): MaskRay.Jun 22 2022, 3:12 PM
MaskRay added a subscriber: srhines.

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

lld/ELF/Writer.cpp
1325

For testing, you may check D88037 for some tests using an output section address.

yozhu updated this revision to Diff 439184.Jun 22 2022, 3:31 PM

Address review feedback

yozhu marked 2 inline comments as done.Jun 22 2022, 3:32 PM
yozhu added a comment.Jun 22 2022, 3:34 PM

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

This could potentially save one page that hot code occupies when branch thunk is definitely not required.

yozhu edited the summary of this revision. (Show Details)Jun 22 2022, 3:51 PM
yozhu updated this revision to Diff 439208.Jun 22 2022, 4:35 PM

Add a test case to verify the behavior when branch thunk might be needed

yozhu added a comment.Jun 22 2022, 4:38 PM

@MaskRay @smeenai I have added a test case to cover the prior behavior when branch thunk might be required.

MaskRay requested changes to this revision.Jun 22 2022, 4:49 PM

How does it save one page?

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.

Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

Please answer why executableInputSections > 1 is needed.

This revision now requires changes to proceed.Jun 22 2022, 4:49 PM
yozhu added a comment.Jun 22 2022, 5:04 PM

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

yozhu added a comment.Jun 22 2022, 5:21 PM

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

Then just let existing code handle it?

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

OK. Please use split-file. Placing the two cases in one file will be clearer.

yozhu added a comment.Jun 22 2022, 5:43 PM

Please answer why executableInputSections > 1 is needed.

If there is zero or only one input section having instructions, the scenario of "hot code calling into cold code" doesn't exist.

Then just let existing code handle it?

If no branch thunk will be created, then there is no need to place hot code behind something else in the output section. Or if we are just ordering data symbols, we don't need to do this too.

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

4 size=0xC00000 input sections are too large (48MiB). If just for ARM::getThunkSectionSpacing (16MiB), it is too much.
Please check split-file. I think we need tests when totalSize >= target->getThunkSectionSpacing() triggers or not.

The original test covers the scenario where totalSize < target->getThunkSectionSpacing(). The new test covers the opposite.

OK. Please use split-file. Placing the two cases in one file will be clearer.

Yes will check in a revision to use the split-file feature.

yozhu updated this revision to Diff 439258.Jun 22 2022, 8:57 PM

Rewrite the test cases using the split-file feature

srhines added a subscriber: kongyi.Jun 23 2022, 3:25 AM
srhines added a subscriber: trybka.

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

This optimization requires profile data, which isn't generally in use for 1p NDK usage (to my knowledge). Maybe @trybka might know if there are PGO/AFDO users in our 1p apps. As for the Android platform, we do use AutoFDO, but it isn't possible for us to collect enough AFDO data to validate this a priori. We'd have to merge this patch into a toolchain that is currently being used for droidfood, and then experiment later to see what impact it had. Based on the way that Android schedules work, it would be months before we have enough droidfooders on next year's release. I think it might be best to see if this has an impact on any Google 1p apps for now.

Another way to look at the logic in this code is that, if no branch thunk is required, then to symbol ordering there is no difference between targeting RISC machine and CISC machine. While for CISC machine we just simply follow what ordering says, we should do the same for RISC machine as well.

We have a non-zero number of 1p apps who are generating data for PGO, but
it is not widely used (which is to say, I would not rely on that for test
coverage / soak testing if that's what you're asking).

This patch tries to adjust D44969 heuristics a bit.

This is not necessary if less than two input sections contain anything that is executable, for example, when user specifies an ordering of read-only data (instead of function) symbols. It is also not necessary if total size of output section is small such

that no branch thunk would ever be required, which is common for mobile apps.

I agree that the mentioned cases do not unnecessarily need reordering. It doesn't matter either way, then why change the behavior with more code?

You should also add the motivation for this change to the description (which I believe is to minimize the number of pages occupied by hot code).

Such a motivation is fine. Perhaps @srhines can test this for other applications.

This optimization requires profile data, which isn't generally in use for 1p NDK usage (to my knowledge). Maybe @trybka might know if there are PGO/AFDO users in our 1p apps. As for the Android platform, we do use AutoFDO, but it isn't possible for us to collect enough AFDO data to validate this a priori. We'd have to merge this patch into a toolchain that is currently being used for droidfood, and then experiment later to see what impact it had. Based on the way that Android schedules work, it would be months before we have enough droidfooders on next year's release. I think it might be best to see if this has an impact on any Google 1p apps for now.

The original heuristics (D44969) saves app size by reducing the number of branch thunks required for hot code calling into cold code within same output section. If we know such branch thunk definitely won't be required in certain cases, then the original reordering -- putting hot contributions in the middle of cold contributions -- won't be necessary. How would this change regress on app size for the two mentioned scenarios since no thunk will be generated anyway?

yozhu added a comment.Jun 23 2022, 3:55 PM

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

`-z noseparate-code is the default case. Since it is purely random, this patch does not justify its usefulness.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

The patch introduced some non-trivial complexity and therefore it needs to justify it. For non-code sections I agree that not shuffling would look better but that only suggests that osec->flags & SHF_EXECINSTR (osec needs to passed from the caller of sortISDBySectionOrder) is fine, not executableInputSections > 1.

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

Yes. With no measurement we can't accept an arbitrary change which claims to be better.

lanza added a subscriber: lanza.Jun 24 2022, 4:53 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

yozhu added a comment.Jun 26 2022, 6:58 PM

How does it save one page?

Hot code will always be placed together, so where it starts impact how many pages it will occupy. Moving it towards the beginning of the output section increases the possibility that one less page will be taken.

I am not sure this is true.
For -z separate-code layout, PT_LOAD program header has an aligned start address. I agree that placing hot code at the start may potentially remove one hot page.
For -z noseparate-code layout, I think we can construct a case that placing hot code at the start may use one more page.

This is to follow what ordering implies. For CISC machine target the linker won't do anything but just strictly follow the specified order.

For the -z noseparate-code case, it would be the same probability for the impact on number of pages used by hot contributions. We don't know where in a page .text section will start, so it will be purely random or with equal probability that hot contribution may take one more page, one less page, or take same number of pages, between placing hot contributions at section start (like what the linker does for CISC machines) and shuffling some amount of cold contributions before hot ones.

`-z noseparate-code is the default case. Since it is purely random, this patch does not justify its usefulness.

The current logic is to avoid generating branch thunks only. So if we know branch thunk is no needed, why bother doing the extra shuffling working?

The patch introduced some non-trivial complexity and therefore it needs to justify it. For non-code sections I agree that not shuffling would look better but that only suggests that osec->flags & SHF_EXECINSTR (osec needs to passed from the caller of sortISDBySectionOrder) is fine, not executableInputSections > 1.

And for the -z separarte-code case, it is guaranteed to be better (or being same in rare cases).

Yes. With no measurement we can't accept an arbitrary change which claims to be better.

Curious how the previous change (D44969) got accepted without any measurement for the impact on the "no need of branch thunk but still do shuffling" scenario. It makes sense when it implies higher probability of reducing number of required branch thunks, but sounds against what symbol ordering implies to user if no branch thunk is needed.

yozhu added a comment.Jun 26 2022, 7:05 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.
  • It could be that output section is not marked as executable, but it does contain instructions and at runtime the output section might be reprotected to be executable.

So here we "trust" more on what compiler/assembler says on each input section's attribute. And since we are enumerating through all input sections anyway, counting executableInputSections doesn't add too much complication.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

  • It could be that output section is not marked as executable, but it does contain instructions and at runtime the output section might be reprotected to be executable.

No. The output section generally has all flags from input sections. See OutputSection::commitSection

So here we "trust" more on what compiler/assembler says on each input section's attribute. And since we are enumerating through all input sections anyway, counting executableInputSections doesn't add too much complication.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

MaskRay requested changes to this revision.Jun 30 2022, 12:22 PM
This revision now requires changes to proceed.Jun 30 2022, 12:22 PM
yozhu added a comment.Jun 30 2022, 3:18 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

yozhu requested review of this revision.Jul 7 2022, 3:52 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.
Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()
I don't think it is useful to special case executableInputSections > 1

yozhu added a comment.Jul 7 2022, 9:54 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

yozhu added a comment.Jul 8 2022, 11:08 AM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.

Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.

Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.

We are using heuristics here. And I don't think there is a heuristic which can cover all valid scenarios.
totalSize >= target->getThunkSectionSpacing() is a useful condition. That is what (@smeenai mentioned) I don't fundamentally object to the idea.
When that is satisfied, if the output section has code sections, it is almost assured that the number of input code sections is greater than 1, so it's not really necessary to test executableInputSections > 1.
Again, there are linker scripts combining rodata and code sections, but they only do this for very small output sections. I use my experience reading many linker scripts here.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

Answered in the previous paragraph.

(from a previous comment) Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

I don't view the problem this way as I just see a patch with unneeded complexity. I judge a patch by its own merit, not where the author comes from. If you check many patches I am reviewing/have reviewed, it's very common for my to "Request Changes" even if the author works at Google.

smeenai added a subscriber: ruiu.Jul 8 2022, 12:02 PM

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

I think it'd be helpful for us to step back a bit. We all want to maintain the overall health and cleanliness of LLVM and LLD while also being able to make changes that help with the use cases we care about. We have different aspects of code we value more highly (e.g. simplicity vs. thoroughness), but all of us ultimately want the project to function as best as it can.

@ruiu always valued code simplicity highly for LLD, and I think @MaskRay has similar sensibilities :) Edge cases are important, of course, but LLD has historically considered it okay to not handle some rare ones until we have a proven need. The current version of the patch reads straightforwardly enough to me, but we can test it vs. a version where we only check if the output section is executable and see if it makes a difference for any of our internal libraries, to confirm if we have a use case for one check over the other.

yozhu added a comment.Jul 8 2022, 2:09 PM

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.

Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.

We are using heuristics here. And I don't think there is a heuristic which can cover all valid scenarios.
totalSize >= target->getThunkSectionSpacing() is a useful condition. That is what (@smeenai mentioned) I don't fundamentally object to the idea.
When that is satisfied, if the output section has code sections, it is almost assured that the number of input code sections is greater than 1, so it's not really necessary to test executableInputSections > 1.
Again, there are linker scripts combining rodata and code sections, but they only do this for very small output sections. I use my experience reading many linker scripts here.

Heuristics doesn't have to be perfect, I agree with that. However, if with small tweaks we can make the heuristics cover more valid cases, then I think it is worth the effort. ELF linker supports multiple executable output sections, and supports merging differently named input sections into one output section. Why we can't make the linker bullet proof for extra scenarios (even if they seem to be less common) with simple and low overhead tweak?

Your bar of complexity and simplicity seems unreasonably high.

I don't think it is useful to special case executableInputSections > 1

Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

To clarify, I don't think there's an objection to the change in general now (per @MaskRay's earlier comment after @lanza posted the 16.5% number). It's just a question of whether to count individual executable input sections or use the output section's executable-ness to make the ordering determination.

Right. The code can spend fewer lines to address the issue. This is one of the moments I just wanted to commandeer the patch and satisfying the user needs instead of wasting more and more time arguing.

Indeed we have spent quite some time arguing about this. I don't agree with you on "corner case", "not useful", or "fewer lines of code"; and I really don't understand why you push back so hard on this change. Before you requested another revision one week ago the only remaining unsettled point was whether we need to track each input section because you think their attributes cannot be reliably tracked, then I showed you with an example that this is not the case. Then I think it should be all good then.

Answered in the previous paragraph.

(from a previous comment) Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team). What are the rules here to determine what is useful and what not?

It is simply fixing a flaw in the original ordering scheme and the new additional logic doesn't add too much complexity or overhead compared to what is already there for mixing cold and hot contributions. Please don't abuse the administrator rights.

I don't view the problem this way as I just see a patch with unneeded complexity. I judge a patch by its own merit, not where the author comes from. If you check many patches I am reviewing/have reviewed, it's very common for my to "Request Changes" even if the author works at Google.

One interesting note Shoaib and I just discovered is that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso. Placing hot sections at the front of the .text section would give us a slice of a "free page" that, with the current algorithm, would just be cold code that's less likely to be used. I imagine this isn't a substantial difference but it's probably slightly in favor of placing hot code first here.

Thanks. This is fine, even if not that large. The patch should use osec->flags & SHF_EXECINSTR instead of executableInputSections for simplification. Then the complexity/benefit trade-off will look fine.

I didn't check output section's attribute but instead check on each input section because:

  • It could be that the output section contains .text (RX) contributions as well as .rdata (R) contributions, like read-only data is merged into text section.

This implies a linker script placing rodata into an executable output section.
This is a corner case and the operation here does not matter. Just be simple.

Even if it is a rare or corner case, it is still a valid case. Counting executableInputSections doesn't add many cycles. Why not just cover all the possible cases?

For input rodata in an output executable section, since the for loop doesn't precisely track whether an input section is executable or not, it is as if all rodata are executable as well.
Your executableInputSections rule doesn't cover more cases.

Input section's attribute can be precisely tracked since the linker has no reason to change input section's attribute once it is read from input OBJ. For example, if we have source code like

__attribute__((section(".text.rox")))
const int x = 5;

__attribute__((section(".text.row")))
const int w = 5;

int foo() { return x; }
int bar() { return w; }

and we compile it with -fdata-sections -ffunction-sections and then link the OBJ using a linker script like

SECTIONS
{
  ttt 0x1000 :
  {
    tt.o (.text.*)
  }
}

Then in the for loop we will see two input sections are executable and the other two are not.

I have mentioned that this is a corner case we don't necessarily try to optimize. It takes code.

As @lanza mentioned above that ~16.5% of the libraries in one of our major apps have .text sections starting on the first page of the dso, this is actually not a corner case for us. And given how much code are already there for symbol ordering in the linker, this change doesn't add that much overhead or complexity.

Note: even if some linker scripts combine rodata and code sections in one output section, such an output section is typically far smaller than target->getThunkSectionSpacing()

Why typically it has to be far smaller? The linker doesn't restrict the use of linker scripts for combining RO and X sections to form a large output section. And typically code plus read-only data takes significant percentage of binary size.

My comment discusses how likely a user combines rodata and code in one output section. It's not about the extent of all rodata and code output sections.

Should we cover all the valid scenarios if possible? I don't think it is good practice to leave something to just cover most, especially the cost of covering all isn't that much as one would reasonably be concerned.

We are using heuristics here. And I don't think there is a heuristic which can cover all valid scenarios.
totalSize >= target->getThunkSectionSpacing() is a useful condition. That is what (@smeenai mentioned) I don't fundamentally object to the idea.
When that is satisfied, if the output section has code sections, it is almost assured that the number of input code sections is greater than 1, so it's not really necessary to test executableInputSections > 1.
Again, there are linker scripts combining rodata and code sections, but they only do this for very small output sections. I use my experience reading many linker scripts here.

Heuristics doesn't have to be perfect, I agree with that. However, if with small tweaks we can make the heuristics cover more valid cases, then I think it is worth the effort. ELF linker supports multiple executable output sections, and supports merging differently named input sections into one output section. Why we can't make the linker bullet proof for extra scenarios (even if they seem to be less common) with simple and low overhead tweak?

executableInputSections > 1 doesn't cover more valid cases than a simple output section flag test for practical usage.
I encourage that you do some experiments how likely executableInputSections == 1 when the other conditions are satisfied (or executableInputSections == 0 with an executable output section).
I mentioned that with a linker script such a case can be constructed but it is far from realistic. The comment was made way before you raised your linker script example.

Your bar of complexity and simplicity seems unreasonably high.

I realize that such opinions are always subjective. So in the end such a comment never contributes to a constructive discussion.
Please try to focus more on the technical side of things, not make it personal or political.
Stay away from "Well, it may not be useful to you or Google, but this is useful to us (as mentioned above) and this is the linker for open source community (not just you or your team)." and "Please don't abuse the administrator rights."

yozhu updated this revision to Diff 443785.Jul 11 2022, 4:28 PM

Updating D128382: [LLD] Two tweaks to symbol ordering scheme

yozhu edited the summary of this revision. (Show Details)Jul 11 2022, 4:30 PM
MaskRay added inline comments.Jul 11 2022, 4:39 PM
lld/ELF/Writer.cpp
1323

Delete if (executableOutputSection)

1369

it's more natural to place two target->getThunkSectionSpacing() together.

lld/test/ELF/arm-symbol-ordering-file.s
16

We need a test that readonly sections on'y trigger the logic.

You can use llvm-objcopy --set-section-flags .foo=alloc,readonly %t.small.o to drop SHF_EXECINSTR.

yozhu added inline comments.Jul 11 2022, 10:16 PM
lld/ELF/Writer.cpp
1323

If output section is executable, we don't need to count totalSize.

yozhu added inline comments.Jul 11 2022, 10:22 PM
lld/ELF/Writer.cpp
1369

It is kind of to do cheaper (equal or not equal to zero) checks first and then do relatively more expensive (comparing two 64-bit integers) check. But I can swap the 2nd and 3rd checks; it likely doesn't matter that much.

MaskRay added inline comments.Jul 11 2022, 10:29 PM
lld/ELF/Writer.cpp
1323

Since getSize actually takes not-too-few instructions. Keeping if (executableOutputSection) is fine.

yozhu updated this revision to Diff 443976.Jul 12 2022, 8:59 AM

Updating D128382: [LLD] Two tweaks to symbol ordering scheme

yozhu marked 2 inline comments as done.Jul 12 2022, 9:00 AM
MaskRay accepted this revision.Jul 12 2022, 10:20 AM

The linker today will always put hot contributions in the middle of cold ones when targeting RISC machine, so to minimize the chances that branch thunks need be generated for hot code calling into cold code.

Mention --symbol-ordering-file in the subject or the commit message, otherwise readers may thought this applies to normal usage.

lld/ELF/Writer.cpp
1363

total size of isd

This revision is now accepted and ready to land.Jul 12 2022, 10:20 AM
yozhu edited the summary of this revision. (Show Details)Jul 12 2022, 10:37 AM
yozhu updated this revision to Diff 444011.Jul 12 2022, 10:49 AM

Updating D128382: [LLD] Two tweaks to symbol ordering scheme

yozhu marked an inline comment as done.Jul 12 2022, 10:50 AM
This revision was automatically updated to reflect the committed changes.