This is an archive of the discontinued LLVM Phabricator instance.

[LLD] Add archive Name to relocaiton overflow printout
AbandonedPublic

Authored by ayermolo on Sep 7 2021, 4:04 PM.

Details

Reviewers
MaskRay
Summary

When sourceFile is found only it is being printed. For root causing relocation overflows it's good to know from which archive it came from.

Diff Detail

Event Timeline

ayermolo created this revision.Sep 7 2021, 4:04 PM
ayermolo requested review of this revision.Sep 7 2021, 4:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2021, 4:04 PM
MaskRay requested changes to this revision.Sep 7 2021, 4:12 PM

toString(file) already contains the archive name.

This revision now requires changes to proceed.Sep 7 2021, 4:12 PM

toString(file) already contains the archive name.

Well yes, but it's only used when srcFile is not found. When it is found what is printed is:
ld.lld: error: foo.cpp:(.section_name+0x6): relocation <rest of the message>
With this change when srcFile is found output is similar to what toString(file) returns. Except instead of object file name it will be srcFile.

Now maybe whole thing can be simplified and toString(file) can just be used in general?

MaskRay added a subscriber: evgeny777.EditedSep 7 2021, 4:38 PM

Oh, yes. This occasionally tripped over me as well. Perhaps the getDILineInfo code path needs the object/archive filename as well.

  // First check if we can get desired values from debugging information.
  if (Optional<DILineInfo> info = getFile<ELFT>()->getDILineInfo(this, offset))
    return info->FileName + ":" + std::to_string(info->Line) + ":(" +
           secAndOffset + ")";

-  // File->sourceFile contains STT_FILE symbol that contains a
-  // source file name. If it's missing, we use an object file name.
-  std::string srcFile = std::string(getFile<ELFT>()->sourceFile);
-  if (srcFile.empty())
-    srcFile = toString(file);
+  std::string srcFile = toString(file);

I understand that source filename+line number is handy for hand-written assembly, but the .o filename is usually needed so that llvm-readelf -r on the file can do something useful.

However, with just object filename, the (function xxx) part seems less useful.

The code was from D25826 by @evgeny777

Oh, yes. This occasionally tripped over me as well. Perhaps the getDILineInfo code path needs the object/archive filename as well.

  // First check if we can get desired values from debugging information.
  if (Optional<DILineInfo> info = getFile<ELFT>()->getDILineInfo(this, offset))
    return info->FileName + ":" + std::to_string(info->Line) + ":(" +
           secAndOffset + ")";

-  // File->sourceFile contains STT_FILE symbol that contains a
-  // source file name. If it's missing, we use an object file name.
-  std::string srcFile = std::string(getFile<ELFT>()->sourceFile);
-  if (srcFile.empty())
-    srcFile = toString(file);
+  std::string srcFile = toString(file);

I understand that source filename+line number is handy for hand-written assembly, but the .o filename is usually needed so that llvm-readelf -r on the file can do something useful.

However, with just object filename, the (function xxx) part seems less useful.

The code was from D25826 by @evgeny777

Ah thanks for providing context. @evgeny777 thoughts?

@MaskRay Circling back to this after PTO. Since there was no feedback from Eugene, what do you think? Leave it like this to minimize impact and preserve original intent?

ayermolo updated this revision to Diff 381687.Oct 22 2021, 5:32 PM

Added test.
Was waiting to make sure patch is in right direction before spending time on a test.

Is this still needed?


For recording all potential out-of-range relocations, you can use --emit-relocs. It is used by some post-link analysis tools like some tools in the Linux kernel and BOLT.
You can extract SHT_RELA sections without the SHF_ALLOC flag from the linked output, and then analyze the R_X86_64_PC32 relocations. The output can be huge as --emit-relocs contains every input relocation. To make the output smaller, you can specify -g0 to drop debug info.
It is still an open issue that clang -g codegen may be slightly different from -g0 codegen in some cases, but as a ballpark figure it should do its job well:)

Is this still needed?


For recording all potential out-of-range relocations, you can use --emit-relocs. It is used by some post-link analysis tools like some tools in the Linux kernel and BOLT.
You can extract SHT_RELA sections without the SHF_ALLOC flag from the linked output, and then analyze the R_X86_64_PC32 relocations. The output can be huge as --emit-relocs contains every input relocation. To make the output smaller, you can specify -g0 to drop debug info.
It is still an open issue that clang -g codegen may be slightly different from -g0 codegen in some cases, but as a ballpark figure it should do its job well:)

Nope, your patch does the trick. Thanks.
Unfortunately I don't have the bandwidth to analyze all the various services that have potential to overflow. So usage model is to help services owners identify issues for themselves with improved diagnostics.
One of the reasons I was suggesting various improvements to reporting errors. Downstream we have a self contained patch in Relocations.cpp that will print out layout when relocation overflow happens. It really speeds up doing first level root cause.

ayermolo abandoned this revision.Oct 28 2021, 10:41 AM

Thanks:)


I think companies shipping large executables may have similar needs. Internally someone asked me whether LLD can record a section but I feel that --emit-relocs is the right solution.
I am always concerned with whether we may spend too much code on a rather specific use case where an alternative existing functionality exists.
I'd still say that watching relocations is like shooting a different target. The build system can rearrange dependencies/object files and the max relocation distance may not be representative.
(Say the longest relocation can disappear in the next build and the second longest relocation in the previous build can become the longest one.)

Thanks:)


I think companies shipping large executables may have similar needs. Internally someone asked me whether LLD can record a section but I feel that --emit-relocs is the right solution.
I am always concerned with whether we may spend too much code on a rather specific use case where an alternative existing functionality exists.
I'd still say that watching relocations is like shooting a different target. The build system can rearrange dependencies/object files and the max relocation distance may not be representative.
(Say the longest relocation can disappear in the next build and the second longest relocation in the previous build can become the longest one.)

I understand we don't want to chase too specialized use cases, but Relocation overflows do happen to multiple users of LLD. One can just search stack overflow with people asking "What does this relocation overflow mean". Adding extra information on none-critical path that will tell user at a glance, without re-linking, and doing extra steps, what the state of the binary is and what is the source and destination of relocation does improve user experience. Instead of every user who experiences relocation overflow being forced to repeat same steps over and over, LLD can produce that information with a cost of one function in Relocations.cpp. It is less to do with chasing any particular relocation, but providing a high level overview of what went wrong and where to look next. For example if reloc was between .gcc_except and .data, one can look at sections between those to see if maybe something there that can be moved out of the way, or one of the other sections is abnormally large, etc.

Thanks:)


I think companies shipping large executables may have similar needs. Internally someone asked me whether LLD can record a section but I feel that --emit-relocs is the right solution.
I am always concerned with whether we may spend too much code on a rather specific use case where an alternative existing functionality exists.
I'd still say that watching relocations is like shooting a different target. The build system can rearrange dependencies/object files and the max relocation distance may not be representative.
(Say the longest relocation can disappear in the next build and the second longest relocation in the previous build can become the longest one.)

I understand we don't want to chase too specialized use cases, but Relocation overflows do happen to multiple users of LLD. One can just search stack overflow with people asking "What does this relocation overflow mean". Adding extra information on none-critical path that will tell user at a glance, without re-linking, and doing extra steps, what the state of the binary is and what is the source and destination of relocation does improve user experience. Instead of every user who experiences relocation overflow being forced to repeat same steps over and over, LLD can produce that information with a cost of one function in Relocations.cpp. It is less to do with chasing any particular relocation, but providing a high level overview of what went wrong and where to look next. For example if reloc was between .gcc_except and .data, one can look at sections between those to see if maybe something there that can be moved out of the way, or one of the other sections is abnormally large, etc.

Extra steps are unavoidable because out-of-range relocations lead to errors, not warnings, and errors suppress the output.
The current diagnostic already tells the user the relocated location and the referenced symbol (I added the referenced symbol).
I don't agree that recording the information in a section can improve the general user experience.
It might make problems easier to analyze for a targeted small group of developers who know the existence of an analyzer and know how to use it.
As I said, --emit-relocs can basically do the same job, just requiring the interested small group of developers making the tool in a different way.
I also disagree that the cost is just one function in Relocations.cpp. Implementing it needs changes in multiple places and the complexity may not justify it if another tool exists and does the same job.
(Last, some may have tendency guessing the root cause of an out-of-range relocation. That would sometimes confuse users as the root cause might be incorrect (a "-Wl,-z,notext" diagnostic I removed belongs to the same broad category).)

wenlei added a comment.Nov 4 2021, 6:06 PM

Hey @MaskRay, overall I agree that any complexity added into any tool needs justification. There's always a trade off - sometimes it's obvious, others less so. However, I think in this case there's perhaps a misunderstanding. There were also discussions about related topic on other patches, so I chatted with @ayermolo off patch. Let me try to clarify things a bit.

  1. Recording all potential out-of-range relocations you mentioned above. This is non-trivial complexity, and it's somewhat duplicated to --emit-reloc as you pointed out. I also agree that going after specific relocation isn't the right approach for dealing with this problem. But this is *not* what was proposed here or in other patches.
  1. Better diagnostics when out-of-range relocation happens. This is what Alex was trying to do here and in other patches. I think this is reasonable - it helps developers understand the problem quicker, and the implementation is also simple. There're different things we can do here: 1). print from/to section in addition to from/to archive; 2) print the full section layout on error.

Internally, we benefited a lot from printing out section layout when hitting out-of-range relocation. Yes, you can rebuild with -M added then sed to get the equivalent info. But having such info right there on error is just a much nicer workflow for error reporting or quick triaging. Note that LLVM/LLD is also a tool used by average non-compiler developers, so the knowledge it requires to get the equivalent info through -M + sed is honestly a high bar for many people, and yet compiler engineers don't often have the environment to easily rebuild and repro the issues. For this reason, I believe such improved diagnostics for print from/to section and section layout has value for others too. It is a diagnostics improvement for error case, not a new feature under a new switch. Print section layout will add some complexity, however it's a pretty mechanic change not touching any fundamental elements of linker. So I hope that this trade-off makes sense and can be accepted. Of course, different people may view the value of such diagnostics differently, but I hope there's empathy for accepting such improvements because 1) it has shown the need in some places, 2) improving diagnostics on error should be a net good thing for user-ability.

WDYT?

Hey @MaskRay, overall I agree that any complexity added into any tool needs justification. There's always a trade off - sometimes it's obvious, others less so. However, I think in this case there's perhaps a misunderstanding. There were also discussions about related topic on other patches, so I chatted with @ayermolo off patch. Let me try to clarify things a bit.

  1. Recording all potential out-of-range relocations you mentioned above. This is non-trivial complexity, and it's somewhat duplicated to --emit-reloc as you pointed out. I also agree that going after specific relocation isn't the right approach for dealing with this problem. But this is *not* what was proposed here or in other patches.

Thanks for understanding.

  1. Better diagnostics when out-of-range relocation happens. This is what Alex was trying to do here and in other patches. I think this is reasonable - it helps developers understand the problem quicker, and the implementation is also simple. There're different things we can do here: 1). print from/to section in addition to from/to archive; 2) print the full section layout on error.

For the archive name, I also find it useful. I have implemented it in D112518.
If there are cases missing, we can discuss from there.

Internally, we benefited a lot from printing out section layout when hitting out-of-range relocation. Yes, you can rebuild with -M added then sed to get the equivalent info. But having such info right there on error is just a much nicer workflow for error reporting or quick triaging. Note that LLVM/LLD is also a tool used by average non-compiler developers, so the knowledge it requires to get the equivalent info through -M + sed is honestly a high bar for many people, and yet compiler engineers don't often have the environment to easily rebuild and repro the issues. For this reason, I believe such improved diagnostics for print from/to section and section layout has value for others too. It is a diagnostics improvement for error case, not a new feature under a new switch. Print section layout will add some complexity, however it's a pretty mechanic change not touching any fundamental elements of linker. So I hope that this trade-off makes sense and can be accepted. Of course, different people may view the value of such diagnostics differently, but I hope there's empathy for accepting such improvements because 1) it has shown the need in some places, 2) improving diagnostics on error should be a net good thing for user-ability.

WDYT?

The description is not what this patch does. Do you mean D94141?
As I said, I do not think auto -M or section output behavior is useful.
It is also not much to ask the user to re-link with an option.

wenlei added a comment.Nov 4 2021, 9:40 PM

The description is not what this patch does. Do you mean D94141?

Yes, I was referring to that patch. It is all related to improving diagnostics on error.

It is also not much to ask the user to re-link with an option.

I don't agree. 1) When builds are done on build agents, it's often not easy to just do "re-link". Then for people to repro this, a full rebuild may be needed which could take hours; 2) as I said above, it's too much to ask for an average non-compiler developer to deal with all of this, to know the -M + sed trick; 3) toolchain maintainers often don't have the setup to redo those builds locally.

Let's use an analogy - you're a user of a product, say a phone app, and when the app crashed, wouldn't it be nice if there's an diagnostic report you can send to app developers right way? Isn't it a bad user experience if after the crash, what users have to do is to go to phone settings, do some trick, and repeat all you've done again to repro the crash? Ofc, it's doable. But isn't it better to make things easier for everyone? Toolchain is also a product that serves its users. I'm yet to be convinced why such a harmless diagnostic improvements is not acceptable, given it certainly helps some users.

As I said, I do not think auto -M or section output behavior is useful.

Well, for the reasons I mentioned above, evidently it's proven to be quite useful for us. It helped us to triage some known issues quickly without needing to rebuild and point people to proper solutions very efficiently. It routinely saves many people hours of extra work from extra build for repro. Also in case there's confusion, this is not unconditional auto -M, this is just better diagnostics when error actually happens.

The description is not what this patch does. Do you mean D94141?

Yes, I was referring to that patch. It is all related to improving diagnostics on error.

Then please move move the reply there. I'll copy my response there and delete this message.

It is also not much to ask the user to re-link with an option.

I don't agree. 1) When builds are done on build agents, it's often not easy to just do "re-link". Then for people to repro this, a full rebuild may be needed which could take hours; 2) as I said above, it's too much to ask for an average non-compiler developer to deal with all of this, to know the -M + sed trick; 3) toolchain maintainers often don't have the setup to redo those builds locally.

Let's use an analogy - you're a user of a product, say a phone app, and when the app crashed, wouldn't it be nice if there's an diagnostic report you can send to app developers right way? Isn't it a bad user experience if after the crash, what users have to do is to go to phone settings, do some trick, and repeat all you've done again to repro the crash? Ofc, it's doable. But isn't it better to make things easier for everyone? Toolchain is also a product that serves its users. I'm yet to be convinced why such a harmless diagnostic improvements is not acceptable, given it certainly helps some users.

The two scenarios are quite different. For an app crash, the crash involves user input and some non-determinism. There is often no good way reproducing every trace of it, so a coredump can be helpful.

A linker out-of-range error is different. It's expected that all input files don't immediate disappear so you can re-run the link.
The linker is deterministic (there could be LTO nondeterminism bugs but in an ideal world it is deterministic) so re-invoking can reproduce the issue.
Auto output section dump assumes that the output section sizes are important - but I am not sure.
To fix the issue you probably want more detailed information. That may require some "zoom-in" & "zoom-out" manual steps.
I don't think from just an output section dump you can identify the problem, except that you can tell the user "your output is too large".
Some domain specific knowledge may be involved like "some sections are known to cause more problems".
All information like this is too specific information which may serve your users well but may not be generic enough.

I want to highlight some points you may have neglected but are important to me:

  • We want option orthogonality. -M is kept as opt-in. Making it enabled in some diagnostics breaks orthogonality.
  • The design principle of diagnostics is to be sufficiently useful but not too verbose.
  • -M output may be too huge. On some systems there may be a limit on output size. Printing too much may cause hide other diagnostics. We all know that template errors can sometimes lead to pages of errors which are difficult to comprehend, right?

As I said, I do not think auto -M or section output behavior is useful.

Well, for the reasons I mentioned above, evidently it's proven to be quite useful for us. It helped us to triage some known issues quickly without needing to rebuild and point people to proper solutions very efficiently. It routinely saves many people hours of extra work from extra build for repro. Also in case there's confusion, this is not unconditional auto -M, this is just better diagnostics when error actually happens.

The description is not what this patch does. Do you mean D94141?

Yes, I was referring to that patch. It is all related to improving diagnostics on error.

Then please move move the reply there. I'll copy my response there and delete this message.

It is also not much to ask the user to re-link with an option.

I don't agree. 1) When builds are done on build agents, it's often not easy to just do "re-link". Then for people to repro this, a full rebuild may be needed which could take hours; 2) as I said above, it's too much to ask for an average non-compiler developer to deal with all of this, to know the -M + sed trick; 3) toolchain maintainers often don't have the setup to redo those builds locally.

Let's use an analogy - you're a user of a product, say a phone app, and when the app crashed, wouldn't it be nice if there's an diagnostic report you can send to app developers right way? Isn't it a bad user experience if after the crash, what users have to do is to go to phone settings, do some trick, and repeat all you've done again to repro the crash? Ofc, it's doable. But isn't it better to make things easier for everyone? Toolchain is also a product that serves its users. I'm yet to be convinced why such a harmless diagnostic improvements is not acceptable, given it certainly helps some users.

The two scenarios are quite different. For an app crash, the crash involves user input and some non-determinism. There is often no good way reproducing every trace of it, so a coredump can be helpful.

You missed the point. It's not about non-determinism. It's about *wasted effort* in rebuild/repro vs having some diagnostics readily available on first occurrence of the error.

A linker out-of-range error is different. It's expected that all input files don't immediate disappear so you can re-run the link.
The linker is deterministic (there could be LTO nondeterminism bugs but in an ideal world it is deterministic) so re-invoking can reproduce the issue.
Auto output section dump assumes that the output section sizes are important - but I am not sure.
To fix the issue you probably want more detailed information. That may require some "zoom-in" & "zoom-out" manual steps.
I don't think from just an output section dump you can identify the problem, except that you can tell the user "your output is too large".
Some domain specific knowledge may be involved like "some sections are known to cause more problems".
All information like this is too specific information which may serve your users well but may not be generic enough.

I want to highlight some points you may have neglected but are important to me:

  • We want option orthogonality. -M is kept as opt-in. Making it enabled in some diagnostics breaks orthogonality.

Maybe this is another misunderstanding, but what was proposed was "print the full section layout on error.", specifically for output sections, not -M. print -M on error is definitely too much as there could be too many symbols. But we found printing the output section layout to be helpful and not too verbose either.

  • The design principle of diagnostics is to be sufficiently useful but not too verbose.

I agree with the principals, and I do think it's important. But again, it wasn't -M, printing output section layout is much less verbose then -M. This addresses the one below too.

  • -M output may be too huge. On some systems there may be a limit on output size. Printing too much may cause hide other diagnostics. We all know that template errors can sometimes lead to pages of errors which are difficult to comprehend, right?

As I said, I do not think auto -M or section output behavior is useful.

Well, for the reasons I mentioned above, evidently it's proven to be quite useful for us. It helped us to triage some known issues quickly without needing to rebuild and point people to proper solutions very efficiently. It routinely saves many people hours of extra work from extra build for repro. Also in case there's confusion, this is not unconditional auto -M, this is just better diagnostics when error actually happens.