This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Add --why-extract= to query why archive members/lazy object files are extracted
ClosedPublic

Authored by MaskRay on Sep 10 2021, 12:38 AM.

Details

Summary

Similar to D69607 but for archive member extraction unrelated to GC. This patch adds --why-extract=.

Prior art:

GNU ld -M prints

Archive member included to satisfy reference by file (symbol)

a.a(a.o)                      main.o (a)
b.a(b.o)                      (b())

-M is mainly for input section/symbol assignment <-> output section mapping
(often huge output) and the information may appear ad-hoc.

Apple ld64

__Z1bv forced load of b.a(b.o)
_a forced load of a.a(a.o)

It doesn't say the reference file.

Arm's proprietary linker

Selecting member vsnprintf.o(c_wfu.l) to define vsnprintf.
...
Loading member vsnprintf.o from c_wfu.l.
              definition:  vsnprintf
              reference :  _printf_a

--why-extract= gives the user the full data (which is much shorter than GNU ld
-Map). It is easy to track a chain of references to one archive member with a
one-liner, e.g.

% ld.lld main.o a_b.a b_c.a c.a -o /dev/null --why-extract=- | tee stdout
reference       extracted       symbol
main.o  a_b.a(a_b.o)    a
a_b.a(a_b.o)    b_c.a(b_c.o)    b()
b_c.a(b_c.o)    c.a(c.o)        c()

% ruby -F'\t' -ane 'BEGIN{p={}}; p[$F[1]]=[$F[0],$F[2]] if $.>1; END{x="c.a(c.o)"; while y=p[x]; puts "#{y[0]} extracts #{x} to resolve #{y[1]}"; x=y[0] end}' stdout
b_c.a(b_c.o) extracts c.a(c.o) to resolve c()
a_b.a(a_b.o) extracts b_c.a(b_c.o) to resolve b()
main.o extracts a_b.a(a_b.o) to resolve a

Archive member extraction happens before --gc-sections, so this may not be a live path
under --gc-sections, but I think it is a good approximation in practice.

  • Specifying a file avoids output interleaving with --verbose.
  • Required = prevents accidental overwrite of an input if the user forgets =. (Most of compiler drivers' long options accept = but not )

Diff Detail

Event Timeline

MaskRay created this revision.Sep 10 2021, 12:38 AM
MaskRay requested review of this revision.Sep 10 2021, 12:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 10 2021, 12:38 AM

Some conflicting thoughts:

  1. Maybe this should be part of the --verbose (or possibly --trace) output?
  2. If not doing the above, maybe it should be in a separate file?
  3. If printing it to stdout is still desired, and not under another existing option, I think it might be better to group the output into one block somewhere. If you combine the new option with e.g. --verbose or --trace, I expect you'll get interleaved and potentially confusing output, esepcially as the table header will be divorced from the rest of the table content.
lld/test/ELF/why-extract.s
16–18

I'm not entirely sure what the right solution here is, but I think this output needs to be better tabulated, so that the columns line up (possibly with the exception of long names). Possibly alternatively, rather than using tabs to separate, how about some other symbol like ; or |?

Not had a chance to formulate a strong opinion yet, I think James has some interesting thoughts, but I expect a lot of this will come down to personal preference.

I can share the results of what Arm's proprietary linker does which has been useful. It essentially integrates this into the --verbose output, which does interleave with other verbose output but in this particular case it works out ok as it helps show where the reference and definitions are from each object. A hand edited example example, the library names are a little strange as they encode properties about the library for example .l is for little endian:

Selecting member vsnprintf.o(c_wfu.l) to define vsnprintf.
...
Loading member vsnprintf.o from c_wfu.l.
              definition:  vsnprintf
              reference :  _printf_a
               ...
Selecting member _printf_a.o(c_wfu.l) to define _printf_a.
...
Loading member _printf_a.o from c_wfu.l.
              definition:  _printf_a
              weak ref  :  _printf_fp_hex
              ...

While not complete as it only cites the first single symbol reference that is resolved by a library member, it can usually be used to trace back. The information is not tabulated so interleaving isn't as much of a problem.

pcc added a comment.Sep 10 2021, 9:33 AM

I'm not sure this should be considered an alternative to D69607. Maybe it is if you're not using --gc-sections, but with that flag we really need something like a breadth-first search like in that patch.

MaskRay edited the summary of this revision. (Show Details)EditedSep 10 2021, 9:52 AM

I'm not sure this should be considered an alternative to D69607. Maybe it is if you're not using --gc-sections, but with that flag we really need something like a breadth-first search like in that patch.

Right, I removed the reference to D69607 (like ld64 -why_live). I didn't look closely at D69607. Indeed it is about --gc-sections. This is for the archive member output in GNU ld -M.

MaskRay updated this revision to Diff 372157.Sep 12 2021, 9:47 PM
MaskRay edited the summary of this revision. (Show Details)

Change --why-extract= to be similar to --print-archive-stats=: use a file

MaskRay updated this revision to Diff 372162.Sep 12 2021, 10:13 PM
MaskRay retitled this revision from [ELF] Add --why-extract to query why an archive member/lazy object file is extracted to [ELF] Add --why-extract= to query why an archive member/lazy object file is extracted.
MaskRay edited the summary of this revision. (Show Details)

Add manpage

jhenderson added inline comments.Sep 13 2021, 1:28 AM
lld/ELF/Driver.cpp
2198

We should consider adding an equivalent check to this area for the new file, for the same reason as the other cases.

lld/ELF/Symbols.cpp
326

This function may be a bit misnamed now, since it doesn't actually do any printing. Perhaps recordWhyExtract?

lld/test/ELF/why-extract.s
2

Maybe add a case where there was nothing used from an archive, to show that the file is still created, with the header.

15

I might suggest changing < why.txt to --input-file=why.txt, but it's entirely up to you. My motivation is that when using the PowerShell termainl, the < character is reserved and not used for stdin for some reason, which means you can't copy and paste this sort of line to aid in debugging.

MaskRay updated this revision to Diff 372297.Sep 13 2021, 11:07 AM
MaskRay marked 4 inline comments as done.

comments

lld/test/ELF/why-extract.s
15

OK, that's why some FileCheck tests use --input-file=...

Most probably just consider this form more verbose than <, though...

Particularly in optimization and codegen test directories, opt < and llc < are really really common.

16–18

I pick \t because tsv files are common (awk one-linear just directly work on it without setting the field separator).

; or | separators are uncommon.

The terminal display probably doesn't matter that much?

MaskRay edited the summary of this revision. (Show Details)Sep 13 2021, 11:08 AM
jhenderson accepted this revision.Sep 14 2021, 12:04 AM

LGTM, but please wait for someone else too.

lld/test/ELF/why-extract.s
15

Fortunately, I don't work with opt or llc very often :)

16–18

It's mostly about making the table pretty, because as things stand, the columns will regularly not line up. However, if you think the machine parsing support is more useful, that's fine by me.

This revision is now accepted and ready to land.Sep 14 2021, 12:04 AM
peter.smith accepted this revision.Sep 14 2021, 5:34 AM

LGTM too.

Should there be a release notes entry for this? (Not sure if you prefer to add release notes in the diff itself or later after the corresponding release branch has been cut.)

MaskRay updated this revision to Diff 372585.EditedSep 14 2021, 4:34 PM
MaskRay retitled this revision from [ELF] Add --why-extract= to query why an archive member/lazy object file is extracted to [ELF] Add --why-extract= to query why archive members/lazy object files are extracted.
MaskRay edited the summary of this revision. (Show Details)
  • add a backward reference test
  • add an entry to lld/docs/ReleaseNotes.rst

I'll wait one week before committing.

While not checking --gc-sections, I think this is a quite good approximation.
In the past, I have used a chain of -t and -y sym to dig such information.

MaskRay edited the summary of this revision. (Show Details)Sep 20 2021, 9:44 AM

Thanks for working on this. In the past I have resorted to hacking in my own ad-hoc versions of such features to solve particularly horrible dependency cases. As mentioned above there's a case for more powerful dependency tracking e.g. https://reviews.llvm.org/D69607. The more of these debugging features available the better IMO :)

thakis added a subscriber: thakis.Oct 7 2021, 6:46 AM

(FYI: You likely know this but it's not mention further up: ld64.lld implements ld64's -why_load which I think does the same thing.)

(FYI: You likely know this but it's not mention further up: ld64.lld implements ld64's -why_load which I think does the same thing.)

(Yes, I checked ld64. I picked --why-extract= instead of --why-load= because extract (instead of load) is what Solaris ld and GNU ld's documentation say about archive member extraction.
There are 3 columns instead of 2 so that tracing a chain is easier. --whole-archive (-all_load) doesn't have dedicated output but it should be fine. Interleaved output of -why_load is another issue.)