This is an archive of the discontinued LLVM Phabricator instance.

[Symbolizer][Debuginfo] Add debuginfod client to llvm-symbolizer.
ClosedPublic

Authored by noajshu on Nov 11 2021, 2:32 PM.

Details

Summary

Adds a fallback to use the debuginfod client library (386655) in findDebugBinary.
Fixed a cast of Erorr::success() to Expected<> in debuginfod library.
Added Debuginfod to Symbolize deps in gn.
Updates compiler-rt/lib/sanitizer_common/symbolizer/scripts/build_symbolizer.sh to include Debuginfod library to fix sanitizer-x86_64-linux breakage.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Nice and simple - though would be good to have test coverage.

noajshu updated this revision to Diff 388658.Nov 19 2021, 4:02 PM

Remove unnecessary buildIDToString call due to updates to D112758.

noajshu updated this revision to Diff 390207.Nov 28 2021, 9:01 AM

Add more help info for client tool and update lit tests.

noajshu updated this revision to Diff 390208.Nov 28 2021, 9:01 AM

Undo accidental diff update (sorry)

noajshu updated this revision to Diff 390213.Nov 28 2021, 9:57 AM

Add lit end-to-end tests of symbolizer debuginfod client support.

@dblaikie thanks for your suggestion to add tests. I just added regression test coverage checking that the symbolizer can query a local debuginfod server (using a python simple http server).
I also added unit tests to D112758 checking that the local cache is used correctly.
I made one small change to the client library, which is to respect the DEBUGINFOD_CACHE_PATH and DEBUGINFOD_TIMEOUT env variables in addition to DEBUGINFOD_URLS.
Please let me know if you have thoughts on the testing strategy or any other feedback on this diff stack (D112751, D112753, D112758, and D113717).
If this looks good I can update D112759 to use this same python test script.
Thanks!

noajshu updated this revision to Diff 390215.Nov 28 2021, 10:22 AM

Remove extraneous --keep-section=.notes in symbolizer debuginfod test.

Not looked at anything else in the patch stack, but I can more-or-less guess what is going on (storing the debug data on a remote server for the symbolizer and other tools to be able to look up?). If so, in general, this looks fine to me. My only slight concern is I wonder how many places (especially build bots) will exercise this test, since it requires a new dependency. Might be worth approaching some build bot owners and asking if they'd be willing to install curl as part of their setup.

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
390–393

It would be good if this failure path could be tested, unless this can be just said to be tested with the other generic symbolizer tests?

llvm/test/tools/llvm-symbolizer/debuginfod.py
2 ↗(On Diff #390215)

Not sure I follow which source you're referring to here...

8 ↗(On Diff #390215)
17 ↗(On Diff #390215)

What's this being used for?

23 ↗(On Diff #390215)
25 ↗(On Diff #390215)

I'm pretty sure this explicit return type isn't necessary.

Also, should it be server_path? (Also applies below)

39 ↗(On Diff #390215)

Perhaps worth including the actual code in this message, to make test debugging easier.

47 ↗(On Diff #390215)

Any particular reason not to put this at top-level like the other imports?

49–51 ↗(On Diff #390215)

It would help the RUN line above be more self-descriptive (and flexible for the future) if these were dashed arguments, rather than positionals.

55 ↗(On Diff #390215)

sys.exit is the normal way to exit with an exit code.

llvm/test/tools/llvm-symbolizer/lit.local.cfg
1 ↗(On Diff #390215)

You could avoid this by just renaming the test file to have a .test suffix.

Not looked at anything else in the patch stack, but I can more-or-less guess what is going on (storing the debug data on a remote server for the symbolizer and other tools to be able to look up?). If so, in general, this looks fine to me. My only slight concern is I wonder how many places (especially build bots) will exercise this test, since it requires a new dependency. Might be worth approaching some build bot owners and asking if they'd be willing to install curl as part of their setup.

I seem to recall some mention that this API was usable with a local cache even without the curl dependency? Perhaps a test using only a local cache (not needing a server or curl) could be added as well? (the test with curl/etc might be good to keep too - though perhaps it could be removed if the API functionality is already tested elsewhere - if it isn't/can't readily be tested elsewhere then here's fine too)

noajshu updated this revision to Diff 390826.Nov 30 2021, 2:49 PM
noajshu marked 10 inline comments as done.

Add end-to-end test of debuginfod client cache (curl not required). Updates to debuginfod.py testing helper script.

noajshu added a comment.EditedNov 30 2021, 3:00 PM

Thank you for the helpful comments @jhenderson , I just uploaded a change to address them.

Not looked at anything else in the patch stack, but I can more-or-less guess what is going on (storing the debug data on a remote server for the symbolizer and other tools to be able to look up?). If so, in general, this looks fine to me. My only slight concern is I wonder how many places (especially build bots) will exercise this test, since it requires a new dependency. Might be worth approaching some build bot owners and asking if they'd be willing to install curl as part of their setup.

I seem to recall some mention that this API was usable with a local cache even without the curl dependency? Perhaps a test using only a local cache (not needing a server or curl) could be added as well? (the test with curl/etc might be good to keep too - though perhaps it could be removed if the API functionality is already tested elsewhere - if it isn't/can't readily be tested elsewhere then here's fine too)

D112758 includes unit testing that verifies the local cache works without curl. I updated the end-to-end tests of this diff (D113717) and added an additional test that just hits the local cache and does not require Curl. I also added checks for the Curl-dependent test to verify that even after the server is taken down, the debuginfo is still accessible in the local cache.
Is it ok to keep all of these tests? If I had to pick between the end-to-end cache test here and the unit test in D112758, I would favor the unit test since it's more likely to be run on all bots. However the end-to-end test here is additionally checking that the debuginfod client has been properly integrated with the symbolizer.

I think it's a great idea to make sure libcurl will be tested on some buildbots. Since LLVM_ENABLE_CURL defaults to ON in D112753, llvm will be built with curl by default as long as it's found by cmake. It's possible the library is already installed on some buildbots. Do you know how to view the buildbot configuration scripts?

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
390–393

I agree. I've just added additional lit tests which hit this failure path.

llvm/test/tools/llvm-symbolizer/debuginfod.py
8 ↗(On Diff #390215)

Thanks, I just split this script into debuginfod.test and debuginfod.py to make it a bit easier to read. If there's a preference towards a single file I can instead merge them back together and use %s.

17 ↗(On Diff #390215)

I just added a comment at the top of this script to explain its purpose. The short version is we want to run an end-to-end test of the debuginfod client functionality of the symbolizer, and so we need a working debuginfod server running at the same time.

49–51 ↗(On Diff #390215)

Good idea, changed this!

Thank you for the helpful comments @jhenderson , I just uploaded a change to address them.

Not looked at anything else in the patch stack, but I can more-or-less guess what is going on (storing the debug data on a remote server for the symbolizer and other tools to be able to look up?). If so, in general, this looks fine to me. My only slight concern is I wonder how many places (especially build bots) will exercise this test, since it requires a new dependency. Might be worth approaching some build bot owners and asking if they'd be willing to install curl as part of their setup.

I seem to recall some mention that this API was usable with a local cache even without the curl dependency? Perhaps a test using only a local cache (not needing a server or curl) could be added as well? (the test with curl/etc might be good to keep too - though perhaps it could be removed if the API functionality is already tested elsewhere - if it isn't/can't readily be tested elsewhere then here's fine too)

D112758 includes unit testing that verifies the local cache works without curl. I updated the end-to-end tests of this diff (D113717) and added an additional test that just hits the local cache and does not require Curl. I also added checks for the Curl-dependent test to verify that even after the server is taken down, the debuginfo is still accessible in the local cache.
Is it ok to keep all of these tests? If I had to pick between the end-to-end cache test here and the unit test in D112758, I would favor the unit test since it's more likely to be run on all bots. However the end-to-end test here is additionally checking that the debuginfod client has been properly integrated with the symbolizer.

I'd probably be inclined to skip the testing that requires this standalone python server - and only test llvm-symbolizer via the non-curl-dependent local cache. The API surface area seems small enough that it's pretty unlikely that the unit tests for networked cases could pass, local cache testing of llvm-symbolizer could pass, but then networked llvm-symbolizer functionality would still be broken?

Oh... but the unit testing only tests the local cache - so this llvm-symbolizer test and python server is the only testing of the remote part? It'd be nice if there is lower level and higher level testing, that the lower level testing should be more comprehensive - if it'd be feasible/reasonable to test the API directly with networking functionality?

I think it's a great idea to make sure libcurl will be tested on some buildbots. Since LLVM_ENABLE_CURL defaults to ON in D112753, llvm will be built with curl by default as long as it's found by cmake. It's possible the library is already installed on some buildbots. Do you know how to view the buildbot configuration scripts?

Not sure, really :/ the buildbots are a bit haphazard.

llvm/test/tools/llvm-symbolizer/debuginfod.py
1 ↗(On Diff #390826)

Any file that isn't itself a test file (with RUN lines, etc) even if it isn't currently found by the lit config as a test, should go in an "Inputs" subdirectory.

I think it's a great idea to make sure libcurl will be tested on some buildbots. Since LLVM_ENABLE_CURL defaults to ON in D112753, llvm will be built with curl by default as long as it's found by cmake. It's possible the library is already installed on some buildbots. Do you know how to view the buildbot configuration scripts?

I'm afraid I don't know either. Perhaps ask on the llvm-dev mailing list?

llvm/test/tools/llvm-symbolizer/debuginfod-cache.test
5 ↗(On Diff #390826)

Does this string need to be this specific value? Might be worth a comment if it does.

10–11 ↗(On Diff #390826)

Use -NEXT suffixes here and below.

14 ↗(On Diff #390826)

Similar to my above comment: add a comment why this path is what it is.

llvm/test/tools/llvm-symbolizer/debuginfod.py
2 ↗(On Diff #390826)
8 ↗(On Diff #390215)

I personally am a big fan of keeping support files that are used in only a single test within the same file, as it means if the test is ever deleted/renamed, the support files aren't left lying around with no users. It also is easier to follow what the script is doing: by having them separate, it requires a developer to open up another file to see what a test is actually doing.

Note: An exception to the above is if a script is soon going to be used by other tests that don't yet exist.

17 ↗(On Diff #390215)

I was referring specificaly to the import functools which I didn't immediately see a usage of.

llvm/test/tools/llvm-symbolizer/debuginfod.test
3

If you're referring to addr.exe - that's not a program specific to one particular test: it's used by a fair number of tests (or at least used to be!). I'd therefore delete this comment, since it is misleading.

It would be useful to put a high-level comment explaining what this test is testing, so that the reason --strip-debug is used is explained somewhere.

7

Similar to my earlier comment, it would be useful to explain why this string is what it is.

15
16–17
18–19

Similar to above: explain the need for the llvm-objcopy command.

noajshu updated this revision to Diff 391955.Dec 5 2021, 8:05 PM
noajshu marked 10 inline comments as done.

Simplify testing.

Thanks for your suggestions on the end-to-end tests @jhenderson. Following a discussion here about testing strategy, I think we only need to test the integration point in the symbolizer here, which we achieve with the simpler test that just hits the cache. I have removed the more complicated python HTTP server test from this diff and moved it to D112759 where it was adapted to test the llvm-debuginfod-find client instead and incorporate your feedback to make it more self-documenting.

Without the bulkier test this diff is nice and small. How does it look to you?

noajshu updated this revision to Diff 391958.Dec 5 2021, 8:12 PM
noajshu marked 3 inline comments as done.

Explain the need for the llvm-objcopy command.

jhenderson added inline comments.Dec 6 2021, 1:08 AM
llvm/test/tools/llvm-symbolizer/debuginfod.test
11

I feel like runtime environment variables are usually prefixed with env. I suspect there's a good reason for this (maybe something Windows-specific?), so you probably need that too here, at a guess.

24
26

You don't use the --print-address option here.

noajshu updated this revision to Diff 392161.Dec 6 2021, 1:23 PM
noajshu marked an inline comment as done.

Prefix environment variables in lit test with env, and remove extraneous flag to llvm-symbolizer in test.

noajshu marked 2 inline comments as done.Dec 6 2021, 1:23 PM
jhenderson accepted this revision.Dec 7 2021, 1:06 AM

LGTM! Thanks for the work!

This revision is now accepted and ready to land.Dec 7 2021, 1:06 AM

LGTM! Thanks for the work!

Thanks for the helpful comments!

noajshu updated this revision to Diff 392415.Dec 7 2021, 8:13 AM

Due to D115131, which removed HTTPClient::initialize from InitLLVM, we need to initialize the client at the beginning of main() in llvm-symbolizer.cpp.

thakis added a subscriber: thakis.Dec 8 2021, 10:55 AM

Hm, I'd prefer if llvm-symbolizer (and almost all llvm tools) didn't make network requests. Could we have a new tool for this instead, so that the dep on Debuginfod could be in the tool instead of in llvm/lib/DebugInfo/Symbolize?

Thinking out loud a bit, maybe there should be a lib/DebuginfoCacheReader that can just read the debuginfod cache but not write on it, and then ~everything could depend on that without problems, and lib/Debuginfod would depend on that but could also write to the cache (and it could depend on libcurl and make http requests etc)? Then in a bot setup, the bot could call the "pull symbols" thing as a dedicated step at the start. And for devs, maybe there could be an llvm-debuginfod binary that can pull symbols as part of symbolizing.

(My main concern is that the current design feels that it's very easy to add a dep on an http library to arbitrary llvm tools, and most toolchain tools shouldn't make http requests.)

To expand a bit, for most tools it doesn't seem desirable to me if they can block randomly and need 1000x as long depending on if they need to download something or not.

It seems fine if the debugger makes http requests, or spawns a subprocess that makes http requests. But for most tools it doesn't seem desirable if they can block randomly and need 1000x as long depending on if they need to download something or not. It seems nicer to run the "get symbols" thing once and have all tools work consistently fast and without network access then.

@thakis Thanks for the suggestion. Although we may know the --objs which are going to be fed to llvm-symbolizer in advance, so we could run a "pull symbols" step just for those, we wouldn't know in advance which shared libraries the input (e.g. backtrace) will touch. So I think to make it work you'd need three steps -- symbolize and figure out what debuginfo is missing, then run the debuginfod fetch to download, and then symbolize again.

By the way, network requests are made only when built with LLVM_ENABLE_CURL and then only to those urls in the DEBUGINFOD_URLS environment variable. There is also a timeout controllable by DEBUGINFOD_TIMEOUT which sets an upper limit on the number of milliseconds before a fetch operation gives up.

noajshu updated this revision to Diff 392895.Dec 8 2021, 12:54 PM

Small fix to address buildbot failure.

vitalybuka added inline comments.
llvm/tools/llvm-symbolizer/CMakeLists.txt
13

Something with llvm::getCachedOrDownloadDebuginfo(llvm::ArrayRef<unsigned char>) is needed
https://lab.llvm.org/buildbot/#/builders/37/builds/9005

noajshu added inline comments.Dec 8 2021, 4:21 PM
llvm/tools/llvm-symbolizer/CMakeLists.txt
13

Thanks for pointing this out. Do you know if it might be related to the use of

typedef ArrayRef<uint8_t> BuildIDRef;
typedef SmallVector<uint8_t, 10> BuildID;

?

noajshu updated this revision to Diff 392972.Dec 8 2021, 4:22 PM

Re-upload to run builders again.

noajshu updated this revision to Diff 393010.Dec 8 2021, 6:55 PM

Fix typo in Expected takeError idiom

jhenderson added inline comments.Dec 9 2021, 2:38 AM
llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
265–266

Could we RAII this somehow, so that the explicit cleanup at the end isn't needed? Also, I below the typical user won't be using the debuginfod client, so maybe it should be lazily initialized?

noajshu updated this revision to Diff 393237.Dec 9 2021, 11:52 AM

Implement RAII for HTTPClient::cleanup using static destructor.
Verified destructor is called using lldb.

noajshu marked an inline comment as done.Dec 9 2021, 12:05 PM
noajshu added inline comments.
llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
265–266

Good idea, I've RAII'd the cleanup!

I would like to lazily initialize the client, but I'm not sure we can. There's a relevant discussion here about why we need to run curl_global_init at the beginning of main, when there is only one thread.
Although we could move the HTTPClient::init to one of llvm-symbolizer's functions, I'm concerned this function could get called from another program from within a thread. By calling at the beginning of main() we don't have to worry about that.

vitalybuka added inline comments.Dec 9 2021, 12:10 PM
llvm/tools/llvm-symbolizer/CMakeLists.txt
13

I don't know

noajshu updated this revision to Diff 393271.Dec 9 2021, 1:29 PM
noajshu marked an inline comment as done.

Add Debuginfod dep to llvm/utils/gn/secondary/llvm/lib/DebugInfo/Symbolize/BUILD.gn to attempt to fix buildbot failure

noajshu added inline comments.Dec 9 2021, 1:33 PM
llvm/tools/llvm-symbolizer/CMakeLists.txt
13

I noticed that shortly after I committed this in 02cc8d69, Nico Weber kindly ported the gn build with 0f865dc6. I think this may be what's missing to fix the sanitizer buildbot and I've added the corresponding edit below. Next time I commit I'll watch this bot closely to see if it recurs. Thanks again for pointing out the failure!

noajshu marked 2 inline comments as done.Dec 9 2021, 1:34 PM

Add Debuginfod dep to llvm/utils/gn/secondary/llvm/lib/DebugInfo/Symbolize/BUILD.gn to attempt to fix buildbot failure

That buldbot does not use .gn

noajshu added a comment.EditedDec 9 2021, 3:30 PM

Add Debuginfod dep to llvm/utils/gn/secondary/llvm/lib/DebugInfo/Symbolize/BUILD.gn to attempt to fix buildbot failure

That buldbot does not use .gn

Ok! What does the buildbot do / use?
I followed the instructions to reproduce the build failure, but the build fails for other reasons. (logs)

Add Debuginfod dep to llvm/utils/gn/secondary/llvm/lib/DebugInfo/Symbolize/BUILD.gn to attempt to fix buildbot failure

That buldbot does not use .gn

Ok! What does the buildbot do / use?
I followed the instructions to reproduce the build failure, but the build fails for other reasons. (logs)

Ah, I think I see the problem.

noajshu updated this revision to Diff 393329.Dec 9 2021, 4:21 PM

Add symbolizer symbols to global_symbols.txt.

noajshu edited the summary of this revision. (Show Details)Dec 9 2021, 4:21 PM
This revision was landed with ongoing or failed builds.Dec 9 2021, 4:25 PM
vitalybuka added inline comments.Dec 9 2021, 4:35 PM
compiler-rt/lib/sanitizer_common/symbolizer/scripts/global_symbols.txt
75 ↗(On Diff #393329)

we cannot do that
all allowed symbols are essentially glibc or sanitizer itself
but nothing will resolve these symbols

noajshu added inline comments.Dec 9 2021, 4:42 PM
compiler-rt/lib/sanitizer_common/symbolizer/scripts/global_symbols.txt
75 ↗(On Diff #393329)

Sorry, I'm just trying to understand what this does and I saw several commits added symbols to that list in the past.

Can you give me some idea of what the general problem is? Do I need to update the buildbot configuration since I'm linking the symbolizer against a new library (Debuginfod)? Or did the sanitizer detect a bug in my code?

vitalybuka added inline comments.Dec 9 2021, 4:53 PM
compiler-rt/lib/sanitizer_common/symbolizer/scripts/global_symbols.txt
75 ↗(On Diff #393329)

I guess bot will fail on later step.
global_symbols.txt it the list of unresolved symbols in this lib and we expect no llvm stuff there
sanitized user apps may link this lib and without need for any llvm stuff

sorry, that I didn't realized this before, but we just need to include Debuginfod into
llvm-project/compiler-rt/lib/sanitizer_common/symbolizer/scripts/build_symbolizer.sh

noajshu added inline comments.Dec 9 2021, 4:58 PM
compiler-rt/lib/sanitizer_common/symbolizer/scripts/global_symbols.txt
75 ↗(On Diff #393329)

Excellent, thank you so much! I will revert this commit, apply this change, remove my edits to the global_symbols.txt, and then re-push.

vitalybuka added inline comments.Dec 9 2021, 5:01 PM
compiler-rt/lib/sanitizer_common/symbolizer/scripts/global_symbols.txt
75 ↗(On Diff #393329)

Up to you, but I don't mind if you fix this in folloup patch.
Thank you!

noajshu updated this revision to Diff 393339.Dec 9 2021, 5:05 PM
noajshu added a reviewer: vitalybuka.

Remove global_symbols.txt edits, and add Debuginfod lib to compiler-rt/lib/sanitizer_common/symbolizer/scripts/build_symbolizer.sh

noajshu edited the summary of this revision. (Show Details)Dec 9 2021, 5:11 PM
vitalybuka reopened this revision.Dec 9 2021, 5:19 PM

to accept

This revision is now accepted and ready to land.Dec 9 2021, 5:19 PM
vitalybuka accepted this revision.Dec 9 2021, 5:21 PM

Thanks, compiler_rt LGTM.

This revision was landed with ongoing or failed builds.Dec 9 2021, 5:39 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2021, 5:39 PM
Herald added a subscriber: Restricted Project. · View Herald Transcript

@thakis Thanks for the suggestion. Although we may know the --objs which are going to be fed to llvm-symbolizer in advance, so we could run a "pull symbols" step just for those, we wouldn't know in advance which shared libraries the input (e.g. backtrace) will touch. So I think to make it work you'd need three steps -- symbolize and figure out what debuginfo is missing, then run the debuginfod fetch to download, and then symbolize again.

You could have a tool that pulls symbols for a binary and all the shared libraries it links, right?

A pattern I've seen over the years is that since we're all very busy, as soon as there's an easy explanation for why something is slow, people will jump to that explanation if something is slow. After this change, if someone says "llvm-symbolizer took several seconds", people will just say "well it probably pulled stuff off the network" instead of actually investigating.

If the concern of pulling stuff from the network is separated into its own binary, it's impossible to get confused about this.

Here is the relevant llvm-dev thread where we planned these changes.

Thanks for the link! Looks like there wasn't a lot of discussion on the thread. 1 and 3 seem like great goals. As said on this review, I think it's better if for step 2 there was a dedicated binary to pull debug info from the web, and if llvm-symbolizer (and other llvm tools) didn't start making network requests, but instead just loaded the symbols downloaded by the new tool.

Turns out this isn't even just theoretical: This _did_ add a dependency on libcurl to lld:

//lld/tools/lld:lld has an assert_no_deps entry:
  //llvm/lib/Debuginfod:Debuginfod
which fails for the dependency path:
  //lld/tools/lld:lld ->
  //lld/COFF:COFF ->
  //llvm/lib/DebugInfo/Symbolize:Symbolize ->
  //llvm/lib/Debuginfod:Debuginfod

The linker shouldn't do http requests, obviously. Let's revert this for now until at least the lld dep is figured out.

hans added a subscriber: hans.Dec 10 2021, 8:39 AM

Turns out this isn't even just theoretical: This _did_ add a dependency on libcurl to lld:

//lld/tools/lld:lld has an assert_no_deps entry:
  //llvm/lib/Debuginfod:Debuginfod
which fails for the dependency path:
  //lld/tools/lld:lld ->
  //lld/COFF:COFF ->
  //llvm/lib/DebugInfo/Symbolize:Symbolize ->
  //llvm/lib/Debuginfod:Debuginfod

The linker shouldn't do http requests, obviously. Let's revert this for now until at least the lld dep is figured out.

I'm not 100% sure that it's obvious the linker shouldn't make http requests if it can be used to improve diagnostics.

Though most of lld's symbolizing is on object files and I don't think it's expected that debug info for unlinked object files would be in a symbol server - but perhaps there are cases where lld might want to symbolize an address in a .so where symbolizing might still be useful (though arguably so - if it's system installed it's probably not super relevant to the user to know what source files the symbol came from - they're just dealing with the system library, they aren't the authors of it - and if it's something they authored it probably has local debug info that doesn't need looking up)

@noajshu Could you look into if/how lld uses the symbolizer and if there are use cases where it would have an positive effect on the user experience (improved diagnostic quality, etc) to lookup debug info on the symbol server? If there are no such cases, maybe having a separate entry point into the symbolizing library that statically does not use debuginfod would be good. The library dependency would still be there, but if a caller only uses the non-debuginfod entry point, nothing from the debuginfod library would be pulled in at link time.

@thakis Thanks for the suggestion. Although we may know the --objs which are going to be fed to llvm-symbolizer in advance, so we could run a "pull symbols" step just for those, we wouldn't know in advance which shared libraries the input (e.g. backtrace) will touch. So I think to make it work you'd need three steps -- symbolize and figure out what debuginfo is missing, then run the debuginfod fetch to download, and then symbolize again.

You could have a tool that pulls symbols for a binary and all the shared libraries it links, right?

yes, see D112759

A pattern I've seen over the years is that since we're all very busy, as soon as there's an easy explanation for why something is slow, people will jump to that explanation if something is slow. After this change, if someone says "llvm-symbolizer took several seconds", people will just say "well it probably pulled stuff off the network" instead of actually investigating.

If the concern of pulling stuff from the network is separated into its own binary, it's impossible to get confused about this.

Hi Nico, I just added you as a reviewer of D115500, which changes the build so that LLVM builds without libcurl by default, even if libcurl is present. This way there is no confusion.

Turns out this isn't even just theoretical: This _did_ add a dependency on libcurl to lld:

//lld/tools/lld:lld has an assert_no_deps entry:
  //llvm/lib/Debuginfod:Debuginfod
which fails for the dependency path:
  //lld/tools/lld:lld ->
  //lld/COFF:COFF ->
  //llvm/lib/DebugInfo/Symbolize:Symbolize ->
  //llvm/lib/Debuginfod:Debuginfod

The linker shouldn't do http requests, obviously. Let's revert this for now until at least the lld dep is figured out.

I'm not 100% sure that it's obvious the linker shouldn't make http requests if it can be used to improve diagnostics.

Though most of lld's symbolizing is on object files and I don't think it's expected that debug info for unlinked object files would be in a symbol server - but perhaps there are cases where lld might want to symbolize an address in a .so where symbolizing might still be useful (though arguably so - if it's system installed it's probably not super relevant to the user to know what source files the symbol came from - they're just dealing with the system library, they aren't the authors of it - and if it's something they authored it probably has local debug info that doesn't need looking up)

@noajshu Could you look into if/how lld uses the symbolizer and if there are use cases where it would have an positive effect on the user experience (improved diagnostic quality, etc) to lookup debug info on the symbol server? If there are no such cases, maybe having a separate entry point into the symbolizing library that statically does not use debuginfod would be good. The library dependency would still be there, but if a caller only uses the non-debuginfod entry point, nothing from the debuginfod library would be pulled in at link time.

Thanks for all these comments and sorry for the late response! I will respond more thoroughly to the points above soon. To avoid confusion I wanted to quickly mention something.
I believe there is an error in the GN build configuration which causes lld's dependency on Debuginfod.
Specifically the file
llvm/utils/gn/secondary/lld/COFF/BUILD.gn
adds the dependency on DebugInfo/Symbolize, but this is not in
lld/COFF/CMakeLists.txt
Perhaps this is gn line is leftover from a previous dependency which was updated in cmake but never updated in gn?
Removing it seems to resolve the issue building lld with gn when both this diff and rG2586c23b are applied.

noajshu updated this revision to Diff 393572.Dec 10 2021, 12:47 PM
noajshu edited the summary of this revision. (Show Details)

Fix lld's COFF lib gn script to remove unnecessary dependency on llvm/lib/DebugInfo/Symbolize. This should now be compatible with rG2586c23b.

Fix lld's COFF lib gn script to remove unnecessary dependency on llvm/lib/DebugInfo/Symbolize. This should now be compatible with rG2586c23b.

If the gn change can be committed/reviewed separately (before recommitting this change) then it probably should be.

Fix lld's COFF lib gn script to remove unnecessary dependency on llvm/lib/DebugInfo/Symbolize. This should now be compatible with rG2586c23b.

If the gn change can be committed/reviewed separately (before recommitting this change) then it probably should be.

Sounds good! Here it is: D115554.

@noajshu Could you look into if/how lld uses the symbolizer and if there are use cases where it would have an positive effect on the user experience (improved diagnostic quality, etc) to lookup debug info on the symbol server? If there are no such cases, maybe having a separate entry point into the symbolizing library that statically does not use debuginfod would be good. The library dependency would still be there, but if a caller only uses the non-debuginfod entry point, nothing from the debuginfod library would be pulled in at link time.

Good idea, thanks for the recommendation!
I've taken a look and lld. There is no usage of the definitions in Symbolize.cpp. This is unsurprising as removing the lib dep in D115554 did not break the gn build.
There is one file, lld/COFF/SymbolTable.cpp that includes "llvm/DebugInfo/Symbolize.h". This is a red herring, lld not use any of the Symbolize.h declarations. It actually just needs llvm/DebugInfo/DIContext.h which is included as
Symbolize.h -> SymbolizableModule.h -> DIContext.h.
So probably that line should be replaced with #include "llvm/DebugInfo/DIContext.h". I'm happy to add this to D115554.

Though most of lld's symbolizing is on object files and I don't think it's expected that debug info for unlinked object files would be in a symbol server - but perhaps there are cases where lld might want to symbolize an address in a .so where symbolizing might still be useful (though arguably so - if it's system installed it's probably not super relevant to the user to know what source files the symbol came from - they're just dealing with the system library, they aren't the authors of it - and if it's something they authored it probably has local debug info that doesn't need looking up)

I agree, there could be some use cases for debuginfod in lld, though it's probably not as natural target as (say) lldb.
Now that it's clear that lld does not use Symbolize, and therefore lld does not use Debuginfod, I propose llvm-dev as more appropriate place to discuss which tools should (or should not) use debuginfod in the future.
IIRC this is the most recent reply to the debuginfod RFC: https://lists.llvm.org/pipermail/llvm-dev/2021-October/153101.html
That would be a good place to follow up.
Thanks!

noajshu updated this revision to Diff 393606.Dec 10 2021, 3:34 PM

Remove lld gn build change, as it has been moved to D115554 at @dblaikie 's suggestion.

noajshu edited the summary of this revision. (Show Details)Dec 10 2021, 3:34 PM

I'm not 100% sure that it's obvious the linker shouldn't make http requests if it can be used to improve diagnostics.

It's very scary to me that you say that. Maybe we should have a in-person (well, VC) discussion about this change? To me, separation of concerns makes it fairly obvious that compilers and linkers etc shouldn't rely on an internet connection to work. There should be a separate binary that handles downloading stuff that can be run well in advance (or, not run). Kitchen-sync binaries/libraries etc tend to be not great to work with.

D115554 looks fine to me, independent of this change. I don't think it helps this change here much since another tool might add an accidental (or even intentional) dependency on DebugInfo/Symbollize again, like lld did in the past.

D115500 also looks fine to me in isolation, but it doesn't help with the overall design here.

I fairly strongy feel that there should be a library for just reading this debuginfod cache that doesn't do web requests that all tools could link to without concerns, and a separate library for writing to it and doing network requests, and that should be linked to from very few binaries.

This revision was landed with ongoing or failed builds.Dec 13 2021, 3:04 PM

I'm not 100% sure that it's obvious the linker shouldn't make http requests if it can be used to improve diagnostics.

It's very scary to me that you say that. Maybe we should have a in-person (well, VC) discussion about this change?

Happy to, if you like - my calendar's usually pretty open - either scheduled or impromptu is fine by me.

To me, separation of concerns makes it fairly obvious that compilers and linkers etc shouldn't rely on an internet connection to work. There should be a separate binary that handles downloading stuff that can be run well in advance (or, not run). Kitchen-sync binaries/libraries etc tend to be not great to work with.

D115554 looks fine to me, independent of this change. I don't think it helps this change here much since another tool might add an accidental (or even intentional) dependency on DebugInfo/Symbollize again, like lld did in the past.

D115500 also looks fine to me in isolation, but it doesn't help with the overall design here.

I fairly strongy feel that there should be a library for just reading this debuginfod cache that doesn't do web requests that all tools could link to without concerns, and a separate library for writing to it and doing network requests, and that should be linked to from very few binaries.

noajshu, I'm confused why this relanded with ongoing open discussion.

noajshu, I'm confused why this relanded with ongoing open discussion.

Hi @thakis , we achieved consensus on this approach with an RFC to llvm-dev. Since you're proposing to remove debuginfod could you please follow up with your concerns on llvm-dev?
Thanks!

noajshu, I'm confused why this relanded with ongoing open discussion.

Hi @thakis , we achieved consensus on this approach with an RFC to llvm-dev. Since you're proposing to remove debuginfod could you please follow up with your concerns on llvm-dev?
Thanks!

The thread had very few replies on it, the actual technical discussion seems to happen on this review.

thakis added a comment.Feb 3 2022, 3:38 PM

@dblaikie, @phosek, @noajshu, @hans and I met (online) last month and talked about this for a bit.

The summary was the following action items:

  • have Debuginfod have a callback and put http request in llvm-symbolizer, and make llvm-symbolizer instead of Debuginfod depend on libcurl (iirc @phosek offered to look at that)
  • maybe have an llvm-symbolizer mode that doesn't do requests
  • maybe have a way to explicitly request pre-downloads of debug info

Some notes about things mentioned during the meeting:

Concerts with the approach:

  • separation of concerns is generally good; compiler shouldn't depend on network
  • libcurl now makes it into the deps of llvm.so
      • some linux distros link it dynamically, so clang now depends on it transitively
    • consider llvm-symbolizer copied via adb to a device that has funky connectivity (easy to see this causing timeouts);
  • bootstrapping
  • parallelism, no write blocking if pre-read
    • would download in parallel atm

These concerns don't apply in all scenarios:

  • .o storage might networked anyways, so less a concern there
  • might have to pull stuff for dlopen()'d shared libraries
  • would be happy if libllvm.so didn't depend on libcurl
  • care about llvm-symbolizer and lldb
  • current setup: build system image on one machine, strip binaries, upload debug info to GCS, then runs tests based on image, and then at runtime downloads debug info on demand
    • can't pull all debug info for all binaries, too much storage
  • been using this model for 3 years now, seems to work fine
    • current thing not debuginfod based
  • is opt-in based on env vars, so is that really different from having your debug info on a fuse mount
  • (someone might add cmake var)

@dblaikie, @phosek, @noajshu, @hans and I met (online) last month and talked about this for a bit.

The summary was the following action items:

  • have Debuginfod have a callback and put http request in llvm-symbolizer, and make llvm-symbolizer instead of Debuginfod depend on libcurl (iirc @phosek offered to look at that)
  • maybe have an llvm-symbolizer mode that doesn't do requests

This is being implemented by @mysterymath as D118413 and D118665.

thakis added a comment.Feb 8 2022, 5:15 PM

Great, thanks!

Glad to see others had similar concerns at https://github.com/llvm/llvm-project/issues/52731 :)