This is an archive of the discontinued LLVM Phabricator instance.

[Symbolize][DebugInfoD] Try looking up failed paths as Build IDs.
AbandonedPublic

Authored by mysterymath on Jan 25 2022, 12:18 PM.

Details

Summary

If a path given to the symbolizer cannot be loaded as a binary file, and
the path string is a 160 character hex, attempt to look up the binary
via DebugInfoD. This will first try cache directories, then an HTTP
lookup if built with libcurl.

This allows Build IDs to be provided to llvm-symbolizer anywhere a path
could be provided: in stdin, via flag, via args, or via environment
variables.

Diff Detail

Event Timeline

mysterymath created this revision.Jan 25 2022, 12:18 PM
mysterymath requested review of this revision.Jan 25 2022, 12:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 25 2022, 12:18 PM
phosek added inline comments.Jan 26 2022, 12:06 AM
llvm/docs/CommandGuide/llvm-symbolizer.rst
31

Nit: spelling.

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
522

Build ID doesn't have any prescribed length, and linkers typically support a number of hashing algorithms (see https://github.com/llvm/llvm-project/blob/3704abaa166bec865f56f48337a1732eab77dc68/lld/ELF/Driver.cpp#L764) so this isn't going to work.

mysterymath marked an inline comment as done.
  • Bail on small hex strings, not exact size.
  • Reworded comment.
  • Fixed typo.
mysterymath added inline comments.Jan 26 2022, 12:23 PM
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
522

From the link, it looks like the smallest presently supported would be MD5 and UUID, both at 128 bits.
Given that Build IDs are intended to be globally unique (right?), that seems like a reasonable minimum for this check.

It should be fairly unlikely to have a filename of >= 16 characters that is a valid hex identifier unless it's some kind of ID; we'd be into
aBadBeefCafeBabe territory.

mysterymath added inline comments.Jan 26 2022, 12:31 PM
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
522

Ah, but the --build_id flag can be used to stamp binaries with an arbitrary hex string as part of the build process.

So, the possibilities that immediately occur to me:

  1. Keep this as a heuristic check that captures "the usual" build IDs, but then you simply couldn't use this feature in llvm-symbolizer with your own custom build ID schema.
  2. Remove this check entirely, and do lookups on anything that could be a build ID. This could issue spurious lookups on paths like "bad", but only in the case where "bad" isn't a valid binary.
  3. Provide an in-band mechanism in the syntax of llvm-symbolizer for specifying whether the "path" component of a symoblization request is intended to be a path or a build ID. This would need some further design.
mysterymath marked an inline comment as not done.Jan 26 2022, 12:31 PM
mcgrathr added inline comments.Jan 26 2022, 5:16 PM
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
522

Build IDs can be of any size, and you've already overlooked one of the common hash algorithms (the one lld uses by default) in attempting to enumerate "expected" sizes, which should be a pretty strong sign that an enumeration of expected sizes might not be the best idea. It's for reasons like this that I don't think it's the best CLI here to conflate file names with build ID hex strings. It doesn't seem at all taxing to require a command-line switch to indicate a build ID just as the --obj / --exe / -e switch indicates a file name. For the stdin syntax, I think a new explicit syntax makes sense as well. It already accepts the optional "CODE", "DATA", or "FRAME" prefix before "filename address". So I think it should be fine enough to accept an optional "BUILDID" prefix between the optional "command" prefix and the filename (which given the prefix is interpreted as a build ID). Or something similar.

AFAICT the only environment variable case is using LLVM_SYMBOLIZER_OPTS to pass command-line switches. If you can use -e there today, then using the new build ID switch should be just as easy and require no additional work to support.

phosek added inline comments.Jan 26 2022, 6:21 PM
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
522

I agree with @mcgrathr, having an explicit command-line option and syntax is less error-prone and should be fairly straightforward to implement.

mysterymath abandoned this revision.Jan 27 2022, 11:11 AM
mysterymath added inline comments.
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
522

SGTM; I'll abandon this PR then, as the proposed mechanisms wouldn't look anything like this.

Adding another prefix to the syntax may take a bit of refactoring (I'll just have to try it.), but I already have a patch more-or-less ready for a command line switch. I'll send PRs that reference this one.