This is an archive of the discontinued LLVM Phabricator instance.

[symbolizer] Support symbol lookup
ClosedPublic

Authored by sepavloff on May 3 2023, 9:13 AM.

Details

Summary

Recent versions of GNU binutils starting from 2.39 support symbol+offset
lookup in addition to the usual numeric address lookup. This change adds
symbol lookup to llvm-symbolize and llvm-addr2line.

Now llvm-symbolize behaves closer to GNU addr2line, - if the value specified
as address in command line or input stream is not a number, it is treated as
a symbol name. For example:

llvm-symbolize --obj=abc.so func_22
llvm-symbolize --obj=abc.so "CODE func_22"

This lookup is now supported only for functions. Specification with
offset is not supported yet.

Diff Detail

Event Timeline

sepavloff created this revision.May 3 2023, 9:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 9:13 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
sepavloff requested review of this revision.May 3 2023, 9:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 9:13 AM

I believe it would be much better not to add the new mode (SYMBOL), but to support the new way of specifying an address for CODE.

Please, don't forget to reflected the changes in llvm/docs/CommandGuide/llvm-symbolizer.rst and probably llvm/docs/CommandGuide/llvm-addr2line.rst.

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
252–254

I don't think we need a new point of divergence here. It's unlikely that anyone would rely on the tool to generate an error for such an input.

Completely agree with what @ikudrin said. We don't need to maintain backwards compatibility between versions of our tools if the old behaviour didn't make much sense (i.e. treating non-numbers as symbols is a good thing, and the old behaviour of just echoing the input wasn't particularly useful). I also don't think we need a new SYMBOL directive (assuming GNU addr2line doesn't support it anyway), unless there's no practical way to make symbols work in the CODE directive (but I don't see why there wouldn't be).

sepavloff updated this revision to Diff 537212.Jul 4 2023, 9:37 PM

Address reviewers' notes

  • Removed command SYMBOL,
  • Support symbol lookup in llvm-symbolizer as well.
jhenderson added inline comments.Jul 5 2023, 1:45 AM
llvm/include/llvm/DebugInfo/Symbolize/DIPrinter.h
66

I'm thinking this might be better named IsGNUStyle, since llvm-symbolizer has an --output-style option, and this variable controls how the output is formatted.

llvm/lib/DebugInfo/Symbolize/SymbolizableObjectFile.cpp
357

Nit: This should probably have braces, as it is a non-trivial statement (even if it is only a single statement technically). See https://llvm.org/docs/CodingStandards.html#don-t-use-braces-on-simple-single-statement-bodies-of-if-else-loop-statements for more details.

llvm/test/tools/llvm-symbolizer/Inputs/addr2.inp
1 ↗(On Diff #537212)

Please don't add a separate input file that is only used by one test.

The llvm-symbolizer tests could do with a bit of cleaning-up. Using a separate input file for text input is one such example.

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
487

This should be controlled by the output-style command-line option (see my comment elsewhere).

jhenderson added inline comments.Jul 5 2023, 1:45 AM
llvm/test/tools/llvm-symbolizer/output-style-empty-line.test
12 ↗(On Diff #537212)

I don't think these check-suffixes are particularly understandable any more, given the changed behaviour. I'd suggest instead renaming them all and moving the check patterns to be immediately after the group that uses them. Something like:

RUN: llvm-symbolizer -e %p/Inputs/addr.exe < %p/Inputs/addr.inp \
RUN:   | FileCheck %s --check-prefix=SYMB-LLVM

RUN: llvm-symbolizer --output-style=LLVM -e %p/Inputs/addr.exe < %p/Inputs/addr.inp \
RUN:   | FileCheck %s --check-prefix=SYMB-LLVM

SYMB-LLVM: x.c:14:0
SYMB-LLVM-EMPTY:
SYMB-LLVM-NEXT: some text2

RUN: llvm-symbolizer --output-style=GNU -e %p/Inputs/addr.exe < %p/Inputs/addr.inp \
RUN:   | FileCheck %s --check-prefix=SYMB-GNU

SYMB-GNU: x.c:14
SYMB-GNU-NEXT: some text2

RUN: llvm-addr2line -i -e %p/Inputs/addr.exe < %p/Inputs/addr.inp \
RUN:   | FileCheck %s --check-prefix=ADDR-GNU

RUN: llvm-addr2line --output-style=GNU -i -e %p/Inputs/addr.exe < %p/Inputs/addr.inp \
RUN:   | FileCheck %s --check-prefix=ADDR-GNU

ADDR-GNU: x.c:14
ADDR-GNU-NEXT: ??:0

RUN: llvm-addr2line --output-style=LLVM -i -e %p/Inputs/addr.exe < %p/Inputs/addr.inp \
RUN:   | FileCheck %s --check-prefix=ADDR-LLVM

ADDR-LLVM: x.c:14:0
ADDR-LLVM-EMPTY:
ADDR-LLVM-NEXT: ??:0
llvm/test/tools/llvm-symbolizer/output-style-json-code.test
49

I think you could make this a little more self descriptive, by changing your invalid input to something like "0 not a symbol name or number". That being said, the intent of this test, I think, was to use the exact same set of inputs as one that doesn't use JSON style.

llvm/test/tools/llvm-symbolizer/symbol-search.test
2

You're using a weird mix of comment markers in this file. The standard rules in newer tools tests are

  • ## for actual comments;
  • # for RUN and CHECK lines;
  • all comment markers should have a space between them and the rest of the line (e.g. # CHECK: or ## This is a comment).
  • optionally, if the test consists solely of things prefixed with a comment marker, you can drop one # from each of the above, but it's more common to have them than not.

In addition, this test should have a comment explaining what the test is testing, probably before this comment.

5

I can't tell what "PRFUNC" is supposed to stand for. It might be useful to add a comment before each test case explaining what that specific case is testing.

10

Why is this test case epeated twice?

18

It would be better for both "nonexistent" cases to have a suffix, rather than just the llvm-symbolizer one.

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
226
239

This whole area of code could do with some blank lines to aid readability. I would suggest that you have them between each group of related lines, where "related" means a comment, one if statement, and any lines strongly related to that if statement (e.g. variable declarations). For example, I'd add a line before the comment about 0x prefixes, and another one after the if's closing brace.

252–254

+1: I don't think we need to handle invalid input differently between llvm-symbolizer and llvm-addr2line specifically, as long as the updated behaviour makes sense.

ikudrin added inline comments.Jul 6 2023, 5:55 PM
llvm/lib/DebugInfo/Symbolize/DIPrinter.cpp
266–269

The conditions with fewer negations are much easier to understand.

sepavloff updated this revision to Diff 538109.Jul 7 2023, 5:59 AM

Address reviewers' notes

sepavloff marked 9 inline comments as done.Jul 7 2023, 6:10 AM
sepavloff added inline comments.
llvm/test/tools/llvm-symbolizer/output-style-empty-line.test
12 ↗(On Diff #537212)

Yes, it looks better. Thank you!

llvm/test/tools/llvm-symbolizer/symbol-search.test
2

Thank you for eplanations. I modernized also sym.test, it didn't follow these rules.

10

It should be llvm-addr2line and llvm-symbolizer variants. Fixed.

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
487

Initialization of Config.IsGNUStyle is now below in this function.

sepavloff updated this revision to Diff 538185.Jul 7 2023, 10:19 AM
sepavloff marked an inline comment as done.

Make response of llvm-symbolizer on invalid input with spaces identical to llvm-addr2line

sepavloff marked 2 inline comments as done.Jul 7 2023, 10:20 AM
sepavloff edited the summary of this revision. (Show Details)Jul 18 2023, 11:52 AM

Sorry for the delay - I was away last week, and am still catching up on reviews.

llvm/test/tools/llvm-symbolizer/flag-grouping.test
6 ↗(On Diff #538185)

Let's match the full line in all of these cases where the text has changed.

llvm/test/tools/llvm-symbolizer/symbol-search.test
2

Can I ask you to spin off sym.test's updates into a new review? I've got no objection to the update (sym.test itself is a little bit archaic), but it's not really related to this patch.

Also "when address" -> "when an address"

5

The individual sub cases within this test file could still do with some comments to make their purpose clearer. I think I follow it better now with the recent improvements, but more clarity would help.

10

I forgot to mention that I prefer - to _ in prefix names. It avoids weird awkardness like CODE_CMD-NEXT for example, and - is easier to type on an English keyboard :D

20

ADDR is probably not a good abbreviation for addr2line since it could be confused with address which is a a very relevant term to these tests. Perhaps A2L? Apologies for the churn.

sepavloff updated this revision to Diff 542971.Jul 21 2023, 9:40 AM

Address reviewer's notes

jhenderson added inline comments.Jul 25 2023, 12:42 AM
llvm/test/tools/llvm-symbolizer/sym.test
20 ↗(On Diff #542971)

I generally would avoid double blank lines. My personal rule is similar to what I follow in C++ code:

  1. No blank line if the comment is tightly linked to the immediately following block.
  2. One blank line if the comment is more general (e.g. it applies to a wider area of code/isn't really targeted at a specific block etc).
llvm/test/tools/llvm-symbolizer/symbol-search.test
8

Prefer a slightly longer form that says something like "Show that the "CODE" command supports search by symbol name." Same goes for below comments.

9

See my above comment: as this comment is tightly tied to the test case that immediately follows, it doesn't need a blank line (and regardless, blank lines generally don't have a comment marker, unless they are separating paragraphs within a longer comment).

14–15

I don't think you should reference GNU here - that's a motivation for the behaviour, but not for the test case for that behaviour. Rather, what this comment could say is "Show that llvm-addr2line and llvm-symbolizer accept symbol names on the command-line."

21
32

I think I might be getting confused between different patches, but looking at this again, my feeling is that we don't really want this divergence in behaviour between GNU and LLVM mode, unless there's a strong motivation for the LLVM behaviour style. I might be inclined to create a different precursor patch that unifies the two, so that they do what GNU addr2line does.

ikudrin added inline comments.Jul 31 2023, 10:05 PM
llvm/docs/ReleaseNotes.rst
166

Please remove the spaces in the blank line.

sepavloff updated this revision to Diff 555456.Sep 1 2023, 12:24 PM

Rebase the patch

@sepavloff, what's the situation with this patch? Are you planning on addressing review comments or similar, or waiting for an upstream patch to land etc etc?

@sepavloff, what's the situation with this patch? Are you planning on addressing review comments or similar, or waiting for an upstream patch to land etc etc?

All necessary prerequisites are landed, the patch is updated and is ready for review.

jhenderson added inline comments.Sep 5 2023, 1:03 AM
llvm/docs/ReleaseNotes.rst
166

Not addresssed?

llvm/include/llvm/DebugInfo/Symbolize/Symbolize.h
110

This is a bit of a nit-pick, but it is throwing me off a bit having both const std::string & and StringRef arguments for this overload of findSymbol. I see why you've done it (similarity with other overloaded functions above and below), but I feel like it should probably still use the correct and consistent style (StringRef) for both.

llvm/lib/DebugInfo/Symbolize/DIPrinter.cpp
384

It's possible I've missed it, but I don't see any test that shows this array has appropriate contents (I see a few where it is empty, but there needs to be one or more test cases that cover it having contents, at least one of which should cover it having more than one entry).

llvm/lib/DebugInfo/Symbolize/SymbolizableObjectFile.cpp
357

I'm a little concerned about the performance of this code. It might be a premature concern, but looping through a list of symbols to find one, for every input symbol, sounds like it could get expensive quite rapidly - if you were naively to use llvm-symbolizer to symbolize an entire object by using its listed symbol names, you'd end up with an n^2 operation, as every search has to go through the entire list.

The performance of the symbolizer code is considered important, hence why I'm bringing this up. I wonder if it's worth considering changing the std::vector used for Symbols into another container with more efficient searching performance?

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
238–239

Is there testing covering this failure for this specific case?

246–247

Ditto.

254–255

This needs a test case to show that if Opts.Demangle is false, the name isn't demangled.

llvm/test/tools/llvm-symbolizer/symbol-search.test
9

Not addressed. For a concrete example, I personally recommend the following format:

# Comment
RUN: ...
RUN: ...

CHECK: ...
CHEC-NEXT:

# Next test case comment
RUN: ...
14–15

Not addressed.

25
37
llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
239

StartsWithHexPrefix appears to be unused?

sepavloff updated this revision to Diff 556112.Sep 7 2023, 12:11 AM

Update patch

There look like there are a few of my previous comments that haven't been addressed yet?

llvm/test/tools/llvm-symbolizer/output-style-json-code.test
65
llvm/test/tools/llvm-symbolizer/symbol-search.test
13

You don't specify things "in" a command line, you specify them "on the command-line."

18

I'm not sure this comment needed changing. The previous version was good: "If symbol has a space in its name, ignore everything after it." though I might change "after it" to "from the space onwards."

You could add "Check that" to the start too, so the final thing might look like:
"Check that if a symbol has a space in its name, ignore everything from the space onwards."

27

"Show that if a symbol ..."

sepavloff marked 5 inline comments as done.Sep 7 2023, 3:06 AM
sepavloff added inline comments.
llvm/lib/DebugInfo/Symbolize/DIPrinter.cpp
384

At the end of output-style-json-code.test a test is added that checks the case of more than one entry.

llvm/lib/DebugInfo/Symbolize/SymbolizableObjectFile.cpp
357

This is a very important aspect of this solution. It however requires substantial code changes, so it is not practical to implement the performance enhancements in this patch. There are few users of this feature right now and GNU addr2line also uses linear search, so it is unlikely that performance problems would be noticed immediately.

Some possible applications need llvm-symbolizer for large files and many symbols, performance is critical for such cases, so a solution will be elaborated. It requires more efforts than just changing the implementation of Symbol, as it is already sorted by address. There are other points where the speed or memory consumption can be improved, they could be treated together.

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
238–239

It is unlikely that this code fails, because getOrCreateModuleInfo is called early to check existence and validity of binary file.

246–247

Actually this code was copied from symbolize*Common, it will fail exactly in the same cases when fail those functions.

254–255

Such tests are added at the end of symbol-search.test.

sepavloff updated this revision to Diff 556121.Sep 7 2023, 3:18 AM

Fix comments

sepavloff updated this revision to Diff 556193.Sep 7 2023, 12:00 PM

Add missing tests

sepavloff added inline comments.Sep 7 2023, 12:16 PM
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
238–239

I was wrong. If binary file name was not specified via --obj option, it will be extracted from command line arguments. In this case getOrCreateModuleInfo is called just in this code and may fail. The relevant test is added to symbol-search.test.

246–247

It seems this case (no error from getOrCreateModuleInfo but zero pointer to SymbolizableModule) cannot be realized.

Sorry for the delay - Phabricator issues and being busy meant I only just got back to this.

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
246–247

Okay, thanks for looking into it.

llvm/test/tools/llvm-symbolizer/symbol-search.test
13

I think canoncially it should be "command-line" not "command line" (this is based on what someone from our docs team once told me).

56

Strictly speaking, using the --obj option to provide the binary file is also specifying it on the command-line. I think this comment should be a little more specific (assuming it matters). I'd also use "input file" rather than "binary file" to clarify the file's purpose. Something like: "Show that both the symbol and input file can be specified in the search string on the command-line." or something like that.

Similar comments apply below.

62

Nit: here and below, add a space after >2 to make it stand out more.

sepavloff updated this revision to Diff 557294.Sep 25 2023, 2:03 AM

Update the patch

  • Fix comments in symbol-search.test,
  • Merge D149757,
  • Rebase.
jhenderson accepted this revision.Sep 26 2023, 12:00 AM

LGTM, thanks, but please give @ikudrin/@MaskRay/... a few days to make any other comments.

This revision is now accepted and ready to land.Sep 26 2023, 12:00 AM
This revision was landed with ongoing or failed builds.Oct 2 2023, 7:39 AM
This revision was automatically updated to reflect the committed changes.

The patch was reverted because the test llvm/test/Support/interrupts.test started failing. The fail is observed only on Windows. It looks like this is the test problem, it works incorrectly, if stderr is not empty. The fix is provided in https://github.com/llvm/llvm-project/pull/68556.