Page MenuHomePhabricator

[llvm-objdump] Handle multiple syms at same addr in disassembly.
ClosedPublic

Authored by simon_tatham on Aug 10 2022, 9:28 AM.

Details

Summary

The main disassembly loop in llvm-objdump works by iterating through
the symbols in a code section, and for each one, dumping the range of
the section from that symbol to the next. If there's another symbol
defined at the same location, then that range will have length 0, and
llvm-objdump will skip over the symbol entirely.

As a result, llvm-objdump will only show the last of the symbols
defined at that address. Not only that, but the other symbols won't
even be checked against the --disassemble-symbol list. So if you
have two symbols foo and bar defined in the same place, then one
of --disassemble-symbol=foo and --disassemble-symbol=bar will
generate an error message and no disassembly.

I think a better approach in that situation is to prioritise display
of the symbol the user actually asked for. Also, if the user
specifically asks for disassembly of both of two symbols defined
at the same address, the best response I can think of is to
disassemble the code once, preceded by both symbol names.

This involves teaching llvm-objdump to be able to display more than
one symbol name at the head of a disassembled section, which also
makes it possible to implement a --show-all-symbols option to
display every symbol defined in the code, not just the most
preferred one at each address.

This change also turns out to fix a bug in which --disassemble-all
on a mixed Arm/Thumb ELF file would fail to switch disassembly states
between Arm and Thumb functions, because the mapping symbols were
accidentally ignored.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
jhenderson added inline comments.Aug 11 2022, 1:30 AM
llvm/tools/llvm-objdump/llvm-objdump.cpp
1595–1597

It's not immediately obvious to me why this line has changed. Could you explain please, as it likely means I've missed something.

1617

StringRef?

1689–1690

Is there a subtle behaviour change here if you have multiple symbols at the same address but different types (i.e. one is STT_OBJECT and one isn't, e.g. STT_FUNC)?

1695–1696

Same comment as above.

simon_tatham marked 11 inline comments as done.Aug 11 2022, 6:19 AM

I'm not the code owner of these disassembly tools

Several of the reviewers I picked for this patch were authors of the specific parts of the code that I was doing something complicated to. According to git, you were involved in setting up the --disassemble-symbols system in the first place, so I particularly value your input on whether there's any intentional feature of its semantics that I've broken without noticing :-)

(The onSymbolStart mechanism is the other one that I'm concerned about, because neither of its use cases is familiar to me. So I tried to pick reviewers who know something about that, as well.)

llvm/tools/llvm-objdump/llvm-objdump.cpp
1494–1497

Oops, good catch. Those braces were originally there to isolate some variables so that they weren't in scope for the rest of the enormous for-loop body. But by the time I finished developing the patch I'd removed all the variables local to the block, and somehow didn't notice that even in three last-minute pre-upload reviews of my own :-)

Now I look closely, the same thing happened to the other scope that decides which symbols to print. I'll fix that one too.

1501

Not trivially, because in the case where we have to demangle names, demangle() returns a newly made string and we have to have somewhere to store it.

I've changed it so that we have the vector of StringRef you wanted, and also a vector of std::string which is left empty when not in demangling mode, and the StringRefs point at the local vector or the original name elsewhere, as appropriate.

1511

I just meant that if a symbol specified by the user doesn't appear in the object file at all, then we're exempt from the need to display it. But I agree that's not 100% clear, or particularly important to highlight in this context. I'll remove the parenthesis.

1542

I agree! But I guessed that the widespread existing use of unsigned in similar cases in the LLVM code base (for example, this very loop where SI ranges up to Symbols.size()) was a local idiom that I'd be criticised for going against. I'm glad to see that the opposite is true ;-)

1595–1597

In the existing version of the loop, SI is incremented after the loop body runs, by the ++SI in the for statement itself. So throughout the loop body, SI points at the symbol we're currently disassembling, and SI+1 here indicates the next symbol, whose address marks the point where we're planning to finish this iteration of the disassembly loop.

In the new version, I've removed the ++SI in the for statement, and replaced it with code at the beginning of the loop body that advances SI past all the symbols defined at the same address. So after that code runs, the rest of the loop body sees SI already pointing at the first symbol defined at a later address.

1689–1690

Potentially, yes. Previously, llvm-objdump would pick just one of the symbols defined at the address, and base its decision on that symbol alone. With this patch, it will go through all of them, and spots any STT_OBJECT even if it's not the symbol last in the sorted list.

This is just the sort of thing I hoped to have a useful discussion about in order to decide what the behaviour should be, to avoid the risk of writing oodles of code to implement a complicated policy that we had no consensus on :-) so thanks for flagging it up.

What do you think we should do if an STT_FUNC and an STT_OBJECT occur at the same address? llvm-objdump's existing policy doesn't look particularly deliberate to me – it's an artefact of the code's previous lack of attention to collocated symbols. Perhaps it's nonetheless best to stick with the existing policy just for stability's sake, but if so, I'd prefer that we'd discussed other options before deciding that.

Other possibilities that spring to mind are to deliberately make STT_OBJECT highest priority (which is what's happening in this version of the code), or to make it lowest priority, or to choose based on some criterion like symbol index in the ELF file (go with whichever symbol was first/last in the actual object file's symtab). And maybe, whichever of those we do, emit a warning that flags up that we had to make an arbitrary decision that could have gone the other way.

What do you think?

(PS I hope you're not going to like the symtab index idea, because that information isn't preserved at all in SymbolInfoTy so it would take a load more plumbing :-)

simon_tatham marked 5 inline comments as done.

Addressed all review comments (I think) other than the question of STT_OBJECT priority versus other kinds of symbol.

jhenderson added inline comments.Aug 12 2022, 12:40 AM
llvm/tools/llvm-objdump/llvm-objdump.cpp
1501

I'd forgotten about the demangle aspect of this. As such, I'm not fussed whichever way you prefer to do it.

1505–1508

You can do this, right?

1689–1690

Looking at the comment block, my instinct says we should treat STT_FUNC as higher priority (possibly assuming it hasn't got size 0) and do regular disassembly. Having an STT_OBJECT/STT_COMMON symbol at the same address as an STT_FUNC symbol sounds like it's unlikely to ever occur in practice ("this code represents both a function and some data??"). If it does, I think it's reasonable to pick one style somewhat arbitrarily. The user can use --disassemble-symbol of the STT_OBJECT symbol if they want to disassemble it as data in this case, I think.

I'm less clear on the second block below about ARM mapping symbols, because I'm not familiar with how ARM mapping symbols are used, and therefore don't think I can make an informed decision on the right approach there.

simon_tatham added inline comments.Aug 12 2022, 3:00 AM
llvm/tools/llvm-objdump/llvm-objdump.cpp
1505–1508

I nearly did, but I wasn't confident that a StringRef to a std::string stored in a std::vector is guaranteed to stay valid if the std::vector has to resize itself. What if the std::string implementation stores short enough strings without a separate allocation, and the vector resize involves a realloc?

So instead I did something I'm sure is safe, which is to set up the entire vector of strings, commit to never modifying it again, and then start making StringRefs pointing into it.

(Another option I'd have been completely confident of would be to make a vector of unique_ptr<string>, so that even if the vector resizes, each pointed-to string stays put. That forces even more allocations, though.)

Note to self: I still need to review the tests, but will do that once the approach re. multiple symbols has been settled on and the tests updated accordingly.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1505–1508

Good point - keep the loop as-is. It turns out that the "Small String Optimization" that std::string can use under-the-hood, means that perhaps unintuitively, the string's data can be moved when the string itself is moved, rather than just the pointer to the data (see https://stackoverflow.com/questions/57723963/is-it-safe-to-store-the-pointer-to-the-data-of-a-stdstring).

simon_tatham marked 6 inline comments as done.Aug 15 2022, 3:03 AM
simon_tatham added inline comments.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1689–1690

OK, I'll change it to treat code symbols as higher priority than data.

As for mapping symbols, that's a use case I do know something about, and honestly, I think the simplest approach there is to stop checking for STT_OBJECT symbols at all, and just say that if there are mapping symbols in this section, we should use them, and not try to second-guess whether we think they're useful. So I think the best thing is to remove that loop completely, and replace it with a test of MappingSymbols.empty().

simon_tatham marked an inline comment as done.

Updated handling of STT_OBJECT symbols as discussed. Also added a comment about the confusing double loop setting up the demangled symbol names, since the next person might also wonder why it's not being done in one step.

Code changes basically look good, but I'm out of time for today to review the testing. Please make sure the testing covers all the new code paths, and that the changed behaviour we were discussing is also covered.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511

I feel like this should be simplfiable to a single line, possibly using something like SymNamesHere.insert(SymNamesHere.begin(), DemangledSymNamesHere.begin(), DemangledSymNamesHere.end()); although maybe it's unnecessarily complex. (Same goes for the loop immediately below in the else).

Added another test to check that the new data vs code priorities work.

I'd misunderstood the previous sorting criterion, it turned out. I
thought symbols at the same address were sorted by type. In fact
they're sorted by name, then by type. So out of a data and code
symbol, the alphabetically later one was previously winning!

In addition to me inline comments, I think one bit of testing you still need is showing whether the demangled or raw name is used for the priority ordering of symbol names.

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
1 ↗(On Diff #453011)

This is going to need some kind of REQUIRES directive, since it's testing disassembly.

4 ↗(On Diff #453011)

"Prior to D131589" is going to be pretty meaningless in the future. I'd just get rid of this whole sentence (and delete the "now" from the next sentence), or at least replace this bit with the more generic "Previously".

17–20 ↗(On Diff #453011)

Do you think it would be worth also showing which symbols are printed?

29–33 ↗(On Diff #453011)

I prefer to line up the values in a block so that they all start at the same column, like in the suggestion. It marginally helps with readability, I find.

llvm/test/tools/llvm-objdump/multiple-symbols.test
1 ↗(On Diff #453011)

I'd move your comment about what the test does to the start of the file here.

Also, this test will need a REQUIRES directive too, as it won't work if the build doesn't have the relevant target configured.

4 ↗(On Diff #453011)

I'd put blank lines between the closely-related groups here. For example, you could have the first two runs in one group, the next 6 in another and the remainder in a third.

I'd then label each group with a comment explaining what's special about that set of test cases.

21 ↗(On Diff #453011)

Nit: here and elsewhere, I think the canonical spelling is ARM.

29 ↗(On Diff #453011)

I'd suggest this fomatting for the groups, because it looks initially like you've just omitted a space after the comma!

33 ↗(On Diff #453011)

Rather than half a dozen different CHECK patterns, which are mostly duplicates, I'd consider using multiple check prefixes in each test case to enable/disable the relevant parts. For example, you'd have one prefix for each of the symbols, and then another prefix for each of the disassembly blocks. You could use an --implicit-check-not to the FileCheck commands to ensure other stuff isn't printed incorrectly, instead of the -NOT style too, but up to you.

Rough example:

# AMAP: 00000000 <$a.0>:
# AAAA: 00000000 <aaaa>:
# BBBB: 00000000 <bbbb>:
# CODE1: <first functions code>
# CODE1-NEXT: ...
...
# TMAP: ....
101–103 ↗(On Diff #453011)

Bearing in mind that --disassemble-symbol should already have testing elsewhere, what does this second block of code + function symbols give us that the first block alone doesn't?

136 ↗(On Diff #453011)

Delete this comment - it's obvious that it's the input file by virtue of it being YAML.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1583

The change of this logic should correspond to some sort of test case, I think, but I don't think I see anything?

I've left most of your comments un-replied-to so far, because I need to think harder about the choice of symbols to display, as mentioned in one of my inline comments below.

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
17–20 ↗(On Diff #453011)

Hmmm. Now I look more closely, there's still something a bit odd here.

The symbols that are printed are the ones beginning with B, in every case. (Exactly as the unmodified llvm-objdump would have done.) The presence of a non-data symbol at each location has caused the section contents to be disassembled as code, but the non-data symbol isn't winning the contest in every case to be the one printed. I wonder if that inconsistency might be confusing? If we think the code symbol is more important from a disassembly perspective, perhaps we should make it the one displayed, as well, for consistency?

Otherwise you end up printing a data symbol name followed by code, which looks confusing. I feel as if we ought to print a data symbol with data, or a code symbol with code, but not a confusing mixture. I'll have a rethink.

llvm/test/tools/llvm-objdump/multiple-symbols.test
21 ↗(On Diff #453011)

I do have to keep remembering that LLVM's canonical spelling isn't the same as Arm's canonical spelling. My fingers have a very strong habit of typing it the way we spell it, for obvious reasons!

llvm/tools/llvm-objdump/llvm-objdump.cpp
1583

It turned out that I had trouble thinking of something that would have changed as a result of removing this section!

The intention of the old code here is to avoid checking mapping symbols if we're starting disassembly at an STT_OBJECT symbol. But STT_OBJECT symbols are handled by the previous if statement by going to dumpELFData and then terminating this loop iteration, so it's difficult for one to get as far as here in the first place.

If there is any case that could have got here at all without being eaten by the previous test, it must be a confusing edge case of some kind and I haven't put my finger on it yet.

jhenderson added inline comments.Aug 18 2022, 12:27 AM
llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
17–20 ↗(On Diff #453011)

I agree - if we are disassembling as code, we should be printing code symbols. If there are multiple symbols to print (due to --disassemble-symbol etc), then if any of them are code, I think we should still print as code. However, if none of the code symbols are "selected" for printing, we should print as data, in my opinion.

In case there's any ambiguity, I do think we should pick the code symbols above the data symbols, both in choosing which symbol to pick and therefore how to disassemble a block of bytes.

It is probably worth taking a look at what GNU objdump does, and see if you can identify any behaviour that makes sense and we can conform to. Disassembly is one area where we diverge somewhat, but I think it might still be a useful reference point.

llvm/test/tools/llvm-objdump/multiple-symbols.test
21 ↗(On Diff #453011)

I mean, I'm going off what wikipedia (and several other websites) tells me is the spelling :-) Strange that Arm's official spelling according to the company is different! I'd be happy to go back to what you had before then!

simon_tatham marked 10 inline comments as done.Aug 19 2022, 2:35 AM

I think one bit of testing you still need is showing whether the demangled or raw name is used for the priority ordering of symbol names.

I'm not sure what you mean by that. Priority order of symbol names? The sorting order in Symbols is set up before anything gets demangled, so it will be based on the raw name, but that isn't changed by this patch.

llvm/test/tools/llvm-objdump/multiple-symbols.test
21 ↗(On Diff #453011)

There was a change of preference at some point in the past, and it's entirely possible that not everyone has caught up. But in recent years Arm's preferred spelling of its own name is "Arm".

(Obviously identifiers in source code have to match the existing spelling and all be consistent, but comments can be up to date!)

33 ↗(On Diff #453011)

I had a try at this, but I'm afraid I couldn't see how to make it test the things I want tested.

The problem is that the -NEXT suffix doesn't apply between different FileCheck prefixes. If I write, for example,

COMMON: some header
FOO-NEXT: line involving foo
BAR-NEXT: line involving bar

then I'd like --check-prefixes=COMMON,BAR to enforce that the bar line shows up immediately after the header line, and there isn't an intervening line of any kind. But in fact the BAR-NEXT check provokes an error message from FileCheck that there should have been a previous BAR check for it to be next to. It apparently means "must be on the next line from the previous check with the same prefix", not "... with any currently enabled prefix".

So I think if I converted this test into your suggested style, I'd lose the ability to have NEXT checks at all, so I'd have to have a pair of prefixes for each piece of output, denoting "this line is / is not expected to appear in the file" ...

# AMAP:     00000000 <$a.0>:
# AMAP-NOT: 00000000 <$a.0>:
# AAAA:     00000000 <aaaa>:
# AAAA-NOT: 00000000 <aaaa>:

and then each RUN line would have to have an absolutely enormous collection of check-prefixes specifying every single line it both did and didn't want.

I can give that a try if you really want me to, but are you sure it's clearer? The effect from my point of view is that all the details of what makes one test run different from another are now way off to the right and smushed into a long undistinguished list of keywords, and you have to cross-refer to the checks anyway to make sense of them, instead of laid out in a table of "here is what this test expects to see".

101–103 ↗(On Diff #453011)

It took me a day to remember what I'd been thinking here myself, so I agree it's unclear! The intention of having both an Arm and a Thumb function was to ensure that each one is disassembled in the right one of those states, because the mapping symbols that indicate the changeover are still reliably recognised, regardless of which subset of symbols is being displayed.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1583

Aha! There is an edge case affected by this change. If you set --disassemble-all to force disassembly of data sections, then the previous code would have had the side effect of ignoring mapping symbols in code sections, so you'd get Thumb code mistakenly disassembled as Arm.

The new criterion of "use mapping symbols if they're there" stops that failure from happening. I'll add a regression test for it.

simon_tatham marked 2 inline comments as done.
simon_tatham edited the summary of this revision. (Show Details)

Moved the check for data symbols to before we choose symbols to display, so that the same check can control which symbol is printed and how the data after it is disassembled.

Added a test for the changed behaviour of --disassemble-all, and tweaked comments and layout in existing tests for review comments.

I think one bit of testing you still need is showing whether the demangled or raw name is used for the priority ordering of symbol names.

I'm not sure what you mean by that. Priority order of symbol names? The sorting order in Symbols is set up before anything gets demangled, so it will be based on the raw name, but that isn't changed by this patch.

Your new test is about multiple symbols at the same location, and you specifically call out the alphatical sorting in a comment. That then immediately raises the question about whether demangled or mangled names are used. It's not the end of the world, since you rightly point out that this aspect hasn't changed, but I think it would still be useful to check (assuming of course it isn't already tested, anyway).

llvm/test/tools/llvm-objdump/ELF/ARM/disassemble-all.s
1 ↗(On Diff #453935)

Perhaps worth adding "mapping-symbols" to the test name, e.g. disassemble-all-mapping-symbols.s, since it's specifically the interaction of the two that's interesting.

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
18 ↗(On Diff #453935)

"is displayed before" sounds incomplete. Before what? Do you mean "is displayed first"? Also, missing full stop at end of sentence.

25–32 ↗(On Diff #453935)

Nit: the whitespace for indentation of these lines is inconsistent with the first block above. Please fix.

llvm/test/tools/llvm-objdump/multiple-symbols.test
33 ↗(On Diff #453011)

The problem is that the -NEXT suffix doesn't apply between different FileCheck prefixes.

I'm 95% certain that this is incorrect, as I just tested it out locally with the following test passing fine for me:

# RUN: echo foo > %t.txt
# RUN: echo baz >> %t.txt
# RUN: FileCheck %s --input-file=%t.txt --check-prefixes=FOO,BAZ

# FOO: foo
# BAR-NEXT: bar
# BAZ-NEXT: baz

Did you perhaps accidentally omit the COMMON from one of your FileCheck prefix sets? The only rule for -NEXT/-EMPTY commands is that there has to be one regular check (across all prefix sets) before the first -NEXT/-EMPTY.

101–103 ↗(On Diff #453011)

Perhaps worth additional comments then to explain this.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1518
simon_tatham marked 10 inline comments as done.Aug 19 2022, 7:07 AM
simon_tatham added inline comments.
llvm/test/tools/llvm-objdump/multiple-symbols.test
33 ↗(On Diff #453011)

You're right, it does work the way you say. I had indeed missed having an initial regular check, because I started off with a -NOT check, which doesn't count. But I was misled by FileCheck's error message: if I adjust your demo so that its first check is FOO-NOT, then I see

z.test:3:3: error: found 'BAZ-NEXT' without previous 'BAZ: line

which is what led me to think it worked the way I said!

llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511

I'm afraid I don't know enough about that kind of STL iterator idiom to see how you'd do it in the else loop, where you not only have to iterate over SymbolsHere but also extract the Name field of each one. You'd need some kind of lambda, or templated field extraction gadget, or something, surely?

simon_tatham marked an inline comment as done.

I think I've now addressed all your review comments, including adding a demonstration of alphabetical order vs demangling.

simon_tatham marked an inline comment as done.Aug 19 2022, 7:26 AM

(Ah, that's where the one last unticked Done box was hiding.)

jhenderson added inline comments.Aug 22 2022, 12:16 AM
llvm/test/tools/llvm-objdump/multiple-symbols-mangling.test
7 ↗(On Diff #453985)

I'm wondering if this is a case where a generated-from-assembly-using-llvm-mc input might be more appropriate. Given we already need the Arm target for the disassembly, and we don't need to control any fine details of the object really, I don't think you lose any coverage, and the input file would be simpler.

It might be worth looking to do the same at some of the other tests, though I haven't tried to figure out whether the assembly equivalent would be simpler.

14 ↗(On Diff #453985)

FWIW, I find _ characters in check prefixes weird. I'm also slightly hesitant, because it is easy enough to mistype an _ as - (but less likely the other way around). I'm not sure you lose much by switching to -.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511

You're looking for std::transform I believe in that case, though it's debatable whether it's easier to read, so what you've got is fine.

simon_tatham marked 3 inline comments as done.

Adjusted check prefixes, and translated all yaml2obj inputs into llvm-mc inputs (which in all cases fails with the old llvm-objdump, i.e. still produces an object file that successfully tests the changed behaviour).

llvm/test/tools/llvm-objdump/multiple-symbols-mangling.test
14 ↗(On Diff #453985)

I wasn't sure whether the use of - in the middle of the check prefix might conflict with the use of - to separate the semantic -NOT: and so forth at the end. But apparently that works fine.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511

Yes, I see, with a lambda to extract the field of each object. I agree it's nicer to leave it as it is :-)

jhenderson accepted this revision.Aug 22 2022, 3:43 AM

LGTM, but before pushing, probably worth giving others (@rafauler, @MaskRay in particular) a day or two to have another look.

This revision is now accepted and ready to land.Aug 22 2022, 3:43 AM
This revision was landed with ongoing or failed builds.Aug 24 2022, 7:08 AM
This revision was automatically updated to reflect the committed changes.