This is an archive of the discontinued LLVM Phabricator instance.

Differential D131589

[llvm-objdump] Handle multiple syms at same addr in disassembly.
ClosedPublic

Authored by simon_tatham on Aug 10 2022, 9:28 AM.

Details

Reviewers

scott.linder
jhenderson
aardappel
rochauha
sbc100
dschuff
rafauler
MaskRay

Commits

rG8e29f3f1c35a: [llvm-objdump] Handle multiple syms at same addr in disassembly.

Summary

The main disassembly loop in llvm-objdump works by iterating through
the symbols in a code section, and for each one, dumping the range of
the section from that symbol to the next. If there's another symbol
defined at the same location, then that range will have length 0, and
llvm-objdump will skip over the symbol entirely.

As a result, llvm-objdump will only show the last of the symbols
defined at that address. Not only that, but the other symbols won't
even be checked against the --disassemble-symbol list. So if you
have two symbols foo and bar defined in the same place, then one
of --disassemble-symbol=foo and --disassemble-symbol=bar will
generate an error message and no disassembly.
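
A minimal sketch of the old behaviour described above, not the actual llvm-objdump code. `Sym` and `headingsOldLoop` are hypothetical names, and the symbol list is assumed to be already sorted by address:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the tool's per-symbol record.
struct Sym {
  uint64_t Addr;
  std::string Name;
};

// Models the old loop: each symbol owns the range up to the next symbol
// (or the section end). A zero-length range is skipped, so only the last
// symbol at any given address ever produces a heading.
std::vector<std::string> headingsOldLoop(const std::vector<Sym> &Symbols,
                                         uint64_t SectionEnd) {
  assert(std::is_sorted(
      Symbols.begin(), Symbols.end(),
      [](const Sym &A, const Sym &B) { return A.Addr < B.Addr; }));
  std::vector<std::string> Headings;
  for (std::size_t SI = 0; SI < Symbols.size(); ++SI) {
    uint64_t Start = Symbols[SI].Addr;
    uint64_t End =
        SI + 1 < Symbols.size() ? Symbols[SI + 1].Addr : SectionEnd;
    if (Start >= End)
      continue; // Empty range: this symbol is never examined further.
    Headings.push_back(Symbols[SI].Name);
  }
  return Headings;
}
```

In this model, with `bar` and `foo` both at address 0 and `baz` at address 8, only `foo` and `baz` produce headings. `bar` is skipped before any filter could be applied to it.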

I think a better approach in that situation is to prioritise display
of the symbol the user actually asked for. Also, if the user
specifically asks for disassembly of both of two symbols defined
at the same address, the best response I can think of is to
disassemble the code once, preceded by both symbol names.

This involves teaching llvm-objdump to be able to display more than
one symbol name at the head of a disassembled section, which also
makes it possible to implement a --show-all-symbols option to
display every symbol defined in the code, not just the most
preferred one at each address.
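
One way to picture the selection policy proposed here is the sketch below. `namesToPrint` and its parameters are hypothetical, the input is assumed to be the group of names defined at a single address with the preferred one last, and the precedence between the requested-symbol list and the show-all flag is an assumption of the sketch rather than something the summary states:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// SymbolsHere: names of all symbols at one address, preferred one last.
// Requested:   names passed via --disassemble-symbol (empty if unused).
// ShowAll:     models --show-all-symbols.
std::vector<std::string>
namesToPrint(const std::vector<std::string> &SymbolsHere,
             const std::set<std::string> &Requested, bool ShowAll) {
  assert(!SymbolsHere.empty());
  std::vector<std::string> Out;
  if (!Requested.empty()) {
    // Prioritise what the user asked for; print every requested name.
    for (const std::string &Name : SymbolsHere)
      if (Requested.count(Name))
        Out.push_back(Name);
  } else if (ShowAll) {
    Out = SymbolsHere;
  } else {
    Out.push_back(SymbolsHere.back()); // Only the preferred symbol.
  }
  return Out;
}
```

An empty result means that a filter is active and nothing at this address matches it, so the caller would skip the address.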

This change also turns out to fix a bug in which --disassemble-all
on a mixed Arm/Thumb ELF file would fail to switch disassembly states
between Arm and Thumb functions, because the mapping symbols were
accidentally ignored.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

simon_tatham created this revision. Aug 10 2022, 9:28 AM

Herald added a reviewer: MaskRay. Aug 10 2022, 9:28 AM

Herald added a project: Restricted Project.

Herald added a subscriber: rupprecht.

simon_tatham requested review of this revision. Aug 10 2022, 9:28 AM

Herald added a project: Restricted Project. Aug 10 2022, 9:28 AM

Herald added subscribers: llvm-commits, StephenFan, aheejin.

Harbormaster completed remote builds in B180436: Diff 451520. Aug 10 2022, 11:45 AM

Thanks Simon, I think this patch makes a nice improvement to the behavior of objdump. I'm not the code owner of these disassembly tools, but the patch looks good to me. I would rather wait for the input of whoever is more familiar with objdump, though.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1494–1497	Curious about the style here where we define a scope to isolate a piece of computation. I don't think I've seen this in other LLVM files or in the coding style, so I'm curious what other people think about it. I do think it makes the code more clear to understand here, because it's only 3 lines. But on the other hand, it increases nesting on line 1509, which can actually make code harder to read (arguably).

When adding new tool options, please make sure to update the CommandGuide at llvm/docs/CommandGuide for the tool.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1494–1497	I'm not sure this style is used in LLVM, or at least not in the areas I'm familiar with, so I'd drop it.
1496	Nit: I think we prefer preincrement to post.
1501	Can this be a vector of `StringRef` to save the copying?
1511	"(that we can find at all)" - I'm not sure what this is saying. Do you even need it?
1533
1534
1537–1540	Not sure you need the parentheses, or the "//" emphasis marks. Also, "Arm" should be "ARM".
1542	`size_t` would be the more natural type for loops over `SymbolsHere`, since that's the value returned by `size()` and used by the index operator. Ditto below.
1595–1597	It's not immediately obvious to me why this line has changed. Could you explain please, as it likely means I've missed something.
1617	`StringRef`?
1689–1690	Is there a subtle behaviour change here if you have multiple symbols at the same address but different types (i.e. one is STT_OBJECT and one isn't, e.g. STT_FUNC)?

jhenderson added inline comments. Aug 11 2022, 1:30 AM

llvm/tools/llvm-objdump/llvm-objdump.cpp
1695–1696	Same comment as above.

In D131589#3714216, @rafauler wrote:

I'm not the code owner of these disassembly tools

Several of the reviewers I picked for this patch were authors of the specific parts of the code that I was doing something complicated to. According to git, you were involved in setting up the --disassemble-symbols system in the first place, so I particularly value your input on whether there's any intentional feature of its semantics that I've broken without noticing :-)

(The onSymbolStart mechanism is the other one that I'm concerned about, because neither of its use cases is familiar to me. So I tried to pick reviewers who know something about that, as well.)

llvm/tools/llvm-objdump/llvm-objdump.cpp
1494–1497	Oops, good catch. Those braces were originally there to isolate some variables so that they weren't in scope for the rest of the enormous for-loop body. But by the time I finished developing the patch I'd removed all the variables local to the block, and somehow didn't notice that even in three last-minute pre-upload reviews of my own :-) Now I look closely, the same thing happened to the other scope that decides which symbols to print. I'll fix that one too.
1501	Not trivially, because in the case where we have to demangle names, `demangle()` returns a newly made string and we have to have somewhere to store it. I've changed it so that we have the vector of `StringRef` you wanted, and also a vector of `std::string` which is left empty when not in demangling mode, and the StringRefs point at the local vector or the original name elsewhere, as appropriate.
1511	I just meant that if a symbol specified by the user doesn't appear in the object file at all, then we're exempt from the need to display it. But I agree that's not 100% clear, or particularly important to highlight in this context. I'll remove the parenthesis.
1542	I agree! But I guessed that the widespread existing use of `unsigned` in similar cases in the LLVM code base (for example, this very loop where `SI` ranges up to `Symbols.size()`) was a local idiom that I'd be criticised for going against. I'm glad to see that the opposite is true ;-)
1595–1597	In the existing version of the loop, `SI` is incremented after the loop body runs, by the `++SI` in the `for` statement itself. So throughout the loop body, `SI` points at the symbol we're currently disassembling, and `SI+1` here indicates the next symbol, whose address marks the point where we're planning to finish this iteration of the disassembly loop. In the new version, I've removed the `++SI` in the `for` statement, and replaced it with code at the beginning of the loop body that advances `SI` past all the symbols defined at the same address. So after that code runs, the rest of the loop body sees `SI` already pointing at the first symbol defined at a later address.
1689–1690	Potentially, yes. Previously, `llvm-objdump` would pick just one of the symbols defined at the address, and base its decision on that symbol alone. With this patch, it will go through all of them, and spot any STT_OBJECT even if it's not the last symbol in the sorted list. This is just the sort of thing I hoped to have a useful discussion about in order to decide what the behaviour should be, to avoid the risk of writing oodles of code to implement a complicated policy that we had no consensus on :-) so thanks for flagging it up. What do you think we should do if an STT_FUNC and an STT_OBJECT occur at the same address? llvm-objdump's existing policy doesn't look particularly deliberate to me – it's an artefact of the code's previous lack of attention to collocated symbols. Perhaps it's nonetheless best to stick with the existing policy just for stability's sake, but if so, I'd prefer that we'd discussed other options before deciding that. Other possibilities that spring to mind are to deliberately make STT_OBJECT highest priority (which is what's happening in this version of the code), or to make it lowest priority, or to choose based on some criterion like symbol index in the ELF file (go with whichever symbol was first/last in the actual object file's symtab). And maybe, whichever of those we do, emit a warning that flags up that we had to make an arbitrary decision that could have gone the other way. What do you think? (PS I hope you're not going to like the symtab index idea, because that information isn't preserved at all in `SymbolInfoTy` so it would take a load more plumbing :-)
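
The loop restructuring explained in the reply on lines 1595–1597 can be sketched roughly as follows. This is a simplified model, not the patch itself: `Range` and `rangesNewLoop` are hypothetical names, and symbols are reduced to a sorted list of addresses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One disassembled range plus the group of symbols that heads it.
struct Range {
  uint64_t Start, End;
  std::size_t FirstSym, NumSyms;
};

// Addrs: symbol addresses in ascending order.
// SectionEnd: address one past the end of the section.
std::vector<Range> rangesNewLoop(const std::vector<uint64_t> &Addrs,
                                 uint64_t SectionEnd) {
  assert(std::is_sorted(Addrs.begin(), Addrs.end()));
  std::vector<Range> Out;
  // Note: there is no ++SI in the for statement itself.
  for (std::size_t SI = 0; SI < Addrs.size();) {
    std::size_t First = SI;
    // Advance SI past every symbol defined at the same address.
    while (SI < Addrs.size() && Addrs[SI] == Addrs[First])
      ++SI;
    // SI now names the first symbol at a later address, so the range ends
    // at Addrs[SI], where the old loop would have used Addrs[SI + 1].
    uint64_t End = SI < Addrs.size() ? Addrs[SI] : SectionEnd;
    Out.push_back({Addrs[First], End, First, SI - First});
  }
  return Out;
}
```

For addresses `{0, 0, 8}` and a section end of 16, this yields two ranges: `[0, 8)` headed by two symbols and `[8, 16)` headed by one.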

Addressed all review comments (I think) other than the question of STT_OBJECT priority versus other kinds of symbol.

Harbormaster completed remote builds in B180658: Diff 451830. Aug 11 2022, 8:07 AM

jhenderson added inline comments. Aug 12 2022, 12:40 AM

llvm/tools/llvm-objdump/llvm-objdump.cpp
1501	I'd forgotten about the `demangle` aspect of this. As such, I'm not fussed whichever way you prefer to do it.
1505–1508	You can do this, right?
1689–1690	Looking at the comment block, my instinct says we should treat STT_FUNC as higher priority (possibly assuming it hasn't got size 0) and do regular disassembly. Having an STT_OBJECT/STT_COMMON symbol at the same address as an STT_FUNC symbol sounds like it's unlikely to ever occur in practice ("this code represents both a function and some data??"). If it does, I think it's reasonable to pick one style somewhat arbitrarily. The user can use `--disassemble-symbol` of the STT_OBJECT symbol if they want to disassemble it as data in this case, I think. I'm less clear on the second block below about ARM mapping symbols, because I'm not familiar with how ARM mapping symbols are used, and therefore don't think I can make an informed decision on the right approach there.

simon_tatham added inline comments. Aug 12 2022, 3:00 AM

llvm/tools/llvm-objdump/llvm-objdump.cpp
1505–1508	I nearly did, but I wasn't confident that a `StringRef` to a `std::string` stored in a `std::vector` is guaranteed to stay valid if the `std::vector` has to resize itself. What if the `std::string` implementation stores short enough strings without a separate allocation, and the vector resize involves a realloc? So instead I did something I'm sure is safe, which is to set up the entire vector of strings, commit to never modifying it again, and then start making `StringRef`s pointing into it. (Another option I'd have been completely confident of would be to make a vector of `unique_ptr<string>`, so that even if the vector resizes, each pointed-to string stays put. That forces even more allocations, though.)
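
The two-pass approach described in this reply can be sketched as below, with `std::string_view` standing in for `llvm::StringRef`. `toyDemangle` is a placeholder and not a real demangler, and `collectNames` is a hypothetical name; the point is only the ordering of the two passes.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Placeholder for a real demangler: it returns a newly made string, which
// is why some storage has to own the result.
std::string toyDemangle(const std::string &Raw) {
  return Raw.rfind("_Z", 0) == 0 ? Raw.substr(2) + "()" : Raw;
}

// Storage owns demangled names (left empty when not demangling).
// Views is what the printing code would read.
void collectNames(const std::vector<std::string> &Raw, bool Demangle,
                  std::vector<std::string> &Storage,
                  std::vector<std::string_view> &Views) {
  assert(Storage.empty() && Views.empty());
  if (Demangle) {
    // Pass 1: build every owned string. No views exist yet, so a vector
    // reallocation (which may move short strings' inline data) is harmless.
    for (const std::string &R : Raw)
      Storage.push_back(toyDemangle(R));
    // Pass 2: Storage is never modified again, so views into it stay valid.
    for (const std::string &D : Storage)
      Views.push_back(D);
  } else {
    // Not demangling: point straight at the original names.
    for (const std::string &R : Raw)
      Views.push_back(R);
  }
}
```

Interleaving the `push_back` into `Storage` with the creation of each view would be the single-loop version that the reply avoids.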

Note to self: I still need to review the tests, but will do that once the approach re. multiple symbols has been settled on and the tests updated accordingly.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1505–1508	Good point - keep the loop as-is. It turns out that the "Small String Optimization" that std::string can use under-the-hood, means that perhaps unintuitively, the string's data can be moved when the string itself is moved, rather than just the pointer to the data (see https://stackoverflow.com/questions/57723963/is-it-safe-to-store-the-pointer-to-the-data-of-a-stdstring).

simon_tatham marked 6 inline comments as done. Aug 15 2022, 3:03 AM

simon_tatham added inline comments.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1689–1690	OK, I'll change it to treat code symbols as higher priority than data. As for mapping symbols, that's a use case I do know something about, and honestly, I think the simplest approach there is to stop checking for `STT_OBJECT` symbols at all, and just say that if there are mapping symbols in this section, we should use them, and not try to second-guess whether we think they're useful. So I think the best thing is to remove that loop completely, and replace it with a test of `MappingSymbols.empty()`.
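
A hedged sketch of the "code symbols take priority over data symbols" rule agreed here. `SymType`, `Choice` and `chooseSymbol` are hypothetical names. Treating untyped symbols as code follows the comment in the data-vs-code-priority test, and using the same check to pick the displayed symbol follows a later update in this review; the exact tie-breaking among several code symbols is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class SymType { Func, NoType, Object };

struct Sym {
  std::string Name;
  SymType Type;
};

struct Choice {
  std::size_t DisplayIndex; // Which symbol heads the output.
  bool DumpAsData;          // True only if no code symbol is present.
};

// SymbolsHere: all symbols at one address, in sorted order, where the last
// entry is the one that would traditionally have been preferred.
Choice chooseSymbol(const std::vector<Sym> &SymbolsHere) {
  assert(!SymbolsHere.empty());
  Choice C{SymbolsHere.size() - 1, true};
  for (std::size_t I = 0; I < SymbolsHere.size(); ++I) {
    if (SymbolsHere[I].Type != SymType::Object) {
      C.DisplayIndex = I; // A later code symbol overrides an earlier one.
      C.DumpAsData = false;
    }
  }
  return C;
}
```

With `A2object` and `B2function` at the same address, the function is chosen and the bytes are disassembled as code; with only data symbols present, the last one is shown and the bytes are dumped as data.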

Updated handling of STT_OBJECT symbols as discussed. Also added a comment about the confusing double loop setting up the demangled symbol names, since the next person might also wonder why it's not being done in one step.

Harbormaster completed remote builds in B181241: Diff 452618. Aug 15 2022, 3:39 AM

Code changes basically look good, but I'm out of time for today to review the testing. Please make sure the testing covers all the new code paths, and that the changed behaviour we were discussing is also covered.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511	I feel like this should be simplifiable to a single line, possibly using something like `SymNamesHere.insert(SymNamesHere.begin(), DemangledSymNamesHere.begin(), DemangledSymNamesHere.end());` although maybe it's unnecessarily complex. (Same goes for the loop immediately below in the `else`).

Added another test to check that the new data vs code priorities work.

I'd misunderstood the previous sorting criterion, it turned out. I
thought symbols at the same address were sorted by type. In fact
they're sorted by name, then by type. So out of a data and code
symbol, the alphabetically later one was previously winning!

Herald added a subscriber: emaste. Aug 16 2022, 7:54 AM

Harbormaster completed remote builds in B181532: Diff 453011. Aug 16 2022, 8:42 AM

In addition to my inline comments, I think one bit of testing you still need is showing whether the demangled or raw name is used for the priority ordering of symbol names.

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
1 ↗	(On Diff #453011)	This is going to need some kind of `REQUIRES` directive, since it's testing disassembly.
4 ↗	(On Diff #453011)	"Prior to D131589" is going to be pretty meaningless in the future. I'd just get rid of this whole sentence (and delete the "now" from the next sentence), or at least replace this bit with the more generic "Previously".
17–20 ↗	(On Diff #453011)	Do you think it would be worth also showing which symbols are printed?
29–33 ↗	(On Diff #453011)	I prefer to line up the values in a block so that they all start at the same column, like in the suggestion. It marginally helps with readability, I find.
llvm/test/tools/llvm-objdump/multiple-symbols.test
1 ↗	(On Diff #453011)	I'd move your comment about what the test does to the start of the file here. Also, this test will need a `REQUIRES` directive too, as it won't work if the build doesn't have the relevant target configured.
4 ↗	(On Diff #453011)	I'd put blank lines between the closely-related groups here. For example, you could have the first two runs in one group, the next 6 in another and the remainder in a third. I'd then label each group with a comment explaining what's special about that set of test cases.
21 ↗	(On Diff #453011)	Nit: here and elsewhere, I think the canonical spelling is ARM.
29 ↗	(On Diff #453011)	I'd suggest this formatting for the groups, because it looks initially like you've just omitted a space after the comma!
33 ↗	(On Diff #453011)	Rather than half a dozen different CHECK patterns, which are mostly duplicates, I'd consider using multiple check prefixes in each test case to enable/disable the relevant parts. For example, you'd have one prefix for each of the symbols, and then another prefix for each of the disassembly blocks. You could use an `--implicit-check-not` to the FileCheck commands to ensure other stuff isn't printed incorrectly, instead of the `-NOT` style too, but up to you. Rough example:
	# AMAP: 00000000 <$a.0>:
	# AAAA: 00000000 <aaaa>:
	# BBBB: 00000000 <bbbb>:
	# CODE1: <first functions code>
	# CODE1-NEXT: ...
	...
	# TMAP: ....
101–103 ↗	(On Diff #453011)	Bearing in mind that --disassemble-symbol should already have testing elsewhere, what does this second block of code + function symbols give us that the first block alone doesn't?
136 ↗	(On Diff #453011)	Delete this comment - it's obvious that it's the input file by virtue of it being YAML.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1583	The change of this logic should correspond to some sort of test case, I think, but I don't think I see anything?

I've left most of your comments un-replied-to so far, because I need to think harder about the choice of symbols to display, as mentioned in one of my inline comments below.

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
17–20 ↗	(On Diff #453011)	Hmmm. Now I look more closely, there's still something a bit odd here. The symbols that are printed are the ones beginning with `B`, in every case. (Exactly as the unmodified llvm-objdump would have done.) The presence of a non-data symbol at each location has caused the section contents to be disassembled as code, but the non-data symbol isn't winning the contest in every case to be the one printed. I wonder if that inconsistency might be confusing? If we think the code symbol is more important from a disassembly perspective, perhaps we should make it the one displayed, as well, for consistency? Otherwise you end up printing a data symbol name followed by code, which looks confusing. I feel as if we ought to print a data symbol with data, or a code symbol with code, but not a confusing mixture. I'll have a rethink.
llvm/test/tools/llvm-objdump/multiple-symbols.test
21 ↗	(On Diff #453011)	I do have to keep remembering that LLVM's canonical spelling isn't the same as Arm's canonical spelling. My fingers have a very strong habit of typing it the way we spell it, for obvious reasons!
llvm/tools/llvm-objdump/llvm-objdump.cpp
1583	It turned out that I had trouble thinking of something that would have changed as a result of removing this section! The intention of the old code here is to avoid checking mapping symbols if we're starting disassembly at an `STT_OBJECT` symbol. But `STT_OBJECT` symbols are handled by the previous if statement by going to `dumpELFData` and then terminating this loop iteration, so it's difficult for one to get as far as here in the first place. If there is any case that could have got here at all without being eaten by the previous test, it must be a confusing edge case of some kind and I haven't put my finger on it yet.

jhenderson added inline comments. Aug 18 2022, 12:27 AM

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
17–20 ↗	(On Diff #453011)	I agree - if we are disassembling as code, we should be printing code symbols. If there are multiple symbols to print (due to --disassemble-symbol etc), then if any of them are code, I think we should still print as code. However, if none of the code symbols are "selected" for printing, we should print as data, in my opinion. In case there's any ambiguity, I do think we should pick the code symbols above the data symbols, both in choosing which symbol to pick and therefore how to disassemble a block of bytes. It is probably worth taking a look at what GNU objdump does, and see if you can identify any behaviour that makes sense and we can conform to. Disassembly is one area where we diverge somewhat, but I think it might still be a useful reference point.
llvm/test/tools/llvm-objdump/multiple-symbols.test
21 ↗	(On Diff #453011)	I mean, I'm going off what wikipedia (and several other websites) tells me is the spelling :-) Strange that Arm's official spelling according to the company is different! I'd be happy to go back to what you had before then!

In D131589#3728128, @jhenderson wrote:

I think one bit of testing you still need is showing whether the demangled or raw name is used for the priority ordering of symbol names.

I'm not sure what you mean by that. Priority order of symbol names? The sorting order in Symbols is set up before anything gets demangled, so it will be based on the raw name, but that isn't changed by this patch.

llvm/test/tools/llvm-objdump/multiple-symbols.test
21 ↗	(On Diff #453011)	There was a change of preference at some point in the past, and it's entirely possible that not everyone has caught up. But in recent years Arm's preferred spelling of its own name is "Arm". (Obviously identifiers in source code have to match the existing spelling and all be consistent, but comments can be up to date!)
33 ↗	(On Diff #453011)	I had a try at this, but I'm afraid I couldn't see how to make it test the things I want tested. The problem is that the `-NEXT` suffix doesn't apply between different FileCheck prefixes. If I write, for example,
	COMMON: some header
	FOO-NEXT: line involving foo
	BAR-NEXT: line involving bar
then I'd like `--check-prefixes=COMMON,BAR` to enforce that the bar line shows up immediately after the header line, and there isn't an intervening line of any kind. But in fact the `BAR-NEXT` check provokes an error message from FileCheck that there should have been a previous `BAR` check for it to be next to. It apparently means "must be on the next line from the previous check with the same prefix", not "... with any currently enabled prefix". So I think if I converted this test into your suggested style, I'd lose the ability to have `NEXT` checks at all, so I'd have to have a pair of prefixes for each piece of output, denoting "this line is / is not expected to appear in the file" ...
	# AMAP: 00000000 <$a.0>:
	# AMAP-NOT: 00000000 <$a.0>:
	# AAAA: 00000000 <aaaa>:
	# AAAA-NOT: 00000000 <aaaa>:
and then each RUN line would have to have an absolutely enormous collection of check-prefixes specifying every single line it both did and didn't want. I can give that a try if you really want me to, but are you sure it's clearer? The effect from my point of view is that all the details of what makes one test run different from another are now way off to the right and smushed into a long undistinguished list of keywords, and you have to cross-refer to the checks anyway to make sense of them, instead of laid out in a table of "here is what this test expects to see".
101–103 ↗	(On Diff #453011)	It took me a day to remember what I'd been thinking here myself, so I agree it's unclear! The intention of having both an Arm and a Thumb function was to ensure that each one is disassembled in the right one of those states, because the mapping symbols that indicate the changeover are still reliably recognised, regardless of which subset of symbols is being displayed.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1583	Aha! There is an edge case affected by this change. If you set `--disassemble-all` to force disassembly of data sections, then the previous code would have had the side effect of ignoring mapping symbols in code sections, so you'd get Thumb code mistakenly disassembled as Arm. The new criterion of "use mapping symbols if they're there" stops that failure from happening. I'll add a regression test for it.
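
For readers unfamiliar with mapping symbols: in Arm ELF objects, symbols named `$a`, `$t` and `$d` mark where Arm code, Thumb code and data begin. A minimal sketch of an address lookup over such a table follows; `MappingSymbol` and `mappingKindAt` are hypothetical names, and the table layout is an assumption rather than a copy of the llvm-objdump code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// (address, kind), where kind is 'a' (Arm), 't' (Thumb) or 'd' (data).
using MappingSymbol = std::pair<uint64_t, char>;

// Returns the kind of the last mapping symbol at or before Addr, or '\0'
// if no mapping symbol precedes Addr (including when the table is empty).
char mappingKindAt(const std::vector<MappingSymbol> &Sorted, uint64_t Addr) {
  assert(std::is_sorted(Sorted.begin(), Sorted.end()));
  // First entry whose address is strictly greater than Addr.
  auto It = std::upper_bound(
      Sorted.begin(), Sorted.end(), Addr,
      [](uint64_t A, const MappingSymbol &M) { return A < M.first; });
  if (It == Sorted.begin())
    return '\0';
  return std::prev(It)->second;
}
```

Under the "use mapping symbols if they're there" criterion, a lookup like this would be consulted whenever the table is non-empty, regardless of whether `--disassemble-all` was given.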

Moved the check for data symbols to before we choose symbols to display, so that the same check can control which symbol is printed and how the data after it is disassembled.

Added a test for the changed behaviour of --disassemble-all, and tweaked comments and layout in existing tests for review comments.

Herald added a subscriber: kristof.beyls. Aug 19 2022, 2:35 AM

Harbormaster completed remote builds in B182183: Diff 453935. Aug 19 2022, 3:17 AM

In D131589#3734616, @simon_tatham wrote:

In D131589#3728128, @jhenderson wrote:

I think one bit of testing you still need is showing whether the demangled or raw name is used for the priority ordering of symbol names.

I'm not sure what you mean by that. Priority order of symbol names? The sorting order in Symbols is set up before anything gets demangled, so it will be based on the raw name, but that isn't changed by this patch.

Your new test is about multiple symbols at the same location, and you specifically call out the alphabetical sorting in a comment. That then immediately raises the question about whether demangled or mangled names are used. It's not the end of the world, since you rightly point out that this aspect hasn't changed, but I think it would still be useful to check (assuming of course it isn't already tested, anyway).

llvm/test/tools/llvm-objdump/ELF/ARM/disassemble-all.s
1 ↗	(On Diff #453935)	Perhaps worth adding "mapping-symbols" to the test name, e.g. `disassemble-all-mapping-symbols.s`, since it's specifically the interaction of the two that's interesting.
llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.test
18 ↗	(On Diff #453935)	"is displayed before" sounds incomplete. Before what? Do you mean "is displayed first"? Also, missing full stop at end of sentence.
25–32 ↗	(On Diff #453935)	Nit: the whitespace for indentation of these lines is inconsistent with the first block above. Please fix.
llvm/test/tools/llvm-objdump/multiple-symbols.test
33 ↗	(On Diff #453011)	"The problem is that the -NEXT suffix doesn't apply between different FileCheck prefixes."
I'm 95% certain that this is incorrect, as I just tested it out locally with the following test passing fine for me:
	# RUN: echo foo > %t.txt
	# RUN: echo baz >> %t.txt
	# RUN: FileCheck %s --input-file=%t.txt --check-prefixes=FOO,BAZ
	# FOO: foo
	# BAR-NEXT: bar
	# BAZ-NEXT: baz
Did you perhaps accidentally omit the `COMMON` from one of your FileCheck prefix sets? The only rule for -NEXT/-EMPTY commands is that there has to be one regular check (across all prefix sets) before the first -NEXT/-EMPTY.
101–103 ↗	(On Diff #453011)	Perhaps worth additional comments then to explain this.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1518

simon_tatham marked 10 inline comments as done. Aug 19 2022, 7:07 AM

simon_tatham added inline comments.

llvm/test/tools/llvm-objdump/multiple-symbols.test
33 ↗	(On Diff #453011)	You're right, it does work the way you say. I had indeed missed having an initial regular check, because I started off with a `-NOT` check, which doesn't count. But I was misled by FileCheck's error message: if I adjust your demo so that its first check is `FOO-NOT`, then I see
	z.test:3:3: error: found 'BAZ-NEXT' without previous 'BAZ: line
which is what led me to think it worked the way I said!
llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511	I'm afraid I don't know enough about that kind of STL iterator idiom to see how you'd do it in the `else` loop, where you not only have to iterate over `SymbolsHere` but also extract the `Name` field of each one. You'd need some kind of lambda, or templated field extraction gadget, or something, surely?

I think I've now addressed all your review comments, including adding a demonstration of alphabetical order vs demangling.

(Ah, that's where the one last unticked Done box was hiding.)

Harbormaster completed remote builds in B182212: Diff 453985. Aug 19 2022, 7:52 AM

jhenderson added inline comments. Aug 22 2022, 12:16 AM

llvm/test/tools/llvm-objdump/multiple-symbols-mangling.test
7 ↗	(On Diff #453985)	I'm wondering if this is a case where a generated-from-assembly-using-llvm-mc input might be more appropriate. Given we already need the Arm target for the disassembly, and we don't need to control any fine details of the object really, I don't think you lose any coverage, and the input file would be simpler. It might be worth looking to do the same at some of the other tests, though I haven't tried to figure out whether the assembly equivalent would be simpler.
14 ↗	(On Diff #453985)	FWIW, I find `_` characters in check prefixes weird. I'm also slightly hesitant, because it is easy enough to mistype an `_` as `-` (but less likely the other way around). I'm not sure you lose much by switching to `-`.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511	You're looking for `std::transform` I believe in that case, though it's debatable whether it's easier to read, so what you've got is fine.

Adjusted check prefixes, and translated all yaml2obj inputs into llvm-mc inputs (in every case the test still fails with the old llvm-objdump, i.e. the new input still produces an object file that successfully tests the changed behaviour).

llvm/test/tools/llvm-objdump/multiple-symbols-mangling.test
14 ↗	(On Diff #453985)	I wasn't sure whether the use of `-` in the middle of the check prefix might conflict with the use of `-` to separate the semantic `-NOT:` and so forth at the end. But apparently that works fine.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1510–1511	Yes, I see, with a lambda to extract the field of each object. I agree it's nicer to leave it as it is :-)
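
For reference, the `std::transform` form discussed in this thread might look like the sketch below. `SymbolInfo` is a hypothetical stand-in with a `std::string` name field, `namesOf` is a made-up helper, and `std::string_view` replaces `llvm::StringRef`.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the tool's symbol record.
struct SymbolInfo {
  std::string Name;
};

// Builds a list of name views, using a lambda to extract the Name field of
// each symbol. The views are only valid while SymbolsHere is alive.
std::vector<std::string_view>
namesOf(const std::vector<SymbolInfo> &SymbolsHere) {
  std::vector<std::string_view> Names;
  Names.reserve(SymbolsHere.size());
  std::transform(SymbolsHere.begin(), SymbolsHere.end(),
                 std::back_inserter(Names),
                 [](const SymbolInfo &S) { return std::string_view(S.Name); });
  assert(Names.size() == SymbolsHere.size());
  return Names;
}
```

Whether this reads better than a plain range-based `for` loop is a matter of taste, which matches the conclusion reached in the thread.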

Harbormaster completed remote builds in B182536: Diff 454428. Aug 22 2022, 3:16 AM

LGTM, but before pushing, probably worth giving others (@rafauler, @MaskRay in particular) a day or two to have another look.

This revision is now accepted and ready to land. Aug 22 2022, 3:43 AM

This revision was landed with ongoing or failed builds. Aug 24 2022, 7:08 AM

Closed by commit rG8e29f3f1c35a: [llvm-objdump] Handle multiple syms at same addr in disassembly. (authored by simon_tatham).

This revision was automatically updated to reflect the committed changes.

simon_tatham added a commit: rG8e29f3f1c35a: [llvm-objdump] Handle multiple syms at same addr in disassembly.

simon_tatham mentioned this in rG79f99bf6220e: [bolt] Fix a test affected by D131589. Aug 24 2022, 7:52 AM

Thanks! This is a useful option. FWIW I created a feature request for GNU objdump https://sourceware.org/bugzilla/show_bug.cgi?id=29847

MaskRay mentioned this in rGd3b7c84a0bf6: [llvm-objdump][docs] Mention --show-all-symbols. Dec 5 2022, 12:01 PM

Revision Contents

Path                                                                     Size
llvm/test/tools/llvm-objdump/ELF/ARM/disassemble-all-mapping-symbols.s   32 lines
llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.s                 66 lines
llvm/test/tools/llvm-objdump/multiple-symbols-mangling.s                 42 lines
llvm/test/tools/llvm-objdump/multiple-symbols.s                          97 lines
llvm/tools/llvm-objdump/ObjdumpOpts.td                                   4 lines
llvm/tools/llvm-objdump/llvm-objdump.cpp                                 225 lines

Diff 455194

llvm/test/tools/llvm-objdump/ELF/ARM/disassemble-all-mapping-symbols.s

This file was added.

				// Regression test for a bug in which --disassemble-all had the side effect
				// of stopping mapping symbols from being checked in code sections, so that
				// mixed Arm/Thumb code would not all be correctly disassembled.

				@ RUN: llvm-mc -triple arm-unknown-linux -filetype=obj %s -o %t.o
				@ RUN: llvm-objdump -d %t.o | FileCheck %s
				@ RUN: llvm-objdump -d --disassemble-all %t.o | FileCheck %s

				@ CHECK: 00000000 <armfunc>:
				@ CHECK: 0: e2800001 add r0, r0, #1
				@ CHECK: 4: e12fff1e bx lr
				@
				@ CHECK: 00000008 <thmfunc>:
				@ CHECK: 8: f100 0001 add.w r0, r0, #1
				@ CHECK: c: 4770 bx lr

				.arch armv8a
				.text

				.arm
				.global armfunc
				.type armfunc, %function
				armfunc:
				add r0, r0, #1
				bx lr

				.thumb
				.global thmfunc
				.type thmfunc, %function
				thmfunc:
				add r0, r0, #1
				bx lr

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.s

This file was added.

				@ REQUIRES: arm-registered-target

				// Test that code symbols take priority over data symbols if both are
				// defined at the same address during disassembly.
				//
				// In the past, llvm-objdump would select the alphabetically last
				// symbol at each address. To demonstrate that it's now choosing by
				// symbol type, we define pairs of code and data symbols at the same
				// address in such a way that the code symbol and data symbol each
				// have a chance to appear alphabetically last. Also, we test that
				// both STT_FUNC and STT_NOTYPE are regarded as code symbols.

				@ RUN: llvm-mc -triple armv8a-unknown-linux -filetype=obj %s -o %t.o
				@ RUN: llvm-objdump --triple armv8a -d %t.o | FileCheck %s

				// Ensure that all four instructions in the section are disassembled
				// rather than dumped as data, and that in each case, the code symbol
				// is displayed before the disassembly, and not the data symbol at the
				// same address.

				@ CHECK: Disassembly of section .text:
				@ CHECK-EMPTY:
				@ CHECK-NEXT: <A1function>:
				@ CHECK-NEXT: movw r0, #1
				@ CHECK-EMPTY:
				@ CHECK-NEXT: <B2function>:
				@ CHECK-NEXT: movw r0, #2
				@ CHECK-EMPTY:
				@ CHECK-NEXT: <A3notype>:
				@ CHECK-NEXT: movw r0, #3
				@ CHECK-EMPTY:
				@ CHECK-NEXT: <B4notype>:
				@ CHECK-NEXT: movw r0, #4

				.text

				.globl A1function
				.globl B2function
				.globl A3notype
				.globl B4notype
				.globl B1object
				.globl A2object
				.globl B3object
				.globl A4object

				.type A1function,%function
				.type B2function,%function
				.type A3notype,%notype
				.type B4notype,%notype
				.type B1object,%object
				.type A2object,%object
				.type B3object,%object
				.type A4object,%object

				A1function:
				B1object:
				movw r0, #1
				A2object:
				B2function:
				movw r0, #2
				A3notype:
				B3object:
				movw r0, #3
				A4object:
				B4notype:
				movw r0, #4

llvm/test/tools/llvm-objdump/multiple-symbols-mangling.s

This file was added.

				// This test demonstrates that the alphabetical-order tie breaking between
				// multiple symbols defined at the same address is based on the raw symbol
				// name, not its demangled version.

				@ REQUIRES: arm-registered-target

				@ RUN: llvm-mc -triple armv8a-unknown-linux -filetype=obj %s -o %t.o

				// All the run lines below should generate some subset of this
				// display, with different parts included:

				@ COMMON: Disassembly of section .text:
				@
				@ RAW-B: 00000000 <_Z4bbbbv>:
				@ NICE-B: 00000000 <bbbb()>:
				@ NO-B-NOT: bbbb
				@ A: 00000000 <aaaa>:
				@ COMMON: 0: e0800080 add r0, r0, r0, lsl #1
				@ COMMON: 4: e12fff1e bx lr

				// The default disassembly chooses just the alphabetically later symbol, which
				// is aaaa, because the leading _ on a mangled name sorts before lowercase
				// ASCII.

				@ RUN: llvm-objdump --triple armv8a -d %t.o | FileCheck --check-prefixes=COMMON,NO-B,A %s

				// With the --show-all-symbols option, bbbb is also shown, in its raw form.

				@ RUN: llvm-objdump --triple armv8a --show-all-symbols -d %t.o | FileCheck --check-prefixes=COMMON,RAW-B,A %s

				// With --demangle as well, bbbb is demangled, but that doesn't change its
				// place in the sorting order.

				@ RUN: llvm-objdump --triple armv8a --show-all-symbols --demangle -d %t.o | FileCheck --check-prefixes=COMMON,NICE-B,A %s

				.text
				.globl aaaa
				.globl _Z4bbbbv
				aaaa:
				_Z4bbbbv:
				add r0, r0, r0, lsl #1
				bx lr
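
The claim that a leading '_' sorts before lowercase ASCII can be checked with a plain byte-wise string comparison. This is only a sketch of the ordering the test comment describes, not the comparison llvm-objdump performs internally; pickAlphabeticallyLast is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the name that sorts last under byte-wise comparison.
// Names must not be empty.
std::string pickAlphabeticallyLast(const std::vector<std::string> &Names) {
  assert(!Names.empty());
  return *std::max_element(Names.begin(), Names.end());
}
```

Applied to the raw names aaaa and _Z4bbbbv this picks aaaa, because '_' (0x5F) compares lower than 'a' (0x61). Applied to the demangled forms aaaa and bbbb() it would pick bbbb(), which is why the test distinguishes sorting on raw names from sorting on demangled ones.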

llvm/test/tools/llvm-objdump/multiple-symbols.s

This file was added.

				// This test checks the behavior of llvm-objdump's --disassemble-symbols and
				// --show-all-symbols options, in the presence of multiple symbols defined at
				// the same address in an object file.

				// The test input file contains an Arm and a Thumb function, each with two
				// function-type symbols defined at its entry point. Also, because it's Arm,
				// there's a $a mapping symbol defined at the start of the section, and a $t
				// mapping symbol at the point where Arm code stops and Thumb code begins.

				// By default, llvm-objdump will pick one of the symbols to disassemble at each
				// point where any are defined at all. The tie-break sorting criterion is
				// alphabetic, so it will be the alphabetically later symbol in each case: of
				// the names aaaa and bbbb for the Arm function it picks bbbb, and of cccc and
				// dddd for the Thumb function it picks dddd.

				// Including an Arm and a Thumb function also re-checks that these changes to
				the display of symbols don't affect the recognition of mapping symbols for
				// the purpose of switching disassembly mode.

				@ REQUIRES: arm-registered-target

				@ RUN: llvm-mc -triple armv8a-unknown-linux -filetype=obj %s -o %t.o

				// All the run lines below should generate some subset of this
				// display, with different parts included:

				@ HEAD: Disassembly of section .text:
				@ HEAD-EMPTY:
				@ AMAP-NEXT: 00000000 <$a.0>:
				@ AAAA-NEXT: 00000000 <aaaa>:
				@ BBBB-NEXT: 00000000 <bbbb>:
				@ AABB-NEXT: 0: e0800080 add r0, r0, r0, lsl #1
				@ AABB-NEXT: 4: e12fff1e bx lr
				@ BOTH-EMPTY:
				@ TMAP-NEXT: 00000008 <$t.1>:
				@ CCCC-NEXT: 00000008 <cccc>:
				@ DDDD-NEXT: 00000008 <dddd>:
				@ CCDD-NEXT: 8: eb00 0080 add.w r0, r0, r0, lsl #2
				@ CCDD-NEXT: c: 4770 bx lr

				// The default disassembly chooses just the alphabetically later symbol of each
				// set, namely bbbb and dddd.

				@ RUN: llvm-objdump --triple armv8a -d %t.o | FileCheck --check-prefixes=HEAD,BBBB,AABB,BOTH,DDDD,CCDD %s

				// With the --show-all-symbols option, all the symbols are shown, including the
				// administrative mapping symbols.

				@ RUN: llvm-objdump --triple armv8a --show-all-symbols -d %t.o | FileCheck --check-prefixes=HEAD,AMAP,AAAA,BBBB,AABB,BOTH,TMAP,CCCC,DDDD,CCDD %s

				// If we use --disassemble-symbols to ask for the disassembly of aaaa or bbbb
				// or both, then we expect the second cccc/dddd function not to appear in the
				// output at all. Also, we want to see whichever symbol we asked about, or both
				// if we asked about both.

				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=aaaa -d %t.o | FileCheck --check-prefixes=HEAD,AAAA,AABB %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=bbbb -d %t.o | FileCheck --check-prefixes=HEAD,BBBB,AABB %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=aaaa,bbbb -d %t.o | FileCheck --check-prefixes=HEAD,AAAA,BBBB,AABB %s

				// With _any_ of those three options and also --show-all-symbols, the
				// disassembled code is still limited to just the symbol(s) you asked about,
				// but all symbols defined at the same address are mentioned, whether you asked
				// about them or not.

				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=aaaa --show-all-symbols -d %t.o | FileCheck --check-prefixes=HEAD,AMAP,AAAA,BBBB,AABB %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=bbbb --show-all-symbols -d %t.o | FileCheck --check-prefixes=HEAD,AMAP,AAAA,BBBB,AABB %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=aaaa,bbbb --show-all-symbols -d %t.o | FileCheck --check-prefixes=HEAD,AMAP,AAAA,BBBB,AABB %s

				// Similarly for the Thumb function and its symbols. This time we must check
				// that the aaaa/bbbb block of code was not disassembled _before_ the output
				// we're expecting.

				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=cccc -d %t.o | FileCheck --check-prefixes=HEAD,CCCC,CCDD %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=dddd -d %t.o | FileCheck --check-prefixes=HEAD,DDDD,CCDD %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=cccc,dddd -d %t.o | FileCheck --check-prefixes=HEAD,CCCC,DDDD,CCDD %s

				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=cccc --show-all-symbols -d %t.o | FileCheck --check-prefixes=HEAD,TMAP,CCCC,DDDD,CCDD %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=dddd --show-all-symbols -d %t.o | FileCheck --check-prefixes=HEAD,TMAP,CCCC,DDDD,CCDD %s
				@ RUN: llvm-objdump --triple armv8a --disassemble-symbols=cccc,dddd --show-all-symbols -d %t.o | FileCheck --check-prefixes=HEAD,TMAP,CCCC,DDDD,CCDD %s

				.text
				.globl aaaa
				.globl bbbb
				.globl cccc
				.globl dddd

				.arm
				aaaa:
				bbbb:
				add r0, r0, r0, lsl #1
				bx lr

				.thumb
				cccc:
				dddd:
				add.w r0, r0, r0, lsl #2
				bx lr
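
The interaction between --disassemble-symbols and --show-all-symbols exercised by the RUN lines above can be modelled as a small selection function. This is a simplified sketch rather than the code in llvm-objdump.cpp: selectSymbolsToPrint and its parameters are hypothetical names, std::set stands in for the tool's own symbol set, and the caller is assumed to pass the index of the preferred symbol as DefaultIndex.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Decides which of the symbols defined at one address get their names
// printed. Fills Print with one flag per symbol and returns false if the
// whole chunk should be skipped because none of its symbols was requested.
bool selectSymbolsToPrint(const std::vector<std::string> &NamesHere,
                          const std::set<std::string> &Requested, bool ShowAll,
                          size_t DefaultIndex, std::vector<bool> &Print) {
  assert(DefaultIndex < NamesHere.size());
  Print.assign(NamesHere.size(), false);
  if (!Requested.empty()) {
    // Explicit request: print exactly the requested symbols found here.
    bool FoundAny = false;
    for (size_t I = 0; I < NamesHere.size(); ++I) {
      if (Requested.count(NamesHere[I])) {
        Print[I] = true;
        FoundAny = true;
      }
    }
    if (!FoundAny)
      return false;
  } else {
    // No request: print only the preferred symbol.
    Print[DefaultIndex] = true;
  }
  // Show-all overrides the choice, but only for chunks that are kept.
  if (ShowAll)
    Print.assign(NamesHere.size(), true);
  return true;
}
```

With the names $a.0, aaaa and bbbb at offset 0, requesting aaaa alone prints only aaaa, adding the show-all flag prints all three, and requesting cccc skips the chunk entirely, which matches the check prefixes used above.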

llvm/tools/llvm-objdump/ObjdumpOpts.td


  def section_headers : Flag<["--"], "section-headers">,
    HelpText<"Display summaries of the headers for each section.">;
  def : Flag<["--"], "headers">, Alias<section_headers>,
    HelpText<"Alias for --section-headers">;
  def : Flag<["-"], "h">, Alias<section_headers>,
    HelpText<"Alias for --section-headers">;

+ def show_all_symbols : Flag<["--"], "show-all-symbols">,
+   HelpText<"Show all symbols during disassembly, even if multiple "
+            "symbols are defined at the same location">;

  def show_lma : Flag<["--"], "show-lma">,
    HelpText<"Display LMA column when dumping ELF section headers">;

  def source : Flag<["--"], "source">,
    HelpText<"When disassembling, display source interleaved with the "
             "disassembly. Implies --disassemble">;
  def : Flag<["-"], "S">, Alias<source>, HelpText<"Alias for --source">;


llvm/tools/llvm-objdump/llvm-objdump.cpp


  bool objdump::LeadingAddr;
  static bool Offloading;
  static bool RawClangAST;
  bool objdump::Relocations;
  bool objdump::PrintImmHex;
  bool objdump::PrivateHeaders;
  std::vector<std::string> objdump::FilterSections;
  bool objdump::SectionHeaders;
+ static bool ShowAllSymbols;
  static bool ShowLMA;
  bool objdump::PrintSource;
  static uint64_t StartAddress;
  static bool HasStartAddressFlag;
  static uint64_t StopAddress = UINT64_MAX;
  static bool HasStopAddressFlag;

@@ for (const SectionRef &Section : ToolSectionFilter(Obj)) {
      // the section offset.
      uint64_t RelAdjustment = Obj.isRelocatableObject() ? 0 : SectionAddr;
      uint64_t Size;
      uint64_t Index;
      bool PrintedSection = false;
      std::vector<RelocationRef> Rels = RelocMap[Section];
      std::vector<RelocationRef>::const_iterator RelCur = Rels.begin();
      std::vector<RelocationRef>::const_iterator RelEnd = Rels.end();
-     // Disassemble symbol by symbol.
-     for (unsigned SI = 0, SE = Symbols.size(); SI != SE; ++SI) {
-       std::string SymbolName = Symbols[SI].Name.str();
-       if (Demangle)
-         SymbolName = demangle(SymbolName);
-       // Skip if --disassemble-symbols is not empty and the symbol is not in
-       // the list.
-       if (!DisasmSymbolSet.empty() && !DisasmSymbolSet.count(SymbolName))
+     // Loop over each chunk of code between two points where at least
+     // one symbol is defined.
+     for (size_t SI = 0, SE = Symbols.size(); SI != SE;) {
+       // Advance SI past all the symbols starting at the same address,
+       // and make an ArrayRef of them.
+       unsigned FirstSI = SI;
+       uint64_t Start = Symbols[SI].Addr;
+       ArrayRef<SymbolInfoTy> SymbolsHere;
+       while (SI != SE && Symbols[SI].Addr == Start)
+         ++SI;
+       SymbolsHere = ArrayRef<SymbolInfoTy>(&Symbols[FirstSI], SI - FirstSI);

jhenderson: Nit: I think we prefer preincrement to post.

rafauler: Curious about the style here where we define a scope to isolate a piece of computation. I don't think I've seen this in other LLVM files or in the coding style, so I'm curious what other people think about it.

I do think it makes the code more clear to understand here, because it's only 3 lines. But on the other hand, it increases nesting on line 1509, which can actually make code harder to read (arguably).

jhenderson: I'm not sure this style is used in LLVM, or at least not in the areas I'm familiar with, so I'd drop it.

simon_tatham (author): Oops, good catch. Those braces were originally there to isolate some variables so that they weren't in scope for the rest of the enormous for-loop body. But by the time I finished developing the patch I'd removed all the variables local to the block, and somehow didn't notice that even in three last-minute pre-upload reviews of my own :-)

Now I look closely, the same thing happened to the other scope that decides which symbols to print. I'll fix that one too.

+       // Get the demangled names of all those symbols. We end up with a vector
+       // of StringRef that holds the names we're going to use, and a vector of
+       // std::string that stores the new strings returned by demangle(), if
+       // any. If we don't call demangle() then that vector can stay empty.

jhenderson: Can this be a vector of StringRef to save the copying?

simon_tatham (author): Not trivially, because in the case where we have to demangle names, demangle() returns a newly made string and we have to have somewhere to store it.

I've changed it so that we have the vector of StringRef you wanted, and also a vector of std::string which is left empty when not in demangling mode, and the StringRefs point at the local vector or the original name elsewhere, as appropriate.

jhenderson: I'd forgotten about the demangle aspect of this. As such, I'm not fussed whichever way you prefer to do it.

+       std::vector<StringRef> SymNamesHere;
+       std::vector<std::string> DemangledSymNamesHere;
+       if (Demangle) {
+         // Fetch the demangled names and store them locally.
+         for (const SymbolInfoTy &Symbol : SymbolsHere)
+           DemangledSymNamesHere.push_back(demangle(Symbol.Name.str()));
+         // Now we've finished modifying that vector, it's safe to make

jhenderson (suggested edit):

      if (Demangle) {
    -   for (const SymbolInfoTy &Symbol : SymbolsHere)
    +   for (const SymbolInfoTy &Symbol : SymbolsHere) {
          DemangledSymNamesHere.push_back(demangle(Symbol.Name.str()));
    -   for (const std::string &DemangledName : DemangledSymNamesHere)
    -     SymNamesHere.push_back(DemangledName);
    +     SymNamesHere.push_back(DemangledSymNamesHere.back());
    +   }
      } else {

You can do this, right?

simon_tatham (author): I nearly did, but I wasn't confident that a StringRef to a std::string stored in a std::vector is guaranteed to stay valid if the std::vector has to resize itself. What if the std::string implementation stores short enough strings without a separate allocation, and the vector resize involves a realloc?

So instead I did something I'm sure is safe, which is to set up the entire vector of strings, commit to never modifying it again, and then start making StringRefs pointing into it.

(Another option I'd have been completely confident of would be to make a vector of unique_ptr<string>, so that even if the vector resizes, each pointed-to string stays put. That forces even more allocations, though.)

jhenderson: Good point - keep the loop as-is. It turns out that the "Small String Optimization" that std::string can use under the hood means that, perhaps unintuitively, the string's data can be moved when the string itself is moved, rather than just the pointer to the data (see https://stackoverflow.com/questions/57723963/is-it-safe-to-store-the-pointer-to-the-data-of-a-stdstring).
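
The two-phase approach discussed in this thread can be illustrated with standard-library types only. The sketch below assumes C++17 and uses std::string_view as a stand-in for LLVM's StringRef; decorate is a hypothetical placeholder for demangle(), and NameTable and buildNames are names invented for this example.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for demangle(): returns a newly made string.
std::string decorate(const std::string &Raw) { return "<" + Raw + ">"; }

struct NameTable {
  std::vector<std::string> Storage;    // owns the newly made strings
  std::vector<std::string_view> Views; // non-owning views into Storage
};

NameTable buildNames(const std::vector<std::string> &Raw) {
  NameTable N;
  // Phase 1: finish every modification of the owning vector first. Taking
  // views during this loop would be unsafe, because a reallocation moves
  // the std::string objects, and short strings keep their bytes inline.
  for (const std::string &R : Raw)
    N.Storage.push_back(decorate(R));
  // Phase 2: Storage is not resized again, so views into it stay valid.
  N.Views.insert(N.Views.begin(), N.Storage.begin(), N.Storage.end());
  assert(N.Views.size() == N.Storage.size());
  // Returning N moves (or elides the copy of) the vectors; moving a
  // std::vector hands over its buffer, so the strings themselves stay put.
  return N;
}
```

The design trade-off is the one described above: one extra pass over the names in exchange for not having to reason about reference invalidation during push_back.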

+         // a vector of StringRefs pointing into it.
+         SymNamesHere.insert(SymNamesHere.begin(), DemangledSymNamesHere.begin(),
+                             DemangledSymNamesHere.end());

jhenderson: "(that we can find at all)" - I'm not sure what this is saying. Do you even need it?

simon_tatham (author): I just meant that if a symbol specified by the user doesn't appear in the object file at all, then we're exempt from the need to display it. But I agree that's not 100% clear, or particularly important to highlight in this context. I'll remove the parenthesis.

jhenderson: I feel like this should be simplifiable to a single line, possibly using something like SymNamesHere.insert(SymNamesHere.begin(), DemangledSymNamesHere.begin(), DemangledSymNamesHere.end()); although maybe it's unnecessarily complex. (Same goes for the loop immediately below in the else).

simon_tatham (author): I'm afraid I don't know enough about that kind of STL iterator idiom to see how you'd do it in the else loop, where you not only have to iterate over SymbolsHere but also extract the Name field of each one. You'd need some kind of lambda, or templated field extraction gadget, or something, surely?

jhenderson: You're looking for std::transform I believe in that case, though it's debatable whether it's easier to read, so what you've got is fine.

simon_tatham (author): Yes, I see, with a lambda to extract the field of each object. I agree it's nicer to leave it as it is :-)
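
For comparison, the std::transform alternative mentioned in this thread would look roughly like the following. It is shown on a hypothetical SymbolInfo struct with only a Name field, not on llvm-objdump's SymbolInfoTy, and it was not adopted in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical symbol record with only the field needed here.
struct SymbolInfo {
  std::string Name;
};

// Collects the Name field of every symbol, using a lambda as the
// field-extraction step.
std::vector<std::string> namesOf(const std::vector<SymbolInfo> &Symbols) {
  std::vector<std::string> Names;
  Names.reserve(Symbols.size());
  std::transform(Symbols.begin(), Symbols.end(), std::back_inserter(Names),
                 [](const SymbolInfo &S) { return S.Name; });
  assert(Names.size() == Symbols.size());
  return Names;
}
```

Whether this reads better than the two-line range-based for loop is a matter of taste, which is the conclusion the reviewers reached as well.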

+       } else {
+         for (const SymbolInfoTy &Symbol : SymbolsHere)
+           SymNamesHere.push_back(Symbol.Name);
+       }
+       // Distinguish ELF data from code symbols, which will be used later on to
+       // decide whether to 'disassemble' this chunk as a data declaration via

jhenderson (suggested edit):

      // Distinguish ELF data from code symbols, which will be used later on to
    - // decide whether to 'disassemble' this chunk as at data declaration via
    + // decide whether to 'disassemble' this chunk as a data declaration via
      // dumpELFData(), or whether to treat it as code.

+       // dumpELFData(), or whether to treat it as code.
+       // If data _and_ code symbols are defined at the same address, the code
+       // takes priority, on the grounds that disassembling code is our main
+       // purpose here, and it would be a worse failure to _not_ interpret
+       // something that _was_ meaningful as code than vice versa.
+       // Any ELF symbol type that is not clearly data will be regarded as code.
+       // In particular, one of the uses of STT_NOTYPE is for branch targets
+       // inside functions, for which STT_FUNC would be inaccurate.
+       // So here, we spot whether there's any non-data symbol present at all,
+       // and only set the DisassembleAsData flag if there isn't. Also, we use
+       // this distinction to inform the decision of which symbol to print at
+       // the head of the section, so that if we're printing code, we print a

jhenderson (suggested edit):

      SymsToPrint[SymbolsHere.size() - 1] = true;
      }
    - // Now that we know we're disassembling this section at all, override
    + // Now that we know we're disassembling this section, override
      // the choice of which symbols to display by printing _all_ of them a

+       // code-related symbol name to go with it.

jhenderson (suggested edit):

      // Now that we know we're disassembling this section at all, override
    - // the choice of which symbols to display by printing _all_ of them a
    + // the choice of which symbols to display by printing _all_ of them at
      // this address if the user asked for all symbols.

+       bool DisassembleAsData = false;
+       size_t DisplaySymIndex = SymbolsHere.size() - 1;
+       if (Obj.isELF() && !DisassembleAll && Section.isText()) {
+         DisassembleAsData = true; // unless we find a code symbol below
+         for (size_t i = 0; i < SymbolsHere.size(); ++i) {

jhenderson: Not sure you need the parentheses, or the "//" emphasis marks. Also, "Arm" should be "ARM".

+           uint8_t SymTy = SymbolsHere[i].Type;
+           if (SymTy != ELF::STT_OBJECT && SymTy != ELF::STT_COMMON) {

jhenderson: size_t would be the more natural type for loops over SymbolsHere, since that's the value returned by size() and used by the index operator.

Ditto below.

simon_tatham (author): I agree! But I guessed that the widespread existing use of unsigned in similar cases in the LLVM code base (for example, this very loop where SI ranges up to Symbols.size()) was a local idiom that I'd be criticised for going against. I'm glad to see that the opposite is true ;-)

+             DisassembleAsData = false;
+             DisplaySymIndex = i;
+           }
+         }
+       }
+       // Decide which symbol(s) from this collection we're going to print.
+       std::vector<bool> SymsToPrint(SymbolsHere.size(), false);
+       // If the user has given the --disassemble-symbols option, then we must
+       // display every symbol in that set, and no others.
+       if (!DisasmSymbolSet.empty()) {
+         bool FoundAny = false;
+         for (size_t i = 0; i < SymbolsHere.size(); ++i) {
+           if (DisasmSymbolSet.count(SymNamesHere[i])) {
+             SymsToPrint[i] = true;
+             FoundAny = true;
+           }
+         }
+         // And if none of the symbols here is one that the user asked for, skip
+         // disassembling this entire chunk of code.
+         if (!FoundAny)
            continue;
+       } else {
+         // Otherwise, print whichever symbol at this location is last in the
+         // Symbols array, because that array is pre-sorted in a way intended to
+         // correlate with priority of which symbol to display.
+         SymsToPrint[DisplaySymIndex] = true;
+       }
+       // Now that we know we're disassembling this section, override the choice
+       // of which symbols to display by printing _all_ of them at this address
+       // if the user asked for all symbols.
+       // That way, '--show-all-symbols --disassemble-symbol=foo' will print
+       // only the chunk of code headed by 'foo', but also show any other
+       // symbols defined at that address, such as aliases for 'foo', or the ARM
+       // mapping symbol preceding its code.
+       if (ShowAllSymbols) {
+         for (size_t i = 0; i < SymbolsHere.size(); ++i)
+           SymsToPrint[i] = true;
+       }
-       uint64_t Start = Symbols[SI].Addr;
        if (Start < SectionAddr || StopAddress <= Start)
          continue;
-       else
-         FoundDisasmSymbolSet.insert(SymbolName);
+       for (size_t i = 0; i < SymbolsHere.size(); ++i)
+         FoundDisasmSymbolSet.insert(SymNamesHere[i]);
        // The end is the section end, the beginning of the next symbol, or
        // --stop-address.
        uint64_t End = std::min<uint64_t>(SectionAddr + SectSize, StopAddress);
-       if (SI + 1 < SE)
-         End = std::min(End, Symbols[SI + 1].Addr);
+       if (SI < SE)
+         End = std::min(End, Symbols[SI].Addr);
        if (Start >= End || End <= StartAddress)

jhenderson: It's not immediately obvious to me why this line has changed. Could you explain please, as it likely means I've missed something.

simon_tatham (author): In the existing version of the loop, SI is incremented after the loop body runs, by the ++SI in the for statement itself. So throughout the loop body, SI points at the symbol we're currently disassembling, and SI+1 here indicates the next symbol, whose address marks the point where we're planning to finish this iteration of the disassembly loop.

In the new version, I've removed the ++SI in the for statement, and replaced it with code at the beginning of the loop body that advances SI past all the symbols defined at the same address. So after that code runs, the rest of the loop body sees SI already pointing at the first symbol defined at a later address.
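
The explanation above can be seen in a stripped-down version of the loop. This sketch keeps only the index arithmetic: Sym and Chunk are hypothetical types, chunksByAddress is a name made up for this example, and the --start-address and --stop-address handling of the real loop is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Sym { // hypothetical stand-in for a symbol record
  uint64_t Addr;
};

struct Chunk {
  size_t First, Count; // the run of symbols that share Start
  uint64_t Start, End; // the address range covered by this chunk
};

// Splits address-sorted symbols into chunks, one per distinct address.
std::vector<Chunk> chunksByAddress(const std::vector<Sym> &Symbols,
                                   uint64_t SectionEnd) {
  assert(std::is_sorted(
      Symbols.begin(), Symbols.end(),
      [](const Sym &A, const Sym &B) { return A.Addr < B.Addr; }));
  std::vector<Chunk> Chunks;
  // No ++SI in the for statement: the body advances SI itself.
  for (size_t SI = 0, SE = Symbols.size(); SI != SE;) {
    size_t FirstSI = SI;
    uint64_t Start = Symbols[SI].Addr;
    while (SI != SE && Symbols[SI].Addr == Start)
      ++SI;
    // SI now indexes the first symbol at a later address (or equals SE),
    // so the chunk ends at Symbols[SI].Addr rather than Symbols[SI + 1].Addr.
    uint64_t End = SectionEnd;
    if (SI < SE)
      End = std::min(End, Symbols[SI].Addr);
    Chunks.push_back({FirstSI, SI - FirstSI, Start, End});
  }
  return Chunks;
}
```

For four symbols at addresses 0, 0, 8 and 8 in a 16-byte section, this yields two chunks, [0, 8) with two symbols and [8, 16) with two symbols, and no zero-length ranges.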

          continue;
        Start -= SectionAddr;
        End -= SectionAddr;
        if (!PrintedSection) {
          PrintedSection = true;
          outs() << "\nDisassembly of section ";
          if (!SegmentName.empty())
            outs() << SegmentName << ",";
          outs() << SectionName << ":\n";
        }
        outs() << '\n';
+       for (size_t i = 0; i < SymbolsHere.size(); ++i) {
+         if (!SymsToPrint[i])
+           continue;
+         const SymbolInfoTy &Symbol = SymbolsHere[i];
+         const StringRef SymbolName = SymNamesHere[i];

jhenderson: StringRef?

          if (LeadingAddr)
            outs() << format(Is64Bits ? "%016" PRIx64 " " : "%08" PRIx64 " ",
                             SectionAddr + Start + VMAAdjustment);
          if (Obj.isXCOFF() && SymbolDescription) {
-           outs() << getXCOFFSymbolDescription(Symbols[SI], SymbolName) << ":\n";
+           outs() << getXCOFFSymbolDescription(Symbol, SymbolName) << ":\n";
          } else
            outs() << '<' << SymbolName << ">:\n";
+       }
        // Don't print raw contents of a virtual section. A virtual section
        // doesn't have any contents in the file.
        if (Section.isVirtual()) {
          outs() << "...\n";
          continue;
        }
-       auto Status = DisAsm->onSymbolStart(Symbols[SI], Size,
-                                           Bytes.slice(Start, End - Start),
-                                           SectionAddr + Start, CommentStream);
-       // To have round trippable disassembly, we fall back to decoding the
-       // remaining bytes as instructions.
-       // If there is a failure, we disassemble the failed region as bytes before
-       // falling back. The target is expected to print nothing in this case.
+       // See if any of the symbols defined at this location triggers target-
+       // specific disassembly behavior, e.g. of special descriptors or function
+       // prelude information.
        //
-       // If there is Success or SoftFail i.e no 'real' failure, we go ahead by
-       // Size bytes before falling back.
-       // So if the entire symbol is 'eaten' by the target:
-       //   Start += Size  // Now Start = End and we will never decode as
-       //                  // instructions
+       // We stop this loop at the first symbol that triggers some kind of
+       // interesting behavior (if any), on the assumption that if two symbols
+       // defined at the same address trigger two conflicting symbol handlers,
+       // the object file is probably confused anyway, and it would make even
+       // less sense to present the output of _both_ handlers, because that
+       // would describe the same data twice.
+       for (size_t SHI = 0; SHI < SymbolsHere.size(); ++SHI) {
+         SymbolInfoTy Symbol = SymbolsHere[SHI];
+         auto Status =
+             DisAsm->onSymbolStart(Symbol, Size, Bytes.slice(Start, End - Start),
+                                   SectionAddr + Start, CommentStream);
+         if (!Status) {
+           // If onSymbolStart returns None, that means it didn't trigger any
+           // interesting handling for this symbol. Try the other symbols
+           // defined at this address.
+           continue;
+         }
+         if (Status.value() == MCDisassembler::Fail) {
+           // If onSymbolStart returns Fail, that means it identified some kind
+           // of special data at this address, but wasn't able to disassemble it
+           // meaningfully. So we fall back to disassembling the failed region
+           // as bytes, assuming that the target detected the failure before
+           // printing anything.
            //
-       // Right now, most targets return None i.e ignore to treat a symbol
-       // separately. But WebAssembly decodes preludes for some symbols.
+           // Return values Success or SoftFail (i.e no 'real' failure) are
+           // expected to mean that the target has emitted its own output.
            //
-       if (Status) {
-         if (Status.value() == MCDisassembler::Fail) {
-           outs() << "// Error in decoding " << SymbolName
+           // Either way, 'Size' will have been set to the amount of data
+           // covered by whatever prologue the target identified. So we advance
+           // our own position to beyond that. Sometimes that will be the entire
+           // distance to the next symbol, and sometimes it will be just a
+           // prologue and we should start disassembling instructions from where
+           // it left off.
+           outs() << "// Error in decoding " << SymNamesHere[SHI]
                   << " : Decoding failed region as bytes.\n";
            for (uint64_t I = 0; I < Size; ++I) {
              outs() << "\t.byte\t " << format_hex(Bytes[I], 1, /*Upper=*/true)
                     << "\n";
            }
          }
-       } else {
-         Size = 0;
-       }
          Start += Size;
+         break;
+       }
        Index = Start;
        if (SectionAddr < StartAddress)
          Index = std::max<uint64_t>(Index, StartAddress - SectionAddr);
-       // If there is a data/common symbol inside an ELF text section and we are
+       if (DisassembleAsData) {

jhendersonUnsubmitted

Done

Is there a subtle behaviour change here if you have multiple symbols at the same address but different types (i.e. one is STT_OBJECT and one isn't, e.g. STT_FUNC)?

jhenderson: Is there a subtle behaviour change here if you have multiple symbols at the same address but…

simon_tathamAuthorUnsubmitted

Done

Potentially, yes. Previously, llvm-objdump would pick just one of the symbols defined at the address, and base its decision on that symbol alone. With this patch, it will go through all of them, and spots any STT_OBJECT even if it's not the symbol last in the sorted list.

This is just the sort of thing I hoped to have a useful discussion about in order to decide what the behaviour should be, to avoid the risk of writing oodles of code to implement a complicated policy that we had no consensus on :-) so thanks for flagging it up.

What do you think we should do if an STT_FUNC and an STT_OBJECT occur at the same address? llvm-objdump's existing policy doesn't look particularly deliberate to me – it's an artefact of the code's previous lack of attention to collocated symbols. Perhaps it's nonetheless best to stick with the existing policy just for stability's sake, but if so, I'd prefer that we'd discussed other options before deciding that.

Other possibilities that spring to mind are to deliberately make STT_OBJECT highest priority (which is what's happening in this version of the code), or to make it lowest priority, or to choose based on some criterion like symbol index in the ELF file (go with whichever symbol was first/last in the actual object file's symtab). And maybe, whichever of those we do, emit a warning that flags up that we had to make an arbitrary decision that could have gone the other way.

What do you think?

(PS I hope you're not going to like the symtab index idea, because that information isn't preserved at all in SymbolInfoTy so it would take a load more plumbing :-)

jhenderson:

Looking at the comment block, my instinct says we should treat STT_FUNC as higher priority (possibly assuming it hasn't got size 0) and do regular disassembly. Having an STT_OBJECT/STT_COMMON symbol at the same address as an STT_FUNC symbol sounds like it's unlikely to ever occur in practice ("this code represents both a function and some data??"). If it does, I think it's reasonable to pick one style somewhat arbitrarily. The user can use --disassemble-symbol of the STT_OBJECT symbol if they want to disassemble it as data in this case, I think.

I'm less clear on the second block below about ARM mapping symbols, because I'm not familiar with how ARM mapping symbols are used, and therefore don't think I can make an informed decision on the right approach there.
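The priority suggested in this comment can be sketched as a small standalone predicate. This is only an illustration, not the code from the patch: `SymInfo` and `shouldDumpAsData` are hypothetical names, and treating every symbol that is not STT_OBJECT or STT_COMMON as "code" is an assumption of the sketch (the comment only names STT_FUNC explicitly).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// ELF symbol type values as defined by the ELF specification.
constexpr uint8_t STT_OBJECT = 1;
constexpr uint8_t STT_FUNC = 2;
constexpr uint8_t STT_COMMON = 5;

// Hypothetical, simplified stand-in for llvm-objdump's SymbolInfoTy.
struct SymInfo {
  std::string Name;
  uint8_t Type;
};

// Decide how to print a range that starts at an address shared by all of
// SymsHere. Code wins over data: the range is dumped as data only if at
// least one symbol is present and every symbol is a data symbol.
// Assumption: any type other than STT_OBJECT/STT_COMMON counts as code.
bool shouldDumpAsData(const std::vector<SymInfo> &SymsHere) {
  if (SymsHere.empty())
    return false;
  for (const SymInfo &S : SymsHere) {
    bool IsData = S.Type == STT_OBJECT || S.Type == STT_COMMON;
    if (!IsData)
      return false; // a code symbol at this address takes priority
  }
  return true;
}
```

With this rule, an address carrying both an STT_FUNC and an STT_OBJECT symbol is disassembled as code, and only an address carrying data symbols alone is dumped as data.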

simon_tatham (Author):

OK, I'll change it to treat code symbols as higher priority than data.

As for mapping symbols, that's a use case I do know something about, and honestly, I think the simplest approach there is to stop checking for STT_OBJECT symbols at all, and just say that if there are mapping symbols in this section, we should use them, and not try to second-guess whether we think they're useful. So I think the best thing is to remove that loop completely, and replace it with a test of MappingSymbols.empty().
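For readers unfamiliar with mapping symbols: in Arm ELF objects, `$a`, `$t` and `$d` mark where Arm code, Thumb code and data begin, and each marker applies until the next one. A lookup of the kind that applies at a given offset might look roughly like the sketch below. It is a standalone approximation under stated assumptions, not the body of llvm-objdump's `getMappingSymbolKind`: the name `mappingKindAt`, the `MappingSymbol` alias and the `'\0'` "no marker yet" result are all choices made for this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// (offset, kind) pairs, assumed sorted by offset. Kind is 'a' (Arm code),
// 't' (Thumb code) or 'd' (data). Hypothetical type for this sketch.
using MappingSymbol = std::pair<uint64_t, char>;

// Return the kind of the last mapping symbol at or before Address, or '\0'
// if Address lies before the first mapping symbol in the section.
char mappingKindAt(const std::vector<MappingSymbol> &Syms, uint64_t Address) {
  // Find the first entry whose offset is strictly greater than Address.
  auto It = std::partition_point(
      Syms.begin(), Syms.end(),
      [Address](const MappingSymbol &S) { return S.first <= Address; });
  if (It == Syms.begin())
    return '\0'; // no marker applies yet; caller falls back to its default
  return (It - 1)->second;
}
```

Under the proposal in this comment, the disassembly loop would consult such a lookup whenever the list for the section is non-empty, regardless of the type of the symbol at which the range starts.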

// only disassembling text (applicable all architectures), we are in a

// situation where we must print the data and not disassemble it.

if (Obj.isELF() && !DisassembleAll && Section.isText()) {

uint8_t SymTy = Symbols[SI].Type;

if (SymTy == ELF::STT_OBJECT || SymTy == ELF::STT_COMMON) {

dumpELFData(SectionAddr, Index, End, Bytes); dumpELFData(SectionAddr, Index, End, Bytes);

Index = End; Index = End;

} continue;

} }

bool CheckARMELFData = hasMappingSymbols(Obj) &&

jhenderson:

The change of this logic should correspond to some sort of test case, I think, but I don't think I see anything?

simon_tatham (Author):

It turned out that I had trouble thinking of something that would have changed as a result of removing this section!

The intention of the old code here is to avoid checking mapping symbols if we're starting disassembly at an STT_OBJECT symbol. But STT_OBJECT symbols are handled by the previous if statement by going to dumpELFData and then terminating this loop iteration, so it's difficult for one to get as far as here in the first place.

If there is any case that could have got here at all without being eaten by the previous test, it must be a confusing edge case of some kind and I haven't put my finger on it yet.

simon_tatham (Author):

Aha! There is an edge case affected by this change. If you set --disassemble-all to force disassembly of data sections, then the previous code would have had the side effect of ignoring mapping symbols in code sections, so you'd get Thumb code mistakenly disassembled as Arm.

The new criterion of "use mapping symbols if they're there" stops that failure from happening. I'll add a regression test for it.
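The difference between the two criteria can be shown with a pair of small predicates. This is a simplified model rather than the real code: the `Inputs` struct and both function names are hypothetical, and a single boolean stands in for both the old `hasMappingSymbols(Obj)` check and the new `!MappingSymbols.empty()` check, which are not exactly the same test in the tool itself.

```cpp
#include <cstdint>

// ELF symbol type values as defined by the ELF specification.
constexpr uint8_t STT_OBJECT = 1;
constexpr uint8_t STT_FUNC = 2;

// Hypothetical bundle of the inputs that feed the decision.
struct Inputs {
  bool HasMappingSymbols; // simplification: one flag for both checks
  uint8_t SymType;        // type of the symbol at the start of the range
  bool DisassembleAll;    // true when --disassemble-all was given
};

// Old criterion, modelled on the removed CheckARMELFData expression.
bool consultMappingSymbolsOld(const Inputs &In) {
  return In.HasMappingSymbols && In.SymType != STT_OBJECT &&
         !In.DisassembleAll;
}

// New criterion: use mapping symbols whenever they are present.
bool consultMappingSymbolsNew(const Inputs &In) {
  return In.HasMappingSymbols;
}
```

For a function symbol in a code section with `--disassemble-all` set, the old predicate is false, so `$t` markers are ignored and Thumb code is decoded as Arm, while the new predicate is true. Without `--disassemble-all`, both agree for non-STT_OBJECT symbols.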

Symbols[SI].Type != ELF::STT_OBJECT &&

!DisassembleAll;

bool DumpARMELFData = false; bool DumpARMELFData = false;

jhenderson:

Same comment as above.

formatted_raw_ostream FOS(outs()); formatted_raw_ostream FOS(outs());

std::unordered_map<uint64_t, std::string> AllLabels; std::unordered_map<uint64_t, std::string> AllLabels;

std::unordered_map<uint64_t, std::vector<std::string>> BBAddrMapLabels; std::unordered_map<uint64_t, std::vector<std::string>> BBAddrMapLabels;

if (SymbolizeOperands) { if (SymbolizeOperands) {

collectLocalBranchTargets(Bytes, MIA, DisAsm, IP, PrimarySTI, collectLocalBranchTargets(Bytes, MIA, DisAsm, IP, PrimarySTI,

SectionAddr, Index, End, AllLabels); SectionAddr, Index, End, AllLabels);

collectBBAddrMapLabels(AddrToBBAddrMap, SectionAddr, Index, End, collectBBAddrMapLabels(AddrToBBAddrMap, SectionAddr, Index, End,

BBAddrMapLabels); BBAddrMapLabels);

} }

while (Index < End) { while (Index < End) {

// ARM and AArch64 ELF binaries can interleave data and text in the // ARM and AArch64 ELF binaries can interleave data and text in the

// same section. We rely on the markers introduced to understand what // same section. We rely on the markers introduced to understand what

// we need to dump. If the data marker is within a function, it is // we need to dump. If the data marker is within a function, it is

// denoted as a word/short etc. // denoted as a word/short etc.

if (CheckARMELFData) { if (!MappingSymbols.empty()) {

char Kind = getMappingSymbolKind(MappingSymbols, Index); char Kind = getMappingSymbolKind(MappingSymbols, Index);

DumpARMELFData = Kind == 'd'; DumpARMELFData = Kind == 'd';

if (SecondarySTI) { if (SecondarySTI) {

if (Kind == 'a') { if (Kind == 'a') {

STI = PrimaryIsThumb ? SecondarySTI : PrimarySTI; STI = PrimaryIsThumb ? SecondarySTI : PrimarySTI;

DisAsm = PrimaryIsThumb ? SecondaryDisAsm : PrimaryDisAsm; DisAsm = PrimaryIsThumb ? SecondaryDisAsm : PrimaryDisAsm;

} else if (Kind == 't') { } else if (Kind == 't') {

STI = PrimaryIsThumb ? PrimarySTI : SecondarySTI; STI = PrimaryIsThumb ? PrimarySTI : SecondarySTI;

(1,224 unchanged lines not shown)

static void parseObjdumpOptions(const llvm::opt::InputArgList &InputArgs) {

LeadingAddr = !InputArgs.hasArg(OBJDUMP_no_leading_addr); LeadingAddr = !InputArgs.hasArg(OBJDUMP_no_leading_addr);

RawClangAST = InputArgs.hasArg(OBJDUMP_raw_clang_ast); RawClangAST = InputArgs.hasArg(OBJDUMP_raw_clang_ast);

Relocations = InputArgs.hasArg(OBJDUMP_reloc); Relocations = InputArgs.hasArg(OBJDUMP_reloc);

PrintImmHex = PrintImmHex =

InputArgs.hasFlag(OBJDUMP_print_imm_hex, OBJDUMP_no_print_imm_hex, false); InputArgs.hasFlag(OBJDUMP_print_imm_hex, OBJDUMP_no_print_imm_hex, false);

PrivateHeaders = InputArgs.hasArg(OBJDUMP_private_headers); PrivateHeaders = InputArgs.hasArg(OBJDUMP_private_headers);

FilterSections = InputArgs.getAllArgValues(OBJDUMP_section_EQ); FilterSections = InputArgs.getAllArgValues(OBJDUMP_section_EQ);

SectionHeaders = InputArgs.hasArg(OBJDUMP_section_headers); SectionHeaders = InputArgs.hasArg(OBJDUMP_section_headers);

ShowAllSymbols = InputArgs.hasArg(OBJDUMP_show_all_symbols);

ShowLMA = InputArgs.hasArg(OBJDUMP_show_lma); ShowLMA = InputArgs.hasArg(OBJDUMP_show_lma);

PrintSource = InputArgs.hasArg(OBJDUMP_source); PrintSource = InputArgs.hasArg(OBJDUMP_source);

parseIntArg(InputArgs, OBJDUMP_start_address_EQ, StartAddress); parseIntArg(InputArgs, OBJDUMP_start_address_EQ, StartAddress);

HasStartAddressFlag = InputArgs.hasArg(OBJDUMP_start_address_EQ); HasStartAddressFlag = InputArgs.hasArg(OBJDUMP_start_address_EQ);

parseIntArg(InputArgs, OBJDUMP_stop_address_EQ, StopAddress); parseIntArg(InputArgs, OBJDUMP_stop_address_EQ, StopAddress);

HasStopAddressFlag = InputArgs.hasArg(OBJDUMP_stop_address_EQ); HasStopAddressFlag = InputArgs.hasArg(OBJDUMP_stop_address_EQ);

SymbolTable = InputArgs.hasArg(OBJDUMP_syms); SymbolTable = InputArgs.hasArg(OBJDUMP_syms);

SymbolizeOperands = InputArgs.hasArg(OBJDUMP_symbolize_operands); SymbolizeOperands = InputArgs.hasArg(OBJDUMP_symbolize_operands);

(162 unchanged lines not shown)

Revision Contents

Diff 455194

llvm/test/tools/llvm-objdump/ELF/ARM/disassemble-all-mapping-symbols.s

llvm/test/tools/llvm-objdump/ELF/data-vs-code-priority.s

llvm/test/tools/llvm-objdump/multiple-symbols-mangling.s

llvm/test/tools/llvm-objdump/multiple-symbols.s

llvm/tools/llvm-objdump/ObjdumpOpts.td

llvm/tools/llvm-objdump/llvm-objdump.cpp
