This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objdump] Add option to sort symbols during disassembly
Needs ReviewPublic

Authored by stephenneuendorffer on Mar 2 2020, 5:58 PM.

Details

Summary

Note: This only sorts the symbols within the same section.

Event Timeline

Is there a particular use case for this feature?

The loop that prints each symbol might have some more assumptions that it's printed in address order; it's pretty complicated so I didn't fully review it, but I'm not confident this won't introduce some subtle bug. For an initial test, you might want to run llvm-objdump -d --sort on some large binary, and make sure that it's equivalent to llvm-objdump -d after you've reordered the sections with a different post-processing script.

llvm/test/tools/llvm-objdump/X86/Inputs/sort-test.yaml
2

This doesn't appear to be an actual test. Can you add some test assertions, e.g. RUN: lines?

llvm/tools/llvm-objdump/llvm-objdump.cpp
299

Just --sort does not seem specific enough as a flag name. E.g., should we sort section headers by name? Or what about sorting symbols when *not* disassembling, i.e. llvm-objdump --syms --sort?

1330–1333

sortedSymbols should be preallocated (sortedSymbols.resize(Symbols.size())) to avoid the performance hit when increasing the size every time.
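
A minimal sketch of the preallocation point, using standard-library types as stand-ins for the LLVM ones. `SymbolEntry` and `sortedByName` are hypothetical names, not taken from the patch. One assumption worth stating: if the vector is filled with `push_back`, `reserve` is the call that avoids repeated reallocation without creating default-constructed elements, while `resize` fits when elements are assigned by index.

```cpp
#include <algorithm>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the per-symbol record used during disassembly.
struct SymbolEntry {
  std::uint64_t Addr;
  std::string_view Name; // non-owning view, similar in spirit to llvm::StringRef
};

// Returns a copy of Symbols ordered by name; equal names keep their input order.
std::vector<SymbolEntry> sortedByName(const std::vector<SymbolEntry> &Symbols) {
  std::vector<SymbolEntry> Sorted;
  Sorted.reserve(Symbols.size()); // single allocation up front
  for (const SymbolEntry &S : Symbols)
    Sorted.push_back(S);
  std::stable_sort(Sorted.begin(), Sorted.end(),
                   [](const SymbolEntry &A, const SymbolEntry &B) {
                     return A.Name < B.Name;
                   });
  return Sorted;
}
```

Because only views are copied, the names themselves are never duplicated; the sorted vector is valid only as long as the underlying name storage is.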

1342–1343

Demangling should happen earlier, so that the symbols are also sorted when demangled

Hi, can you describe the use case you have in mind? Is that for a more stable disassembly output when there can be several symbols defined at the same address?

There is no guarantee for the .symtab order but in practice it has some patterns. Both GNU as and llvm-mc sort symbols in object files. Linkers (GNU ld, gold, lld) seem to use a visiting order. So, say, you have several aliases:

.globl foo, _foo, __foo, ___foo
_foo:
__foo:
foo:
___foo:
  ret

After you add or delete instructions, the order among these *foo should not change.

For one thing, llvm-objdump currently uses the symbol table order while I think preferring STB_GLOBAL/STB_WEAK to STB_LOCAL may be nice. GNU objdump may have some other heuristics. I'll dig into its source.
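
To make the binding preference concrete, here is a rough sketch of one possible ordering: by address, with non-local symbols ahead of local ones at the same address, and symbol-table order preserved otherwise. The `Sym` record, the `Binding` enum, and `preferNonLocal` are hypothetical; this is not what llvm-objdump or GNU objdump is known to do, only an illustration of the heuristic floated above.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Simplified binding kinds, mirroring STB_LOCAL / STB_GLOBAL / STB_WEAK.
enum class Binding { Local, Global, Weak };

// Hypothetical symbol record for illustration only.
struct Sym {
  std::uint64_t Addr;
  Binding Bind;
  std::string Name;
};

// Orders by address; at equal addresses, non-local symbols come first.
// std::stable_sort keeps the original (symbol table) order for remaining ties.
void preferNonLocal(std::vector<Sym> &Syms) {
  std::stable_sort(Syms.begin(), Syms.end(), [](const Sym &A, const Sym &B) {
    if (A.Addr != B.Addr)
      return A.Addr < B.Addr;
    bool ALocal = A.Bind == Binding::Local;
    bool BLocal = B.Bind == Binding::Local;
    return !ALocal && BLocal;
  });
}
```

With aliases such as `_foo` (local) and `foo` (global) at one address, this would list `foo` first regardless of where each appears in the symbol table.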

stephenneuendorffer marked 3 inline comments as done. Mar 3 2020, 7:47 PM

The point of this is to disassemble symbols in a section in a fixed order. This might be useful because linking has moved symbols around, or because some other linker disassembles symbols in alphabetical order and one wants to compare. Basically, I found this useful and I thought other people might. It was much easier to modify llvm-objdump directly than to post-process the disassembly.

llvm/tools/llvm-objdump/llvm-objdump.cpp
299

I agree, but I struggled with what else to name it. --sort-symbols-when-disassembling? --sort-symbols, and then change other corresponding modes? I'm new to this code, so I'm not sure what other modes are interesting/appropriate.

1330–1333

Good point.

1342–1343

Makes sense.

MaskRay added a comment. Edited Mar 3 2020, 9:04 PM

The point of this is to disassemble symbols in a section in a fixed order. This might be useful because linking has moved symbols around, or because some other linker disassembles symbols in alphabetical order and one wants to compare. Basically, I found this useful and I thought other people might. It was much easier to modify llvm-objdump directly than to post-process the disassembly.

As I commented at https://reviews.llvm.org/D75498#1904244, "some other linker disassembles symbols in alphabetical order" does not match what I have observed with GNU ld, gold and lld. Are you using a different linker or am I missing something?

If the symbol resolution does not change and functions merely have more or fewer instructions, the order should not change.

As I commented at https://reviews.llvm.org/D75498#1904244, "some other linker disassembles symbols in alphabetical order" does not match what I have observed with GNU ld, gold and lld. Are you using a different linker or am I missing something?

You're not missing anything. The linker in question is not an open source linker.

stephenneuendorffer marked an inline comment as done.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1342–1343

Except... that this means either allocating space for all the demangled strings, which would carry a more significant memory cost, or demangling during sorting, which would carry a more significant runtime cost. Currently, the sorting just happens on StringRefs.
My intuition says that mangling shouldn't change the resulting order significantly (although it might mean that a mangled symbol that demangles to foo(...) gets ordered differently relative to a non-mangled symbol _foo), but this seems minimally concerning to me.
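
As a small illustration of the reordering concern, the sketch below sorts the same two symbols once by mangled name and once by demangled name. The demangled spellings are hard-coded (`_Z3foov` is the Itanium mangling of `foo()`); a real tool would call a demangler instead. `NamedSymbol` and `orderBy` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record pairing a symbol's raw name with its demangled form.
struct NamedSymbol {
  std::string Mangled;
  std::string Demangled; // hard-coded here; normally produced by a demangler
};

// Sorts by the chosen key and returns the keys in the resulting order.
std::vector<std::string> orderBy(std::vector<NamedSymbol> Syms,
                                 bool UseDemangled) {
  auto Key = [UseDemangled](const NamedSymbol &S) -> const std::string & {
    return UseDemangled ? S.Demangled : S.Mangled;
  };
  std::stable_sort(Syms.begin(), Syms.end(),
                   [&](const NamedSymbol &A, const NamedSymbol &B) {
                     return Key(A) < Key(B);
                   });
  std::vector<std::string> Out;
  Out.reserve(Syms.size());
  for (const NamedSymbol &S : Syms)
    Out.push_back(Key(S));
  return Out;
}
```

Sorting `_foo` and `_Z3foov` by mangled name puts `_Z3foov` first, while sorting by demangled name puts `_foo` ahead of `foo()`, so the two keys can disagree even on a two-symbol input.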

stephenneuendorffer marked 2 inline comments as done. Mar 4 2020, 3:26 PM
MaskRay added a comment. Edited Mar 4 2020, 4:05 PM

As I commented at https://reviews.llvm.org/D75498#1904244, "some other linker disassembles symbols in alphabetical order" does not match what I have observed with GNU ld, gold and lld. Are you using a different linker or am I missing something?

You're not missing anything. The linker in question is not an open source linker.

Shouldn't that linker be fixed instead? It is usually a simple data structure replacement... For example, unordered_map -> MapVector.

Sorry, but I am going to push back a bit here. A non-deterministic symbol order also affects other tools like readelf/nm. Fixing the linker is likely the simplest way to make every binary manipulation tool happy.
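
For readers unfamiliar with the suggested replacement, the following is a rough standard-library sketch of the idea behind an insertion-ordered map such as llvm::MapVector: entries live in a vector, and a hash map only indexes into it. `OrderedSymbolMap` is a hypothetical class written for this illustration and says nothing about how the linker in question is actually implemented.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical insertion-ordered map: iteration follows insertion order,
// not the (unspecified) bucket order of the hash table.
class OrderedSymbolMap {
public:
  using Entry = std::pair<std::string, std::uint64_t>;

  // Inserts Name if absent; returns false if it was already present.
  bool insert(const std::string &Name, std::uint64_t Addr) {
    auto [It, Inserted] = Index.try_emplace(Name, Entries.size());
    if (Inserted)
      Entries.emplace_back(Name, Addr);
    return Inserted;
  }

  // Entries in the order they were first inserted.
  const std::vector<Entry> &entries() const { return Entries; }

private:
  std::vector<Entry> Entries;
  std::unordered_map<std::string, std::size_t> Index;
};
```

Emitting symbols by walking `entries()` gives the same order on every run for the same input, whereas walking a `std::unordered_map` directly gives an order the standard leaves unspecified.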

As I commented at https://reviews.llvm.org/D75498#1904244, "some other linker disassembles symbols in alphabetical order" does not match what I have observed with GNU ld, gold and lld. Are you using a different linker or am I missing something?

You're not missing anything. The linker in question is not an open source linker.

Shouldn't that linker be fixed instead? It is usually a simple data structure replacement... For example, unordered_map -> MapVector.

Sorry, but I am going to push back a bit here. A non-deterministic symbol order also affects other tools like readelf/nm. Fixing the linker is likely the simplest way to make every binary manipulation tool happy.

The short answer is no, we can't fix the other linker. I don't mind the pushback and I don't mind if you say "We don't want this." It's something that I'd rather not carry around internally, but maybe that's just what has to be done. :)

I might be getting the wrong end of the stick here, but since symbol tables already have to place local symbols before global symbols, isn't it possible (indeed quite likely) for linked output symbols to be in an order other than address order? As far as I know, this ordering requirement is in fact the only requirement in the ELF gABI, and I think we should be writing our tools based on that rather than on decisions that some linkers have made. In other words, I think we need to be able to nicely handle symbols in different orders. Whether that really justifies a new option, I'm not sure either way, but it's also worth noting that llvm-nm already provides options to sort the symbols it prints (and indeed does some sorting by default), so I think an argument can be made that there's some prior art here, especially if the switch doesn't introduce much complexity (it doesn't look like it does).

Is there a particular use case for this feature?

The loop that prints each symbol might have some more assumptions that it's printed in address order; it's pretty complicated so I didn't fully review it, but I'm not confident this won't introduce some subtle bug. For an initial test, you might want to run llvm-objdump -d --sort on some large binary, and make sure that it's equivalent to llvm-objdump -d after you've reordered the sections with a different post-processing script.

I've not looked at the loop, but unless it handles local and global symbols differently, there can't be any assumption about address order, since locals always appear before globals in the symbol table, regardless of address order.

It sounds like @jhenderson is mildly positive on this. Is there consensus that something like this should go in? It looks like llvm-nm sorts by name by default and hence doesn't have an option for that, but it provides:

-no-sort                             - Show symbols in order encountered
-numeric-sort                        - Sort symbols by address
-reverse-sort                        - Sort in reverse order
-size-sort                           - Sort symbols by size

So maybe "-alphabetical-sort" would be partially consistent?
I'm looking for other suggestions on what to name this.

So maybe "-alphabetical-sort" would be partially consistent?
I'm looking for other suggestions on what to name this.

If we go ahead with it, I'd recommend "--lexicographical-sort" or something like that, rather than "alphabetical".
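
A short sketch of why "lexicographical" may describe the behaviour better than "alphabetical", assuming the sort is a plain byte-wise string comparison (as `operator<` on `std::string` is). `lexicographicalSort` is a hypothetical helper, not part of the patch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts names by byte-wise comparison. In ASCII, uppercase letters precede
// '_', which in turn precedes lowercase letters.
std::vector<std::string> lexicographicalSort(std::vector<std::string> Names) {
  std::sort(Names.begin(), Names.end());
  return Names;
}
```

Given `foo`, `_foo`, `Zed` and `bar`, this yields `Zed`, `_foo`, `bar`, `foo`, which is not the order a dictionary-style alphabetical sort would produce.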