This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objdump] Symbolize binary addresses for low-noisy asm diff.
ClosedPublic

Authored by hoyFB on Jul 20 2020, 10:16 AM.

Details

Summary

When diffing disassembly dump of two binaries, I see lots of noises from mismatched jump target addresses and global data references, which unnecessarily causes diffs on every function, making it impractical. I'm trying to symbolize the raw binary addresses to minimize the diff noise.
In this change, a local branch target is modeled as a label and the branch target operand will simply be printed as a label. Local labels are collected by a separate pre-decoding pass beforehand. A global data memory operand will be printed as a global symbol instead of the raw data address. Unfortunately, due to the way the disassembler is set up and to be less intrusive, a global symbol is always printed as the last operand of a memory access instruction. This is less than ideal but is probably acceptable from checking code quality point of view since on most targets an instruction can have at most one memory operand.

So far only the X86 disassemblers are supported.

Test Plan:

llvm-objdump -d --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :

Disassembly of section .text:

<_start>:
               	push	rax
               	mov	dword ptr [rsp + 4], 0
               	mov	dword ptr [rsp], 0
               	mov	eax, dword ptr [rsp]
               	cmp	eax, dword ptr [rip + 4112]  # 202182 <g>
               	jge	0x20117e <_start+0x25>
               	call	0x201158 <foo>
               	inc	dword ptr [rsp]
               	jmp	0x201169 <_start+0x10>
               	xor	eax, eax
               	pop	rcx
               	ret

llvm-objdump -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr :

Disassembly of section .text:

<_start>:
               	push	rax
               	mov	dword ptr [rsp + 4], 0
               	mov	dword ptr [rsp], 0
<L1>:
               	mov	eax, dword ptr [rsp]
               	cmp	eax, dword ptr  <g>
               	jge	 <L0>
               	call	 <foo>
               	inc	dword ptr [rsp]
               	jmp	 <L1>
<L0>:
               	xor	eax, eax
               	pop	rcx
               	ret

Note that the jump instructions like jge 0x20117e <_start+0x25> without this work is printed as a real target address and an offset from the leading symbol. With a change in the optimizer that adds/deletes an instruction, the address and offset may shift for targets placed after the instruction. This will be a problem when diffing the disassembly from two optimizers where there are unnecessary false positives due to such branch target address changes. With --symbolize-operand, a label is printed for a branch target instead to reduce the false positives. Similarly, the disassemble of PC-relative global variable references is also prone to instruction insertion/deletion.

Diff Detail

Event Timeline

hoyFB created this revision.Jul 20 2020, 10:16 AM
hoyFB edited the summary of this revision. (Show Details)Jul 20 2020, 10:17 AM
hoyFB added a reviewer: wenlei.
hoyFB updated this revision to Diff 279300.Jul 20 2020, 10:19 AM

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

hoyFB updated this revision to Diff 279343.Jul 20 2020, 2:08 PM

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

MaskRay added inline comments.Jul 20 2020, 4:05 PM
llvm/test/tools/llvm-objdump/X86/disassemble-symbolize-operands.s
3

lld is a subproject and depends on llvm. You cannot use ld.lld in llvm tests.

Could you post some examples of what the actual difference in output is, please? I don't quite follow from the tests/description what is intended to change as a result of this option.

Also, if you are adding a new option to llvm-objdump, please remember to document it in llvm/docs/CommandGuide/llvm-objdump.rst.

hoyFB edited the summary of this revision. (Show Details)Jul 21 2020, 10:17 AM
hoyFB marked 2 inline comments as done.Jul 21 2020, 10:29 AM

Could you post some examples of what the actual difference in output is, please? I don't quite follow from the tests/description what is intended to change as a result of this option.

Also, if you are adding a new option to llvm-objdump, please remember to document it in llvm/docs/CommandGuide/llvm-objdump.rst.

The main difference in output is to print branch target as labels instead of real addresses, also for PC-relative global references. I updated the summary. Please let me know if it is clear. Thanks.

llvm/test/tools/llvm-objdump/X86/disassemble-symbolize-operands.s
3

The original test seemed to pass. Did that end up using the pre-installed lld on the test machine?

I switched to using a prebuilt binary instead.

hoyFB updated this revision to Diff 279570.Jul 21 2020, 10:29 AM
hoyFB marked an inline comment as done.

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

hoyFB updated this revision to Diff 279571.Jul 21 2020, 10:31 AM

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

MaskRay added inline comments.Jul 21 2020, 11:22 AM
llvm/docs/CommandGuide/llvm-objdump.rst
206 ↗(On Diff #279571)

You probably want to mean a linked image (not a relocatable object file).

Capitalized Intel isn't supported. att is much more common and --x86-asm-syntax=intel is rare. As a related fact, LLVM's Intel syntax assembler gets less scrutiny and tends to have some problems.

llvm/test/tools/llvm-objdump/X86/disassemble-symbolize-operands.test
1 ↗(On Diff #279571)

Please avoid a checked-in executable. Its content is difficult to inspect.
See X86/elf-disassemble-symbol-references.yaml for a yaml2obj example.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1576

I wonder why you need this. For BOLT or similar post-link rewriting stuff?

Thanks for the example, they helped me understand much better what you're trying to achieve and +1 to the idea definitely. In a weird coincidence, one of our toolchain team only last week demo'ed to the team a script to make comparing diassembly much less noisy. This option may help with some of that.

llvm/docs/CommandGuide/llvm-objdump.rst
202 ↗(On Diff #279571)

instead of real -> instead of a real

jhenderson added inline comments.Jul 22 2020, 1:37 AM
llvm/docs/CommandGuide/llvm-objdump.rst
202–204 ↗(On Diff #279571)

It might be worth including a couple of simple examples of what gets printed with the option. I found the example you posted in the patch description a much clearer way of expressing this compared to just words (you obviously should still keep the words).

206 ↗(On Diff #279571)

Is there anything fundamentally preventing this from working with the default assembly syntax or relocatable objects? Or is it just that you haven't implemented it yet?

208 ↗(On Diff #279571)

Nit: delete second blank line.

llvm/include/llvm/MC/MCInstPrinter.h
68

"symbolize branch target and memory reference operands"

(no need to repeat operands twice this way)

llvm/test/tools/llvm-objdump/X86/disassemble-symbolize-operands.test
1 ↗(On Diff #279571)

You probably want to use FileCheck's --match-full-lines to show that you are capturing all the content on the line and that there isn't any additional text before/after the thing you check for.

4–10 ↗(On Diff #279571)

Consider using CHECK-NEXT, rather than CHECK, if possible.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1580–1583

I don't think X86 is exempt from potentially having data embedded in the text section. I think I've seen jump tables that are, for example.

It's not clear to me, why a target having potential data embedding is an issue?

1594

I'd be slightly concerned by this additional pass over the disassembly, due to performance reasons, but I guess it's behind a switch, so it's not that big a deal. Plus, I can't think how you'd identify where the labels should go otherwise, since they might appear before the reference.

hoyFB updated this revision to Diff 279934.Jul 22 2020, 1:33 PM
hoyFB marked 13 inline comments as done.

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

hoyFB edited the summary of this revision. (Show Details)Jul 22 2020, 1:34 PM
hoyFB added inline comments.
llvm/docs/CommandGuide/llvm-objdump.rst
206 ↗(On Diff #279571)

Yes, a linked image is more accurate.

Changed Intel to lower case.

The att syntax is now supported too.

206 ↗(On Diff #279571)

We need a tool to compare executable image since very often the immediate object files are not available.

Regarding the assembly syntax, we are mostly comfortable reading the intel syntax. But since the att syntax is more common, I'll add support for that.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1576

Good question. This is needed when doing optimization work where we compare code quality by checking binary disassembly.

1580–1583

Yeah, on X86 a jump table could be placed inside a function as a .data instruction. So literally it is still an instruction. I noticed that on Arm there might be non-instruction data encoded in the .text section and special handling is needed when disassembling instructions. I removed the comment to reduce the confusion.

1594

Exactly, two-pass disassembling is needed to support backward jumps.

hoyFB marked an inline comment as done.Jul 22 2020, 2:22 PM
hoyFB added inline comments.
llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml
9 ↗(On Diff #279934)

This is less than ideal because of the current less intrusive implementation. It's probably fine from asm diffing standpoint. To fix this we may want to pass the symbol/label resolution result from the common code into the per-target inst printer and print the symbols right in place for a memory operand. But that may cause changes to the core MCInstPrinter interface and the callbacks to initialize target inst printers.

hoyFB updated this revision to Diff 279954.Jul 22 2020, 3:15 PM

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

It looks like you might still have a binary file in the patch, that is no longer needed?

llvm/docs/CommandGuide/llvm-objdump.rst
206 ↗(On Diff #279571)

I'm not sure you didn't misunderstand my comment earlier. Let's say a user does have a single intermediate object file they want to disassemble, and it has plenty of references within itself. These addresses will be present, thus comparing it against a different version of the same object might produce a noisy diff. It sounds like this option would be helpful there, so why doesn't it work?

209–212 ↗(On Diff #279954)

I would actively include multi-line output snippets as your examples rather than using prose. Something like:
"

jge 0x20117e <_start+0x25>
dword ptr [rip + 4112]

might become

<g>:
  jge  <L0>
  dword ptr [rip + 4112]
<L0>:

"
(Please edit the above to reflect a theoretically real output. Also, please use the rst support for code examples. I think there will be other examples, e.g. in the llvm-symbolizer documentation.)

llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml
2–3 ↗(On Diff #279954)

These two lines are quite long. I'd recommend you split them over multiple lines like so:

# RUN: llvm-objdump %t -d --symbolize-operands --x86-asm-syntax=att --no-show-raw-insn --no-leading-addr | \
# RUN:   FileCheck %s --match-full-lines --check-prefix=ATT
9 ↗(On Diff #279954)

What's with the spaces before the ',' here.

15–22 ↗(On Diff #279954)

Here, and in the ATT example above, I'd add leading spaces, so that everything is lined up like it would be in the actual output. In other words, I'd expect something a bit like:

# INTEL:      <_start>:
# INTEL-NEXT:   push rax
# INTEL-NEXT: <L1>:
# INTEL-NEXT:   cmp  eax, dword ptr <g>
# INTEL-NEXT:   jge <L0>
# INTEL-NEXT:   jmp <L1>
# INTEL-NEXT: <L0>:
# INTEL-NEXT:   ret
44 ↗(On Diff #279954)

g isn't obviously an arbitrary symbol name to me. I'd suggest using something more obvious, e.g. simply symbol.

9 ↗(On Diff #279934)

It's not clear to me what is less than ideal here? What would you prefer in an ideal situation?

hoyFB marked 7 inline comments as done.Jul 24 2020, 9:19 AM

It looks like you might still have a binary file in the patch, that is no longer needed?

The binary file should have been removed.

llvm/docs/CommandGuide/llvm-objdump.rst
206 ↗(On Diff #279571)

Thanks for the clarification. Yes, we have a fundamental issue with disassembling object files because the symbol resolution unit (in llvm-objdump.cpp) doesn't handle relocations. For local jump targets this is probably fine since the jump distance is computed from the start of the current function. For global jump targets like a function call target or a global variable access, the symbolization will not work without considering the relocations. This is true even without symbolization where a call target will be disassembled as an offset from the current function.

llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml
9 ↗(On Diff #279954)

It is for the alignment of the first operand which isn't printed with the symbolization.

9 ↗(On Diff #279934)

Sorry, I should have commented on this line : # ATT-NEXT: cmpl , %eax <g> . Ideally I'd like something like cmpl <g>, %eax.

hoyFB updated this revision to Diff 280489.Jul 24 2020, 9:20 AM

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

jhenderson added inline comments.Aug 3 2020, 1:27 AM
llvm/docs/CommandGuide/llvm-objdump.rst
206 ↗(On Diff #280489)

You probably want to mention that this option currently only supports X86 output.

211 ↗(On Diff #280489)

As this is associated with the option, it needs indenting with a couple of spaces, just like the text describing the option. Same goes below. If you take a look at the output, you'll see that adding a couple of spaces before should cause the code to be shifted slightly to the right.

llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml
9 ↗(On Diff #279954)

So I think we want to fix this, but I a) am not familiar enough with the MCInstPrinter etc code to have a good recommendation, and b) think it could be deferred to a follow-up patch. Perhaps @MaskRay has a suggestion?

17 ↗(On Diff #280489)

Nit: add one more space (the '<' doesn't line up with those below currently).

hoyFB marked 3 inline comments as done.Aug 4 2020, 2:43 PM
hoyFB added inline comments.
llvm/docs/CommandGuide/llvm-objdump.rst
211 ↗(On Diff #280489)

Done adding indentation.

BTW, how should I see the the out of the .rst file?

hoyFB updated this revision to Diff 283033.Aug 4 2020, 2:46 PM
hoyFB marked an inline comment as done.

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

jhenderson accepted this revision.Aug 5 2020, 12:26 AM

Looks good from my point of view, but please wait for @MaskRay.

llvm/docs/CommandGuide/llvm-objdump.rst
211 ↗(On Diff #280489)

You need to set LLVM_ENABLE_SPHINX to on in your CMake configuration and then build the docs-llvm-html target. See https://llvm.org/docs/GettingStarted.html for details. Depending on your system configuration, you might need to install Sphinx and (I think) recommonmark too.

This revision is now accepted and ready to land.Aug 5 2020, 12:26 AM
hoyFB added a comment.Aug 11 2020, 9:41 AM

@MaskRay How does this change look to you? Please let me know your comments. Thanks!

MaskRay added inline comments.Aug 11 2020, 9:47 AM
llvm/docs/CommandGuide/llvm-objdump.rst
213 ↗(On Diff #283033)

Delete the trailing space

217 ↗(On Diff #283033)

Delete the trailing space

llvm/tools/llvm-objdump/llvm-objdump.cpp
1576

I'm reading this function now.

1929

I'll generally refrain from DenseMap on integer types because -1 and -2 cannot be used as DenseMap keys. I'd use unordered_map to avoid gotchas (even if in this context it may work)

MaskRay added inline comments.Aug 11 2020, 9:51 AM
llvm/tools/llvm-objdump/llvm-objdump.cpp
327

This needs to be ObjdumpCat

I have seen a similar feature in GNU objdump. FWIW CCed you on https://sourceware.org/pipermail/binutils/2020-August/112803.html

MaskRay requested changes to this revision.EditedAug 11 2020, 10:26 AM

I think the approach looks good. The rough corners (excess spaces) should be fixable. This should be good to go once some remaining stuff are fixed. I CCed you on binutils but don't expect they to block our progress. Just hope we can take inspirations from them if any:) (And ideally we can find some common ground, though I don't expect strongly on it)

llvm/include/llvm/MC/MCInstPrinter.h
52

= nullptr

MCInstPrinter is used by non-objdump tools. This will be an uninitialized read.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1602

Target >= Start && Target < End can be moved before !Labels.count(Target)

1603

("L" + Twine(LabelCount++)).str()

This revision now requires changes to proceed.Aug 11 2020, 10:26 AM
hoyFB marked 6 inline comments as done.Aug 11 2020, 1:47 PM
hoyFB added inline comments.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1929

Good to know -1 and -2 are not allowed to be keys with DenseMap. Changed to unordered_map.

hoyFB updated this revision to Diff 284881.Aug 11 2020, 1:48 PM
hoyFB marked an inline comment as done.

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

hoyFB added a comment.Aug 11 2020, 1:50 PM

I have seen a similar feature in GNU objdump. FWIW CCed you on https://sourceware.org/pipermail/binutils/2020-August/112803.html

Thanks! Good to know there's similar need to the GNU tool.

The last update looks good. Let's wait a bit for binutils' comment.. I also have a question about whether we should name the option --symbolize-operands, or is there a better name. It drops the existing offset/address operand as well as adding a symbol.

llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml
1 ↗(On Diff #284881)

You can omit --docnum=1 since it is the only document

llvm/tools/llvm-objdump/llvm-objdump.cpp
2074
hoyFB marked 2 inline comments as done.Aug 11 2020, 10:37 PM

The last update looks good. Let's wait a bit for binutils' comment.. I also have a question about whether we should name the option --symbolize-operands, or is there a better name. It drops the existing offset/address operand as well as adding a symbol.

Printing a symbol instead of an offset/address is sort of symbolization to me. That said, a more meaningful name is welcomed.

hoyFB updated this revision to Diff 284967.Aug 11 2020, 10:38 PM

Updating D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

MaskRay accepted this revision.Aug 14 2020, 5:04 PM

In the absence of binutils responses, let's do this. please consider pushing it at some point next week.

This revision is now accepted and ready to land.Aug 14 2020, 5:04 PM
This revision was landed with ongoing or failed builds.Aug 17 2020, 4:55 PM
This revision was automatically updated to reflect the committed changes.
dblaikie added inline comments.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1929

(FWIW for signed integers, DenseMap's int tombstones are int max and int min - for unsigned integral types it's int max, and int max - 1 - but yeah, if you need the complete range of a 64 bit integer, DenseMap won't suit because of the need for empty/tombstone values)

thakis added a subscriber: thakis.Apr 13 2021, 5:52 PM
thakis added inline comments.
llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml
2 ↗(On Diff #286176)

Does --x86-asm-syntax=intel do something different from -M Intel? The latter is the normal objdump name for this, so I'm wondering if the flag doesn't do what you need, or if you didn't use it for a different reason.

hoy added a subscriber: hoy.Apr 13 2021, 6:00 PM
hoy added inline comments.
llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml
2 ↗(On Diff #286176)

I didn't realize -M Intel worked with llvm-objdump. It worked with objdump though.

llvm-objdump -d -M Intel a.out
a.out:	file format elf64-x86-64
llvm-objdump: error: 'a.out': Unrecognized disassembler option: Intel

Am I missing anything?