This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objdump] -d: delete spaces among raw instruction bytes
AbandonedPublic

Authored by MaskRay on May 3 2020, 2:09 PM.

Details

Summary

On a disassembly line, the raw bytes are relatively less important.
--no-show-raw-insn can hide them but the option is not the default.
Delete spaces to leave columns to most useful stuff (for example, I plan
to implement visual arrows similar to --visual-jumps added in GNU
objdump 2.34, but I'd take more inspirations from radare2's pd command).

Also move the instruction left by 8 columns. This is especially helpful
for targets with x86, which has potentially long instructions.
Before:

2015c4: ff 15 be 12 00 00             callq   *4798(%rip)  # 202888 <crtstuff.c+0x202888>
2015ca: f4                            hlt
2015cb: 0f 1f 44 00 00                nopl    (%rax,%rax)

After (8 columns narrower):

2015c4: ff15be120000          callq   *4798(%rip)  # 202888 <crtstuff.c+0x202888>
2015ca: f4                    hlt
2015cb: 0f1f440000            nopl    (%rax,%rax)

Spaces are not deleted for non-x86 because otherwise the patch would
have to touch ~700 more tests. I may update them based on feedback.
This diff just includes some representative test changes. 40+ test
changes are omitted.

AArch64 instructions are encoded in little-endian and GNU objdump prints
04030201 for byte sequence 01 02 03 04. Personally I find the
hexpair order annoying because I prefer the way bytes are presented in a
hex editor.

Diff Detail

Event Timeline

MaskRay created this revision.May 3 2020, 2:09 PM
joerg added a subscriber: joerg.May 3 2020, 3:06 PM

I'm not sure about this change. It might make sense on x86, but many RISC architectures love to put data into the instruction stream and being able to pick them out is quite important for understanding the disassembly.

MaskRay added a comment.EditedMay 3 2020, 3:46 PM

I'm not sure about this change. It might make sense on x86, but many RISC architectures love to put data into the instruction stream and being able to pick them out is quite important for understanding the disassembly.

I don't understand your comment. This patch delete spaces to make the output compacter, how does that make RISC architectures unhappy?

joerg added a comment.May 3 2020, 4:06 PM

The more compact output is a lot harder to read when you have to find embedded data. I wouldn't mind dropping half the spaces to have columns of 4 hex digits, but anything more is really hard to read.

MaskRay added a comment.EditedMay 3 2020, 4:22 PM

The more compact output is a lot harder to read when you have to find embedded data. I wouldn't mind dropping half the spaces to have columns of 4 hex digits, but anything more is really hard to read.

Embedded data is like the following:

// llvm/test/tools/llvm-objdump/ELF/ARM/disassemble-code-data-mix.s (also see D79284)
// ARM .byte/.short/.word
00000020 <$d.1>:
      20:       00 00 00 00     .word   0x00000000

// STT_OBJECT/STT_COMMON symbols
00000024 <myStr>:
      24: 74 65 73 74 20 73 74 72         test str
      2c: 69 6e 67 00                     ing.

If we extend this patch to affect non-x86 *instructions* as well:

      18: 04b09de4              ldr     r11, [sp], #4
      1c: 1eff2fe1              <unknown>

## For consistency, we should delete spaces here as well.
00000020 <$d.1>:
      20:       00 00 00 00     .word   0x00000000

## dumpELFData, unaffected.
00000024 <myStr>:
      24: 74 65 73 74 20 73 74 72         test str
      2c: 69 6e 67 00                     ing.

Note that dumpELFData is not affected.

joerg added a comment.May 3 2020, 4:44 PM

ARM's data detection only works for unstripped binaries. Other architectures often don't even have such markers at all.

psmith added a comment.May 4 2020, 8:32 AM

I don't have a strong opinion here, the spaced bytes more closely match the output of the llvm-mc which can make disassembler tests easier to write. On the other hand it can be useful to see the structure of the disassembly of instructions when the mapping symbols exist.
Arm GNU objdump does the following:

address: 12345678 <spaces> <disassembly of arm instruction
address: 1234 5678 <spaces> for 32-bit Thumb instruction made up of two half words
address: 1234 <spaces> for 16-bit Thumb instruction made up a half word
address: 12345678 <spaces> for .word 0x12345678

I don't have a lot of experience of non Arm RISC architectures though so I don't know what looks reasonable for them.

MaskRay added a subscriber: rovka.May 4 2020, 8:46 AM
mtrent added a comment.May 4 2020, 9:36 AM

ARM's data detection only works for unstripped binaries. Other architectures often don't even have such markers at all.

I agree that ideal disassembly format differs from architecture to architecture.

I also agree that this particular change is fine for x86_64.

mtrent added inline comments.May 4 2020, 9:41 AM
llvm/lib/MC/MCInstPrinter.cpp
37

Isn't this just llvm::dumpBytes with an optional 3rd parameter to suppress the First if/else test? I guess it's only a small amount of copy-pasted code...

jhenderson added a comment.EditedMay 5 2020, 12:49 AM

Hmm... I'm not convinced that this is the right thing to do. Believe it or not, there will be people out there with scripts who parse the instruction bytes in their scripts for whatever reason. I can think of some reasons where this might be slightly useful. Changing this behaviour will break things for them. I also feel like it's less readable to me, as it takes more effort to identify the individual bytes, but I acknowledge that's a personal opinion.

You said the motivation is because you want more columns to do another feature. If people want to use that feature and need narrower output, why can't they choose to opt-in (which they'd be doing for the new option) by disabling the raw instruction bytes?

Also, what about another option to control the spacing? Something like --compact, or maybe even something crazy that allows you to format individual bytes using something like a printf format specifier. A good example of this is the od tool which provides different options for width per byte.

MaskRay abandoned this revision.Jun 10 2020, 10:32 AM

Abandoning.

Will do D81590 (decrease instruction indentation for non-x86)