Page MenuHomePhabricator

RFC: Proposed change in the disassembly default format in lldb
ClosedPublic

Authored by jasonmolenda on Feb 11 2015, 9:11 PM.

Details

Reviewers
jasonmolenda
Summary

In r219544 (2014-10-10) I changed the default disassembly format to more closely resemble gdb's disassembly format. After living on this format for a few months, there are obvious shortcomings with C++ and Objective-C programs and I want to try a new approach.

Originally lldb's disassembly would display the Module & Function/Symbol name on a line by itself when a new function/symbol began, on each line of assembly display the file/load address followed by opcode, operands, and comments (e.g. showing the target of a branch insn). Branches to the same function would have a comment listing the full function name plus an offset. Note that the addresses did not display the offset, just raw addresses, meaning you had to compare the full address of the branch target with the disassembly output to find the target of the branch. When the branch target was in inlined code, lldb would print all of the inlined functions in the comment field (on separate lines).

In October I changed this to more closely resemble gdb's output: Each line has the file/load address, the function name, the offset into the function ("+35"), opcode, operand, comment. Comments pointing to the same function behaved the same but inlined functions were not included. I try to elide function argument types (e.g. from a demangled C++ name) but with templated methods it can be enormous.

This style of disassembly looks pretty good for short C function names. Like

(lldb) disass -c 20

0x7fff94fbe188 <mach_msg_trap>: movq   %rcx, %r10
0x7fff94fbe18b <mach_msg_trap+3>: movl   $0x100001f, %eax
0x7fff94fbe190 <mach_msg_trap+8>: syscall

-> 0x7fff94fbe192 <mach_msg_trap+10>: retq

0x7fff94fbe193 <mach_msg_trap+11>: nop    
0x7fff94fbe194 <mach_msg_overwrite_trap>: movq   %rcx, %r10

but as soon as you get a hefty C++ name in there, it becomes very messy:

0x107915454 <CommandObjectBreakpointList::DoExecute+68>: jne 0x1be9331 ; CommandObjectBreakpointList::DoExecute + 113 at CommandObjectBreakpoint.cpp:1420

Or, an extreme example that I found in lldb with 30 seconds of looking (function name only) -

std::1::function<std::1::shared_ptr<lldb_private::TypeSummaryImpl> (lldb_private::ValueObject&)>::function<CommandObjectTypeSummary::CommandObjectTypeSummary(lldb_private::CommandInterpreter&)::'lambda'(lldb_private::ValueObject&)>

I want to go with a hybrid approach between these two styles. When there is a new symbol, we print the full module + function name. On each assembly line, we print the file/load address, the offset into the function in angle brackets, opcode, operand, and in the comments branches to the SAME function follow the <+36> style. An example:

(lldb) disass
LLDB`CommandObjectBreakpointList::DoExecute:

0x107915410 <+0>:    pushq  %rbp
0x107915411 <+1>:    movq   %rsp, %rbp
0x107915414 <+4>:    subq   $0x170, %rsp
0x10791541b <+11>:   movq   %rdi, -0x20(%rbp)
0x10791541f <+15>:   movq   %rsi, -0x28(%rbp)
0x107915423 <+19>:   movq   %rdx, -0x30(%rbp)
0x107915427 <+23>:   movq   -0x20(%rbp), %rdx

-> 0x10791542b <+27>: movq %rdx, %rsi

0x10791542e <+30>:   movb   0x165(%rdx), %al
0x107915434 <+36>:   andb   $0x1, %al
0x107915436 <+38>:   movq   %rsi, %rdi
0x107915439 <+41>:   movzbl %al, %esi
0x10791543c <+44>:   movq   %rdx, -0xf8(%rbp)
0x107915443 <+51>:   callq  0x107d87bb0               ; lldb_private::CommandObject::GetSelectedOrDummyTarget at CommandObject.cpp:1045
0x107915448 <+56>:   movq   %rax, -0x38(%rbp)
0x10791544c <+60>:   cmpq   $0x0, -0x38(%rbp)
0x107915454 <+68>:   jne    0x107915481               ; <+113> at CommandObjectBreakpoint.cpp:1420
0x10791545a <+74>:   leaq   0xf54d21(%rip), %rsi      ; "Invalid target. No current target or breakpoints."
0x107915461 <+81>:   movq   -0x30(%rbp), %rdi
0x107915465 <+85>:   callq  0x107d93640               ; lldb_private::CommandReturnObject::AppendError at CommandReturnObject.cpp:135
0x10791546a <+90>:   movl   $0x1, %esi
0x10791546f <+95>:   movq   -0x30(%rbp), %rdi
0x107915473 <+99>:   callq  0x107d93760               ; lldb_private::CommandReturnObject::SetStatus at CommandReturnObject.cpp:172
0x107915478 <+104>:  movb   $0x1, -0x11(%rbp)
0x10791547c <+108>:  jmp    0x1079158bd               ; <+1197> at CommandObjectBreakpoint.cpp:1470
0x107915481 <+113>:  movq   -0x38(%rbp), %rdi

The main drawback for this new arrangement is that you may be looking at a long series of instructions and forget the name of the function/method. You'll need to scroll backwards to the beginning of the disassembly to find this function's names. Minor details include doing a two-pass over the instruction list to find the maximum length of the address component and padding all the lines so the opcodes line up. For instance,

(lldb) disass -c 30 -n mach_msg_trap
libsystem_kernel.dylib`mach_msg_trap:

0x7fff94fbe188 <+0>:  movq   %rcx, %r10
0x7fff94fbe18b <+3>:  movl   $0x100001f, %eax
0x7fff94fbe190 <+8>:  syscall 
0x7fff94fbe192 <+10>: retq   
0x7fff94fbe193 <+11>: nop

dyld`mach_msg_trap:

0x7fff6a867210 <+0>:  movq   %rcx, %r10
0x7fff6a867213 <+3>:  movl   $0x100001f, %eax
0x7fff6a867218 <+8>:  syscall 
0x7fff6a86721a <+10>: retq   
0x7fff6a86721b <+11>: nop

The disassembly format can be overridden by the 'disassembly-format' setting if people have specific preferences. But I think this new hybrid style of disassembly will work the best as a default given the kinds of method names we see with OO languages.

Comments? I'd like to land this in a couple days if no one feels strongly about it.

Diff Detail

Repository
rL LLVM

Event Timeline

jasonmolenda retitled this revision from to RFC: Proposed change in the disassembly default format in lldb.
jasonmolenda updated this object.
jasonmolenda edited the test plan for this revision. (Show Details)
jasonmolenda set the repository for this revision to rL LLVM.
jasonmolenda added a subscriber: Unknown Object (MLST).

Personally I prefer always seeing the full function name, even if it's
messy. As an aside, have you considered embedding the function name
directly as the operand to the jump instruction, as opposed to as a
comment? For example, instead of this:

0x107915443 <+51>: callq 0x107d87bb0 ;
lldb_private::CommandObject::GetSelectedOrDummyTarget at
CommandObject.cpp:1045

we would see this:

0x107915443 <+51>: callq lldb_private::CommandObject::G
etSelectedOrDummyTarget (0x107d87bb0) ;
[CommandObject.cpp:1045]

I find this much easier to read, because unless you're a magician who
thinks in hexadecimal, chances are you are more interested in the name than
the address, so your eyes will always have to scroll to the right to find
the function name. This way your eyes sometimes don't have to scroll to
the right.

What if this were done only for functions whose names were a certain length or longer? When the function is a reasonable length, it's nice to see it embedded directly at the call site. But maybe for functions whose name is too long, we could print some kind of magic string. I find it a little jarring to just see <+31> for example. In my head I know it's <+31> from this function, but it's still a little awkward. What about a magic identifier instead? Something like (this func) + 31. I'll use my "symbol name instead of address format" since I'm trying to sell that idea at the same time to give a few examples.

0x107915454 <+68>: jne CommandObjectBreakpointList::DoExecute + 113 ; [CommandObjectBreakpoint.cpp:1420]
0x107915454 <+68>: jne (this func) + 113 ; [CommandObjectBreakpoint.cpp:1420]
0x107915454 <+68>: jne MyClass<int>::DoSomething() + 113 ; [CommandObjectBreakpoint.cpp:1420]

In the second line, (this func) is used because the name to display is determined "too complicated" by some heuristic. I'm not sure if we care that it be deterministic or not. More often than not we prefer to see the function name inlined. So a heuristic could take into account length plus number of angle brackets, for example.

Initially I had this using "*" to mean current-function. So offsets were expressed like <*+36> but when it was all said and done, the "*" wasn't adding any information so I nixed it. If the <>'s weren't present, the * would be necessary but I think with the offset in the brackets, it's not needed. I'm not wedded to omitting the "*" but I think we're fine without it. I definitely don't want to lose the "+" and have the offsets expressed as "<30>" - that seems a little too much.

I originally was trying to think of a way to alternate the disassembly style based on the function name length. But that makes it hard to have a disassembly-format customization capability unless there was a disassembly-format-short-names and disassembly-format-long-names ;). And someone argued that it would be more confusing to users if we have multiple styles of disassembly formatting depending on the length of the function name -- people aren't going to know why the disassembler formats it one way for certain functions and a different way for a slightly longer function name function.

Well, my underlying premise is that I think we should display as much information about the function as we can without getting in the way. <+36> doesn't display any information (well, aside from the offset, as that's a given), so personally I think something like *always* choosing a display like <(this func)+36> is better than <+36>. But then you have to wonder, if the actual name of this function is shorter than the length of the string "(this func)" why not just display it? And from there, I start to question how long is too long. One way to choose might perhaps be to choose the maximum length to which you are justifying the comments, to guarantee that you never have a function's name spill over into the comments section. Then you have a deterministic, easy to understand rule about when the function name will be truncated to (this func).

BTW, do you have any thoughts on my suggestion regarding displaying the function name inline? More specifically, comparing the following two formats:

; function name embedded. Regular function vs. "too long" function
0x107915454 <+68>: jne CommandObjectBreakpointList::DoExecute + 113 (107915481) ; [CommandObjectBreakpoint.cpp:1420]
0x107915454 <+68>: jne <(this func) + 113> (107915481) ; [CommandObjectBreakpoint.cpp:1420]

; function name not embedded. Regular function vs. "too long" function
0x107915454 <+68>: jne 0x107915481 ; <+113> at CommandObjectBreakpoint.cpp:1420
0x107915454 <+68>: jne 0x107915481 ; <(this func) +113> at CommandObjectBreakpoint.cpp:1420

Using the embedded approach, you could actually provide a debugger setting such as 'disassembly-comment-column' and since the algorithm for determining whether to truncate the name would be based on whether it spills into the comment column, changing the value of the setting would allow longer function names to be displayed in an easy to understand way.

I like your suggestion of showing the comment contents as the operand instead of the raw address (or in addition to the raw address) -- but there are lots of things that can end up in the comments (and llvm can add additional things like hints about what an instruction does) so it might not make sense in every case.

I was ignoring the idea primarily because I want to stay focused on fixing the current output style.

gdb solved this too-long-name problem by having a maximum name length. If the function name was any longer than that, it was truncated. This was a poor solution IMO.

Why do people care where an instruction is in memory?

They want to know where the pc is currently
They have an address and want to know what instruction that is
They want to know an offset into the function to follow a branch instruction's behavior

And so my new format includes the pc-indicator, the address, and the offset of each instruction. I believe they don't need to see the function name on every line -- they need some way to know when one function ends and another begins (which we give them by a blank line and then a line with the new function name by itself). It would be NICE if we could include the function name but in OO programs, these are often too long to include (especially when the instruction branches to another part of that same function) without forcing a line wrap, and they add much less value than the current-pc/addr/offset information.

Isn't this proposal only for cases of jumping to somewhere else in the same function? Admittedly that's usually the most common type of jump, since it accounts for every control flow instruction.

In any case, I think my suggestion of displaying the operand to the jump instruction as either

jmp function + offset (addr) ; comment

or

jmp <(this func) + offset> (addr) ; comment

looks the best, but if you think that's outside the scope of this change, then I would at least vote for

jump addr ; <(this func) + offset>

(basically the same as your original proposal, except add some kind of word or string in the middle of <+offset> to make it easier on the eyes.

Right, I'm saying that

jmp function+offset (addr) ; comment

is a separate issue that I don't want to touch right now.

Yeah, Greg and I talked about what to put to indicate "this function". Like, literally we were thinking "this" at one point. Greg suggested a function pointer, like "(*)", and then we shortened it to just "*". But after a couple of days looking at output like

0x100032 <*+36>: jump addr ; <*+36>

we were like, you know, the "*" isn't really necessary, we're just used to seeing something there.

I'm not opposed to putting something before the + -- I'm not sure what it should be.

Also, would the 'disassembly-format' setting be to switch betwen old style and new style? Or would it just be toggle between displaying the function name or not displaying the function name? A toggle between displaying and not displaying the function name seems the best to me. This makes the behavior more straightforward and reduces the amount of code to maintain, and I think better addresses the issue at hand: Whether a line is going to be readable due to a too-long function name or not. If your original plan was for the 'disassembly-format' setting to control old style vs. new style, then I would vote for an additional option called 'disassembly-function-names' or something similar. There's plenty of times I'm debugging C code with short function names and I don't want to pay the penalty for something that isn't an issue.

In the future, I have ideas for a fully customizable disassembly format that would allow you to specify a format string with placeholders, then people could arrange the disassembly output however they wanted.

I didn't explain the disassembly-format in my proposal - it already works like you're describing. The current lldb format is

${current-pc-arrow}${addr-file-or-load}{ <${function.name-without-args}${function.concrete-only-addr-offset-no-padding}>}:

A user can put this in their ~/.lldbinit file to keep that style of output.

Oh cool, I didn't know that was a thing :). In that case, it seems all my
concerns are moot, as i can just change the default. Sorry for the noise :)

Actually, on the bike ride home I realized that one of your concerns is justified: There is no formatter for the comment field in the disassembly output. A lot of different things can appear here -- this could be an address to an instruction in the current function, in a different function, in a segment of memory with no symbol at all, and there are a bunch of special behaviors like if it's the address of const data string, lldb prints the string. And I mentioned earlier, lldb has assembler hints it can output and we've got a low-priority ToDo to expose those hints to the user -- in the comment field too.

This is a drag - my patch, as it is written today, is going to write branch instructions inside the current function as <+36> regardless of any format setting. :/

emaste added a subscriber: emaste.Feb 12 2015, 7:52 AM
jasonmolenda accepted this revision.Feb 13 2015, 3:26 PM
jasonmolenda added a reviewer: jasonmolenda.

Preparing to commit.

This revision is now accepted and ready to land.Feb 13 2015, 3:26 PM
jasonmolenda closed this revision.Feb 13 2015, 3:26 PM

Landed in r229186.

Sending include/lldb/Core/Address.h
Sending include/lldb/Core/Disassembler.h
Sending include/lldb/Core/FormatEntity.h
Sending include/lldb/Symbol/SymbolContext.h
Sending source/API/SBInstruction.cpp
Sending source/API/SBInstructionList.cpp
Sending source/Breakpoint/BreakpointLocation.cpp
Sending source/Commands/CommandObjectSource.cpp
Sending source/Core/Address.cpp
Sending source/Core/Debugger.cpp
Sending source/Core/Disassembler.cpp
Sending source/Core/FormatEntity.cpp
Sending source/Plugins/Disassembler/llvm/DisassemblerLLVMC.cpp
Sending source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp
Sending source/Symbol/SymbolContext.cpp
Sending source/Symbol/Variable.cpp
Sending source/Target/StackFrame.cpp
Sending source/Target/ThreadPlanTracer.cpp
Sending test/functionalities/abbreviation/TestAbbreviations.py
Sending test/functionalities/inferior-assert/TestInferiorAssert.py
Transmitting file data ....................
Committed revision 229186.