This is an archive of the discontinued LLVM Phabricator instance.

[lldb-vscode] Add support for disassembly view
Needs ReviewPublic

Authored by eloparco on Dec 19 2022, 5:31 PM.

Details

Summary

Implement DAP (Debug Adapter Protocol) for Disassemble Requests.

Diff Detail

Event Timeline

eloparco created this revision.Dec 19 2022, 5:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 19 2022, 5:31 PM
eloparco requested review of this revision.Dec 19 2022, 5:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 19 2022, 5:31 PM
eloparco added a comment.EditedDec 20 2022, 2:40 AM

That's how it currently looks like:

eloparco updated this revision to Diff 484579.Dec 21 2022, 7:27 AM

Avoid tracking stack frame boundaries. It is unnecessary and makes the disassembler functionality not work in multi-threaded programs.

@JDevlieghere let me know if there's anything I can do to make the review easier

Sorry for delay, I was on recharge for all of December.

clayborg requested changes to this revision.Jan 4 2023, 6:01 PM

Sorry for delay, I was on recharge for all of December.

lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

This will work if the min and max opcode byte size are the same, like for arm64, the min and max are 4. This won't work for x86 or arm32 in thumb mode. So when backing up, we need to do an address lookup on the address we think we want to go to, and then adjust the starting address accordingly. Something like:

SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target);

now we have a section offset address that can tell us more about what it is. We can find the SBFunction or SBSymbol for this address and use those to find the right instructions. This will allow us to correctly disassemble code bytes.

We can also look at the section that the memory comes from and see what the section contains. If the section is data, then emit something like:

0x00001000 .byte 0x23
0x00001001 .byte 0x34
...

To find the section type we can do:

SBSection section = start_sbaddr.GetSection();
if (section.IsValid() && section.GetSectionType() == lldb::eSectionTypeCode) {
 // Disassemble from a valid boundary
} else {
  // Emit a byte or long at a time with ".byte 0xXX" or other ASM directive for binary data
}

We need to ensure we start disassembling on the correct instruction boundary as well as our math for "start_addr" might be in between actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then we know we have instructions, and if we are not, then we can emit ".byte" or other binary data directives. So if we do have lldb::eSectionTypeCode as our section type, then we should have a function or symbol, and we can get instructions from those objects easily:

if (section.IsValid() && section.GetSectionType() == lldb::eSectionTypeCode) {
 lldb::SBInstructionList instructions;
 lldb::SBFunction function = start_sbaddr.GetFunction();
 if (function.IsValid()) {
    instructions = function.GetInstructions(g_vsc.target);
 } else {
    symbol = start_sbaddr.GetSymbol();
    if (symbol.IsValid())
      instructions = symbol.GetInstructions(g_vsc.target);
}
const size_t num_instrs = instructions.GetSize();
if (num_instrs > 0) {
  // we found instructions from a function or symbol and we need to 
  // find the matching instruction that we want to start from by iterating
  // over the instructions and finding the right address
  size_t matching_idx = num_instrs; // Invalid index
  for (size_t i=0; i<num_instrs; ++i) {
    lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i);
    if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) {
      matching_idx = i;
      break;
    }
  }
  if (matching_idx < num_instrs) {
    // Now we can print the instructions from [matching_idx, num_instrs)
    // then we need to repeat the search for the next function or symbol. 
    // note there may be bytes between functions or symbols which we can disassemble
    // by calling _get_instructions_from_memory(...) but we must find the next
    // symbol or function boundary and get back on track
  }
2194

Remove any and all printf, or fprintf statements. You can't print anything to stderr or stdout as this is where the DAP packets are get emitted to. We do make it so this won't affect lldb-vscode by doing some magic with the STDOUT/STDERR file handles, but this output will be sent to /dev/null most likely. You can print something to a console (using "g_vsc.SendOutput(...)" is one way).

2244
This revision now requires changes to proceed.Jan 4 2023, 6:01 PM
eloparco added inline comments.Jan 5 2023, 2:32 AM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

Sorry, I should have provided a proper explanation.

I use the maximum instruction size as the "worst case". Basically, I need to read a portion of memory but I do not know the start address and the size. For the start address, if I want to read N instructions before base_addr I need to read at least starting from base_addr - N * max_instruction_size: if all instructions are of size max_instruction_size I will read exactly N instructions; otherwise I will read more than N instructions and prune the additional ones afterwards. Same for applies for the size.

Since start_addr is based on a "worst case", it may be an address in the middle of an instruction. In that case, that first instruction will be misinterpreted, but I think that is negligible.

The logic is similar to what is implemented in other VS Code extensions, like https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134.

Does it make sense?

eloparco added inline comments.Jan 5 2023, 2:50 AM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2194

I suppose I have to replace llvm::errs() too, right?

eloparco updated this revision to Diff 486514.Jan 5 2023, 3:06 AM

Remove printf usage

eloparco marked 2 inline comments as done.Jan 5 2023, 3:07 AM
eloparco updated this revision to Diff 486553.Jan 5 2023, 6:39 AM

Add integration tests for disassemble request

eloparco updated this revision to Diff 486554.Jan 5 2023, 6:42 AM

Add integration tests for disassemble request

clayborg added inline comments.Jan 5 2023, 3:37 PM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

The issue is, you might end up backing up by N bytes, and you might not end up on an opcode boundary. Lets say you have x86 disassembly like:

0x100002e37 <+183>: 48 8b 4d f8              movq   -0x8(%rbp), %rcx
0x100002e3b <+187>: 48 39 c8                 cmpq   %rcx, %rax
0x100002e3e <+190>: 0f 85 09 00 00 00        jne    0x100002e4d               ; <+205> at main.cpp
0x100002e44 <+196>: 8b 45 94                 movl   -0x6c(%rbp), %eax
0x100002e47 <+199>: 48 83 c4 70              addq   $0x70, %rsp
0x100002e4b <+203>: 5d                       popq   %rbp
0x100002e4c <+204>: c3                       retq   
0x100002e4d <+205>: e8 7e 0f 00 00           callq  0x100003dd0               ; symbol stub for: __stack_chk_fail
0x100002e52 <+210>: 0f 0b                    ud2

Let's say you started with the address 0x100002e4c, and backed up by the max opcode size of 15 for x86_64, that would be 0x100002e3d. You would start disassembling on a non opcode boundary as this is the last byte of the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly incorrectly. So we need to make use of the functions or symbol boundaries to ensure we are disassembling correctly. If we have no function or symbol, we can do our best. But as you can see we would get things completely wrong in this case and we need to fix this as detailed.

2194

yes! the main issue is, will the user expect to see this output in the console and will it make sense to the user. I don't know what the user will think if they see "current line not found in disassembled instructions" in the debug console. That goes for all output to the console. It will have to make sense to the user. I don't know if the user will care and or be able to do anything about this message. It also isn't prefixed with a "warning:" or "error:". I would vote to remove it.

eloparco added inline comments.Jan 6 2023, 1:50 AM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

Actually, I was mislead by the fact that so far I tried on both:

  • my ARM machine, with instructions of fixed size (4)
  • with WAMR disassembling to WASM, where, when the start address is in the middle of an instruction, only that first instruction is misinterpreted

Without going into sections and symbols as you were proposing, the solution can be as easy as done in this VS Code Extension: https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.

I'll update the code and put some additional comments to clarify the logic. Let me know what you think.

eloparco added inline comments.Jan 6 2023, 1:52 AM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2194

Right, it is just a print to debug, is there anything else I can use for that purpose?
Otherwise I'll just get rid of it as you were saying.

eloparco added inline comments.Jan 6 2023, 2:05 AM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2194

Probably better to remove it rather than prefixing the message with "debug"

eloparco updated this revision to Diff 486794.Jan 6 2023, 3:12 AM

Fix disassemble on variable-length instructions and disassemble with positive offset

jingham requested changes to this revision.EditedJan 10 2023, 3:43 PM

I haven't followed the lldb-vscode codebase closely enough to have useful comments on that part of the code.

But so far as I can tell, you didn't add a test for SBTarget::GetMaximumOpcodeByteSize, so you should add that before this goes in. Maybe it was folded into the lldb-vscode tests and I missed it, but even so this should have a direct test; I don't think we should require people to build lldb-vscode to test general features of lldb.

This revision now requires changes to proceed.Jan 10 2023, 3:43 PM
eloparco updated this revision to Diff 488117.Jan 11 2023, 2:05 AM

Add binding and test for SBTarget::GetMaximumOpcodeByteSize()

@jingham I added the test that was missing

jingham accepted this revision.Jan 11 2023, 11:27 AM

That looks fine. I'm removing my objection, but that's just to the SB API parts, I'm not commenting on the vscode part.

Are you planning on updating the reverse disassembly code still on this patch?

lldb/include/lldb/API/SBTarget.h
844

Do we still need this API change with the new approach? We can't really use this unless we include GetMinimumOpcodeByteSize() and only use this value to backup if the min and max are the same.

lldb/tools/lldb-vscode/lldb-vscode.cpp
2183–2194

Use C++ comments instead of C style comments.

Are you planning on updating the reverse disassembly code still on this patch?

What do you mean by "reverse disassembly"?

Are you planning on updating the reverse disassembly code still on this patch?

What do you mean by "reverse disassembly"?

I mean when we try and back up by some offset.

eloparco added inline comments.Jan 11 2023, 3:50 PM
lldb/include/lldb/API/SBTarget.h
844

In principle we could assume a maximum opcode of size 16 in the calculation, but knowing the maximum actually supported by the architecture can save us from reading unnecessary chunks of memory. In this way, we avoid reading instructions that will then be discarded because resulting in more than instructionCount total instructions.

Are you planning on updating the reverse disassembly code still on this patch?

What do you mean by "reverse disassembly"?

I mean when we try and back up by some offset.

Sorry, still not clear. So probably not intended for this patch :)

eloparco updated this revision to Diff 488415.Jan 11 2023, 4:04 PM

Use single-line comments

eloparco marked an inline comment as done.Jan 11 2023, 4:05 PM

Are you planning on updating the reverse disassembly code still on this patch?

What do you mean by "reverse disassembly"?

I mean when we try and back up by some offset.

Sorry, still not clear. So probably not intended for this patch :)

If you mean setting breakpoints and stepping through disassembled instructions, that's outside the purpose of this initial patch

I mean we can not just subtract something, any number, from any address unless we have fixed size opcodes. If we do this for x86, you can get complete garbage with no hope of ever getting back on track and this disassembly just won't make sense at all and will be useless. I thought my x86 example spelled out why it is bad to backup. If we start disassembling in the middle of an opcode, we can attempt to disassemble immediate values that are encoded into the middle of an opcode. Since x86 instructions can be 1 byte to 15 bytes, we might disassembly garbage and never align up to real opcodes.

So I see a few solutions:

  • use functions and symbols to get actual boundaries, and use sections to detect when there are instruction (code) and not
  • check if min and max opcode sizes are the same and only try to backup if they are the same, but still use sections to disassembly data as .byte or .long directives
eloparco added inline comments.Jan 11 2023, 4:46 PM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

Actually, I was mislead by the fact that so far I tried on both:

  • my ARM machine, with instructions of fixed size (4)
  • with WAMR disassembling to WASM, where, when the start address is in the middle of an instruction, only that first instruction is misinterpreted

Without going into sections and symbols as you were proposing, the solution can be as easy as done in this VS Code Extension: https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.

I'll update the code and put some additional comments to clarify the logic. Let me know what you think.

2177

@clayborg what I wrote in the comment before this one is what I implemented (Diff 6), do you think that is not enough? I'll try it on a x86 linux machine to make sure it works there

clayborg added inline comments.Jan 11 2023, 5:13 PM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

Just checked out your changes, and you are still just subtracting a value from the start address and attempting to disassemble from memory which is the problem. We need to take that subtracted address, and look it up as suggested in previous code examples I posted. If you find a function to symbol, ask those objects for their instructions. and then try to use those.

But basically for _any_ disassembly this is what I would recommend doing:

  • first resolve the "start_address" (no matter how you come up the address) that want to disassemble into a SBAddress
  • check its section. If the section is valid and contains instructions, call a function that will disassemble the address range for the section that starts at "start_address" and ends at the end of the section. We can call this "disassemble_code" as a function. More details on this below
  • If the section does not contain instructions, just read the bytes and emit a lines like:
0x1000 .byte 0x12
0x1000 .byte 0x34
...

Now for the disassemble_code function. We know the address range for this is in code. We then need to resolve the address passed to "disassemble_code" into a SBAddress and ask that address for a SBFunction or SBSymbol as I mentioned. Then we ask the SBFunction or SBSymbol for all instructions that they contain, and then use any instructions that fall into the range we have. If there is no SBFunction or SBSymbol, then disassemble an instruction at a time and then see if the new address will resolve to a function or symbol.

eloparco added inline comments.Jan 15 2023, 4:20 PM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

Tried my changes on a linux x86 machine and the loop for (unsigned i = 0; i < max_instruction_size; i++) { (L2190) takes care of the start_address possibly being in the middle of an instruction, so that's not a problem. The problem I faced is that it tries to read too far from base_addr and the ReadMemory() operation returns few instructions (without reaching base_addr). That was not happening on my macOS M1 (arm) machine.

To solve, I changed the loop at L2190 to

for (unsigned i = 0; i < bytes_offset; i++) {
    auto sb_instructions =
        _get_instructions_from_memory(start_addr + i, disassemble_bytes);

and if start_addr is in sb_instructions we're done and can exit the loop. That worked.

Another similar thing that can be done is to start from start_sbaddr as you were saying, increment the address until a valid section is found. Then call _get_instructions_from_memory() passing the section start.
What do you think? Delegating the disassembling to ReadMemory() + GetInstructions() looks simpler to me than to manually iterate over sections and get instructions from symbols and functions.
Is there any shortcoming I'm not seeing?

clayborg added inline comments.Jan 16 2023, 1:34 PM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

so your

for (unsigned i = 0; i < max_instruction_size; i++) {

disassembles and tries to make sure you make it back to the original base address from the original disassemble packet? That can work but could a a bit time consuming?

The main issue, as you saw on x86, is you don't know what is in memory. You could have unreadable areas of memory when trying to disassemble. Also if you do have good memory that does contain instructions, there can be padding in between functions or even data between functions that the function accesses that can't be correctly disassembled and could throw things off again.

The memory regions are the safest way to traverse memory to see what you have and would help you deal with holes in the memory. You can ask about a memory region with:

lldb::SBError SBProcess::GetMemoryRegionInfo(lldb::addr_t load_addr, lldb::SBMemoryRegionInfo &region_info);

If you ask about an invalid address, like zero on most platforms, it will return a valid "region_info" with the bounds of the unreadable address range filled in no read/write/execute permissions:

(lldb) script
Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
>>> region = lldb.SBMemoryRegionInfo()
>>> err = lldb.process.GetMemoryRegionInfo(0, region)
[0x0000000000000000-0x0000000100000000 ---]

So you could use the memory region info to your advantage here. If you have execute permissions, disassemble as instructions, and if you don't emit ".byte 0xXX" for each byte. If there are no permissions, you can emit some other string like "0x00000000: <invalid memory>".

That being said, even when you do find an executable section of memory, there can be different stuff there even if a section _is_ executable. For instance, if we ask about the next memory region on a mac M1:

>>> err = lldb.process.GetMemoryRegionInfo(0x0000000100000000, region)
>>> print(region)
[0x0000000100000000-0x0000000100004000 R-X]

Notice that this is read + execute (you can access these via:

region.IsReadable()
region.IsWritable()
region.IsExecutable()

But at this address, this is the mach-o header which doesn't make sense to try and disassemble:

(lldb) memory read 0x0000000100000000
0x100000000: cf fa ed fe 0c 00 00 01 00 00 00 00 02 00 00 00  ................
0x100000010: 11 00 00 00 18 03 00 00 85 00 20 00 00 00 00 00  .......... .....
(lldb) memory read -fx -s4 -c 4 0x0000000100000000
0x100000000: 0xfeedfacf 0x0100000c 0x00000000 0x00000002

0xfeedfacf is the mach-o magic bytes for little endian 64 bit mach-o files.

(lldb) disassemble --start-address 0x0000000100000000
a.out`_mh_execute_header:
    0x100000000 <+0>:  .long  0xfeedfacf                ; unknown opcode
    0x100000004 <+4>:  .long  0x0100000c                ; unknown opcode
    0x100000008 <+8>:  udf    #0x0
    0x10000000c <+12>: udf    #0x2
    0x100000010 <+16>: udf    #0x11
    0x100000014 <+20>: udf    #0x318
    0x100000018 <+24>: .long  0x00200085                ; unknown opcode
    0x10000001c <+28>: udf    #0x0

So this is the main reason why I would suggest just disassembling using .byte or .long when we aren't in a function or symbol.

eloparco added inline comments.Jan 17 2023, 1:15 AM
lldb/tools/lldb-vscode/lldb-vscode.cpp
2177

understood, so in that case we also have to do the same for _handle_disassemble_positive_offset, because even there, when we use ReadInstructions(), we could have some unreadable areas in the middle and ReadInstructions() would stop prematurely when running into one of them, resulting in less instructions returned than expected