Implement DAP (Debug Adapter Protocol) for Disassemble Requests.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Avoid tracking stack frame boundaries. It is unnecessary and makes the disassembler functionality not work in multi-threaded programs.
Sorry for delay, I was on recharge for all of December.
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | This will work if the min and max opcode byte size are the same, like for arm64, the min and max are 4. This won't work for x86 or arm32 in thumb mode. So when backing up, we need to do an address lookup on the address we think we want to go to, and then adjust the starting address accordingly. Something like: SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target); now we have a section offset address that can tell us more about what it is. We can find the SBFunction or SBSymbol for this address and use those to find the right instructions. This will allow us to correctly disassemble code bytes. We can also look at the section that the memory comes from and see what the section contains. If the section is data, then emit something like: 0x00001000 .byte 0x23 0x00001001 .byte 0x34 ... To find the section type we can do: SBSection section = start_sbaddr.GetSection(); if (section.IsValid() && section.GetSectionType() == lldb::eSectionTypeCode) { // Disassemble from a valid boundary } else { // Emit a byte or long at a time with ".byte 0xXX" or other ASM directive for binary data } We need to ensure we start disassembling on the correct instruction boundary as well as our math for "start_addr" might be in between actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then we know we have instructions, and if we are not, then we can emit ".byte" or other binary data directives. So if we do have lldb::eSectionTypeCode as our section type, then we should have a function or symbol, and we can get instructions from those objects easily: if (section.IsValid() && section.GetSectionType() == lldb::eSectionTypeCode) { lldb::SBInstructionList instructions; lldb::SBFunction function = start_sbaddr.GetFunction(); if (function.IsValid()) { instructions = function.GetInstructions(g_vsc.target); } else { symbol = start_sbaddr.GetSymbol(); if (symbol.IsValid()) instructions = symbol.GetInstructions(g_vsc.target); } const size_t num_instrs = instructions.GetSize(); if (num_instrs > 0) { // we found instructions from a function or symbol and we need to // find the matching instruction that we want to start from by iterating // over the instructions and finding the right address size_t matching_idx = num_instrs; // Invalid index for (size_t i=0; i<num_instrs; ++i) { lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i); if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) { matching_idx = i; break; } } if (matching_idx < num_instrs) { // Now we can print the instructions from [matching_idx, num_instrs) // then we need to repeat the search for the next function or symbol. // note there may be bytes between functions or symbols which we can disassemble // by calling _get_instructions_from_memory(...) but we must find the next // symbol or function boundary and get back on track } | |
2196 | Remove any and all printf, or fprintf statements. You can't print anything to stderr or stdout as this is where the DAP packets are get emitted to. We do make it so this won't affect lldb-vscode by doing some magic with the STDOUT/STDERR file handles, but this output will be sent to /dev/null most likely. You can print something to a console (using "g_vsc.SendOutput(...)" is one way). | |
2246 |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | Sorry, I should have provided a proper explanation. I use the maximum instruction size as the "worst case". Basically, I need to read a portion of memory but I do not know the start address and the size. For the start address, if I want to read N instructions before base_addr I need to read at least starting from base_addr - N * max_instruction_size: if all instructions are of size max_instruction_size I will read exactly N instructions; otherwise I will read more than N instructions and prune the additional ones afterwards. Same for applies for the size. Since start_addr is based on a "worst case", it may be an address in the middle of an instruction. In that case, that first instruction will be misinterpreted, but I think that is negligible. The logic is similar to what is implemented in other VS Code extensions, like https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134. Does it make sense? |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2196 | I suppose I have to replace llvm::errs() too, right? |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | The issue is, you might end up backing up by N bytes, and you might not end up on an opcode boundary. Lets say you have x86 disassembly like: 0x100002e37 <+183>: 48 8b 4d f8 movq -0x8(%rbp), %rcx 0x100002e3b <+187>: 48 39 c8 cmpq %rcx, %rax 0x100002e3e <+190>: 0f 85 09 00 00 00 jne 0x100002e4d ; <+205> at main.cpp 0x100002e44 <+196>: 8b 45 94 movl -0x6c(%rbp), %eax 0x100002e47 <+199>: 48 83 c4 70 addq $0x70, %rsp 0x100002e4b <+203>: 5d popq %rbp 0x100002e4c <+204>: c3 retq 0x100002e4d <+205>: e8 7e 0f 00 00 callq 0x100003dd0 ; symbol stub for: __stack_chk_fail 0x100002e52 <+210>: 0f 0b ud2 Let's say you started with the address 0x100002e4c, and backed up by the max opcode size of 15 for x86_64, that would be 0x100002e3d. You would start disassembling on a non opcode boundary as this is the last byte of the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly incorrectly. So we need to make use of the functions or symbol boundaries to ensure we are disassembling correctly. If we have no function or symbol, we can do our best. But as you can see we would get things completely wrong in this case and we need to fix this as detailed. | |
2196 | yes! the main issue is, will the user expect to see this output in the console and will it make sense to the user. I don't know what the user will think if they see "current line not found in disassembled instructions" in the debug console. That goes for all output to the console. It will have to make sense to the user. I don't know if the user will care and or be able to do anything about this message. It also isn't prefixed with a "warning:" or "error:". I would vote to remove it. |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | Actually, I was mislead by the fact that so far I tried on both:
Without going into sections and symbols as you were proposing, the solution can be as easy as done in this VS Code Extension: https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51. I'll update the code and put some additional comments to clarify the logic. Let me know what you think. |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2196 | Right, it is just a print to debug, is there anything else I can use for that purpose? |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2196 | Probably better to remove it rather than prefixing the message with "debug" |
I haven't followed the lldb-vscode codebase closely enough to have useful comments on that part of the code.
But so far as I can tell, you didn't add a test for SBTarget::GetMaximumOpcodeByteSize, so you should add that before this goes in. Maybe it was folded into the lldb-vscode tests and I missed it, but even so this should have a direct test; I don't think we should require people to build lldb-vscode to test general features of lldb.
That looks fine. I'm removing my objection, but that's just to the SB API parts, I'm not commenting on the vscode part.
Are you planning on updating the reverse disassembly code still on this patch?
lldb/include/lldb/API/SBTarget.h | ||
---|---|---|
844 | Do we still need this API change with the new approach? We can't really use this unless we include GetMinimumOpcodeByteSize() and only use this value to backup if the min and max are the same. | |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
2185–2196 | Use C++ comments instead of C style comments. |
lldb/include/lldb/API/SBTarget.h | ||
---|---|---|
844 | In principle we could assume a maximum opcode of size 16 in the calculation, but knowing the maximum actually supported by the architecture can save us from reading unnecessary chunks of memory. In this way, we avoid reading instructions that will then be discarded because resulting in more than instructionCount total instructions. |
If you mean setting breakpoints and stepping through disassembled instructions, that's outside the purpose of this initial patch
I mean we can not just subtract something, any number, from any address unless we have fixed size opcodes. If we do this for x86, you can get complete garbage with no hope of ever getting back on track and this disassembly just won't make sense at all and will be useless. I thought my x86 example spelled out why it is bad to backup. If we start disassembling in the middle of an opcode, we can attempt to disassemble immediate values that are encoded into the middle of an opcode. Since x86 instructions can be 1 byte to 15 bytes, we might disassembly garbage and never align up to real opcodes.
So I see a few solutions:
- use functions and symbols to get actual boundaries, and use sections to detect when there are instruction (code) and not
- check if min and max opcode sizes are the same and only try to backup if they are the same, but still use sections to disassembly data as .byte or .long directives
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 |
| |
2179 | @clayborg what I wrote in the comment before this one is what I implemented (Diff 6), do you think that is not enough? I'll try it on a x86 linux machine to make sure it works there |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | Just checked out your changes, and you are still just subtracting a value from the start address and attempting to disassemble from memory which is the problem. We need to take that subtracted address, and look it up as suggested in previous code examples I posted. If you find a function to symbol, ask those objects for their instructions. and then try to use those. But basically for _any_ disassembly this is what I would recommend doing:
0x1000 .byte 0x12 0x1000 .byte 0x34 ... Now for the disassemble_code function. We know the address range for this is in code. We then need to resolve the address passed to "disassemble_code" into a SBAddress and ask that address for a SBFunction or SBSymbol as I mentioned. Then we ask the SBFunction or SBSymbol for all instructions that they contain, and then use any instructions that fall into the range we have. If there is no SBFunction or SBSymbol, then disassemble an instruction at a time and then see if the new address will resolve to a function or symbol. |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | Tried my changes on a linux x86 machine and the loop for (unsigned i = 0; i < max_instruction_size; i++) { (L2190) takes care of the start_address possibly being in the middle of an instruction, so that's not a problem. The problem I faced is that it tries to read too far from base_addr and the ReadMemory() operation returns few instructions (without reaching base_addr). That was not happening on my macOS M1 (arm) machine. To solve, I changed the loop at L2190 to for (unsigned i = 0; i < bytes_offset; i++) { auto sb_instructions = _get_instructions_from_memory(start_addr + i, disassemble_bytes); and if start_addr is in sb_instructions we're done and can exit the loop. That worked. Another similar thing that can be done is to start from start_sbaddr as you were saying, increment the address until a valid section is found. Then call _get_instructions_from_memory() passing the section start. |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | so your for (unsigned i = 0; i < max_instruction_size; i++) { disassembles and tries to make sure you make it back to the original base address from the original disassemble packet? That can work but could a a bit time consuming? The main issue, as you saw on x86, is you don't know what is in memory. You could have unreadable areas of memory when trying to disassemble. Also if you do have good memory that does contain instructions, there can be padding in between functions or even data between functions that the function accesses that can't be correctly disassembled and could throw things off again. The memory regions are the safest way to traverse memory to see what you have and would help you deal with holes in the memory. You can ask about a memory region with: lldb::SBError SBProcess::GetMemoryRegionInfo(lldb::addr_t load_addr, lldb::SBMemoryRegionInfo ®ion_info); If you ask about an invalid address, like zero on most platforms, it will return a valid "region_info" with the bounds of the unreadable address range filled in no read/write/execute permissions: (lldb) script Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D. >>> region = lldb.SBMemoryRegionInfo() >>> err = lldb.process.GetMemoryRegionInfo(0, region) [0x0000000000000000-0x0000000100000000 ---] So you could use the memory region info to your advantage here. If you have execute permissions, disassemble as instructions, and if you don't emit ".byte 0xXX" for each byte. If there are no permissions, you can emit some other string like "0x00000000: <invalid memory>". That being said, even when you do find an executable section of memory, there can be different stuff there even if a section _is_ executable. For instance, if we ask about the next memory region on a mac M1: >>> err = lldb.process.GetMemoryRegionInfo(0x0000000100000000, region) >>> print(region) [0x0000000100000000-0x0000000100004000 R-X] Notice that this is read + execute (you can access these via: region.IsReadable() region.IsWritable() region.IsExecutable() But at this address, this is the mach-o header which doesn't make sense to try and disassemble: (lldb) memory read 0x0000000100000000 0x100000000: cf fa ed fe 0c 00 00 01 00 00 00 00 02 00 00 00 ................ 0x100000010: 11 00 00 00 18 03 00 00 85 00 20 00 00 00 00 00 .......... ..... (lldb) memory read -fx -s4 -c 4 0x0000000100000000 0x100000000: 0xfeedfacf 0x0100000c 0x00000000 0x00000002 0xfeedfacf is the mach-o magic bytes for little endian 64 bit mach-o files. (lldb) disassemble --start-address 0x0000000100000000 a.out`_mh_execute_header: 0x100000000 <+0>: .long 0xfeedfacf ; unknown opcode 0x100000004 <+4>: .long 0x0100000c ; unknown opcode 0x100000008 <+8>: udf #0x0 0x10000000c <+12>: udf #0x2 0x100000010 <+16>: udf #0x11 0x100000014 <+20>: udf #0x318 0x100000018 <+24>: .long 0x00200085 ; unknown opcode 0x10000001c <+28>: udf #0x0 So this is the main reason why I would suggest just disassembling using .byte or .long when we aren't in a function or symbol. |
lldb/tools/lldb-vscode/lldb-vscode.cpp | ||
---|---|---|
2179 | understood, so in that case we also have to do the same for _handle_disassemble_positive_offset, because even there, when we use ReadInstructions(), we could have some unreadable areas in the middle and ReadInstructions() would stop prematurely when running into one of them, resulting in less instructions returned than expected |
Do we still need this API change with the new approach? We can't really use this unless we include GetMinimumOpcodeByteSize() and only use this value to backup if the min and max are the same.