This is an archive of the discontinued LLVM Phabricator instance.

Disassemble AArch64 pc-relative address calculations & symbolicate
ClosedPublic

Authored by jasonmolenda on Jul 31 2021, 12:53 AM.

Details

Summary

lldb's disassembler for arm64 / aarch64 doesn't recognize the ADRP+ADD instruction pair often used to calculate pc-relative addresses, and symbolicate what is being pointed to, making it hard to understand the assembly; doing the computation manually is annoying. This patch adds a bit of arm64-specific knowledge to the disassembly symbolicator to remember the state and compute the address being specified when the target is aarch64.

I'm fixing the other thing that I dislike the most about lldb's arm64 disassembly in the llvm aarch64 instruction printer, via https://reviews.llvm.org/D107196 .

Diff Detail

Event Timeline

jasonmolenda created this revision.Jul 31 2021, 12:53 AM
jasonmolenda requested review of this revision.Jul 31 2021, 12:53 AM

The test is with MachO but I assume this would apply to ELF, COFF etc, since it's a general disassembly feature?

lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp
1315

Add aarch64_be here.

Triple does have a isAArch64 which does this but I think you just get the ArchType enum here so you can't use it which is a shame.

1325

There is a 32 bit variant of ADD with W registers, I assume LLVMDisassembler_ReferenceType_In_ARM64_ADDXri is specific to the X version though, correct?

Which saves you checking the "sf" field later. (plus I don't know that any compiler would bother using the W version for this sort of thing anyway)

1328

Add comments to these two lines saying what they're checking for.

(looks correct from what the ARMARM says)

1332–1335

Comments here too. The second one is sign extension, correct?

1339

"sh" is a single bit field so you could just:

if ((addrxi_inst >> 22) & 1)
1342

ditto on the comments, but the math seems fine.

1347

This method is pretty long as it is, does it make sense to make the aarch64 specific part of it another private method?

lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.h
77

Just to check I understand, these are member vars because SymbolLookup will first be called on the adrp. So you write these values, then it's called again for the add, which is where you use them?

If so I would add a comment along those lines.

lldb/test/API/functionalities/disassemble/aarch64-adrp-add/TestAArch64AdrpAdd.py
36

You could break after finding it.

37

if not found_hi_string and...

42

A message would help here, like "Did not find \"HI\" string in disassembly.".

Of course if you want to debug it the logging is actually going to help but just for anyone seeing it as a random failure in a general test suite run.

Thanks so much for taking the time to read the patch over and make suggestions David. I cribbed the bit masking / manip from MachODump.cpp's SymbolizerSymbolLookUp but you're right about it being better with at least a little bit of comments about what is going on there.

Yeah, this is a very aarch64 specific thing in the generic symbolizer, but it's not an especially large function so I didn't feel too bad about it. If we started needing to handle multi-instruction sequences like this in the symbolizer (where we save state from previous instructions to determine a result), then this definitely would not scale well.

I've written it so that I only recognize the pattern with ADRP is immediately followed by an ADD, that was maybe a choice I could have gone either way on. I clear the saved ADRP instruction when it's not followed by that ADD in the same instruction. I didn't want to mis-calculate an address if the instruction pair using the same register were in two different basic blocks or something.

lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp
1325

Ah, good point. I did a quick check and clang currently generates the same adrp+add with an x register, and then ANDs the result with 0xffffffff for our "arm64_32" target (ILP32 codegen that executes in the lower 4GB of virtual address space but uses the aarch64 ISA), so it works out.

Yeah, this is a very aarch64 specific thing in the generic symbolizer, but it's not an especially large function so I didn't feel too bad about it. If we started needing to handle multi-instruction sequences like this in the symbolizer (where we save state from previous instructions to determine a result), then this definitely would not scale well.

Yeah I mainly mentioned it because of vague memories of some other function that grew a whole bunch of these. Fine for now.

(And in general I like the change, we get this sort of request for the disassembly tools a lot but haven't really thought about it for lldb)

I've written it so that I only recognize the pattern with ADRP is immediately followed by an ADD, that was maybe a choice I could have gone either way on.

That bit makes sense to me. I suppose instruction scheduling might separate them but this probably hits 99% of them.

jasonmolenda added a reviewer: DavidSpickett.

Update patch to address David's suggestion that the bit masking & testing needed some comments to make it possible to follow. In the process of doing that, I noticed that MachODump.cpp's SymbolizerSymbolLookUp, where I cribbed the bit stuff from, doesn't handle an ADRP referring to a 4k block before $pc; it doesn't sign extend the page imm value correctly from the ADRP. I updated the test case to have a symbol 4k earlier than $pc so lldb has to compute this correctly to symbolicate the address, and added both an arm64 and arm64_32 yaml encoded copy of the binary.

Ah, add one more comment that David suggested.

jasonmolenda added inline comments.Aug 4 2021, 11:48 PM
lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp
1328

There's a comment right before the conditional expr which says what it's checking. The check that m_adrp_insn is actually an adrp instruction is unnecessary; I cribbed that from the MachODump file, but the only way I set the ivar to the instruction is if we had type_ptr==LLVMDisassembler_ReferenceType_In_ARM64_ADRP in the previous instruction.

1332–1335

Yeah, it was being done incorrectly.

1339

Yeah, I rewrote it the way you suggest. I don't know why it was written that way in MachODump.cpp (incorrectly), but also why would the sh bit be set on the ADD when that just shifts the value up 12 bits like the ADRP instruction did already. It's like a less rangey version of ADRP as soon as the sh bit is set.

DavidSpickett added inline comments.Aug 6 2021, 5:46 AM
lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp
1328

The check that m_adrp_insn is actually an adrp instruction is unnecessary

In that we could have an Optional<uint32_t> and assume if set, it's an adrp. But this is functionally the same.

Comment bikeshedding, put the conditions in the same order as the text? So readers know that 0x1f is the register?
(though if you were questioning this code the first port of call would be the arm manuals but regardless)

// If previous instruction was ADRP and this is ADD, and it's to
// the same register, this is pc-relative address calculation.
if (m_adrp_address == pc - 4 &&
    (m_adrp_insn & 0x9f000000) == 0x90000000 && 
    *type_ptr == LLVMDisassembler_ReferenceType_In_ARM64_ADDXri &&
    (m_adrp_insn & 0x1f) == ((value >> 5) & 0x1f)) {
lldb/test/API/functionalities/disassemble/aarch64-adrp-add/TestAArch64AdrpAdd.py
16–28

Could move most of this into a utility fn too.

30

no_debug_info_test here?

61

Shouldn't this be within the if Trace?

lldb/test/API/functionalities/disassemble/aarch64-adrp-add/main.c
261

I admit a bit of a "wat" moment here, but I assume this is setting things up so that foo is some distance behind (at a lower address) than main?
So we have foo, which is before main and HI which is after for a negative, positive offset.

Could do with a comment for how many instructions this is. Maybe you could #define something for maybe 64 nops at a time just to make it a little less verbose.

Matt added a subscriber: Matt.Aug 10 2021, 1:26 PM

Address David's second review comments -- explained the unusual formatting of the main.c in the test case, change the m_adrp_insn to be an Optional instead of using 0 as a "not-set" value. I'll prob land this tomorrow unless someone has additional feedback, I think it's in reasonable shape.

This revision is now accepted and ready to land.Aug 12 2021, 1:13 AM
This revision was landed with ongoing or failed builds.Aug 12 2021, 2:44 PM
This revision was automatically updated to reflect the committed changes.