This extends llvm-objdump and MCInstrAnalysis to be able to do more in
depth analysis of instructions which cannot be done by looking at
just a single instruction. I would like some feedback on the approach
taken so far, whether this is suitable or if some parts should be
moved or changed into more generic interfaces.
Currently MCInstrAnalysis has the evaluateBranch method which is given
a single MCInst that is known to be a branch will return if possible
the target address of that instruction, with no context other than the
current PC.
In GNU objdump, the RISC-V disassembler can determine the address of
more instructions, for example calculating branch/call targets built
over a number of instructions, and the addresses of targets of load
and store instruction, both immediates and those which are GP-relative.
To add such functionality, I have extended MCInstrAnalysis with a new
evaluateInst function, which provides the same functionality as
evaluateBranch, but has the ability to store information across multiple
instructions allowing these patterns to be picked up. The default
implementation does the same check llvm-objdump does before calling
evaluateInst, so is a NFC change for other targets, but for RISC-V
does a more detailed analysis. This allows evaluateInst to be safe to
call on all instructions and thus is called for each instruction.
In the RISC-V implementation there is a cache of the current state of
all known registers, and upon evaluating an instruction, updates the
known GPR state in case a future instruction needs that value. To avoid
mis-identifying addresses, upon a change of control flow the cache is
invalidated, except for if GP is defined (by the __global_pointer$
symbol, in which case it is reset to the known value).
For this to work there have had to be a couple of other changes to
llvm-objdump that are target specific, but would be good to make
generic. These are:
- The MCInstrAnalysis object is no longer const, should this analysis
belong in this class (which given the evaluateBranch call, I think
makes sense)
- There is a resetAnalysis hook which provides an additional signal
that control flow has changed, I think as a "new symbol/file"
indicator this should be fine.
- I've added a address printout after the instruction just for RISC-V;
this matches GNU objdump's behaviour, but does not necessarily have
to land with the rest of the patch.
- I do an explicit check for __global_pointer$ and pass this through
to MCInstrAnalysis since it has no concept of anything more than
MCInsts, there may be scope for making this more generic.
Since I haven't written tests yet, I can give an example of before/
after with this patch:
00010074 <main>: 10074: 63 15 05 00 bne a0, zero, 10 <main+0xa> 10078: 03 a5 c1 8f lw a0, -1796(gp) 1007c: 82 80 c.jr ra 1007e: 23 ae a1 8e sw a0, -1796(gp) 10082: 7d 15 c.addi a0, -1 10084: c5 bf c.j -16 <main> 00010090 <_start>: 10090: 97 21 00 00 auipc gp, 2 10094: 93 81 41 ba addi gp, gp, -1116 10098: 17 15 00 00 auipc a0, 1 1009c: 13 05 85 49 addi a0, a0, 1176 100a0: 17 16 00 00 auipc a2, 1 100a4: 13 06 06 4b addi a2, a2, 1200 100a8: 09 8e c.sub a2, a0 100aa: 81 45 c.li a1, 0 100ac: 97 00 00 00 auipc ra, 0 100b0: e7 80 20 1e jalr ra, 482(ra) 100b4: 17 05 00 00 auipc a0, 0 100b8: 13 05 85 13 addi a0, a0, 312 100bc: 97 00 00 00 auipc ra, 0 100c0: e7 80 40 0f jalr ra, 244(ra) 100c4: 97 00 00 00 auipc ra, 0 100c8: e7 80 40 16 jalr ra, 356(ra) 100cc: 02 45 c.lwsp a0, 0(sp) 100ce: 4c 00 c.addi4spn a1, sp, 4 100d0: 01 46 c.li a2, 0 100d2: 97 00 00 00 auipc ra, 0 100d6: e7 80 20 fa jalr ra, -94(ra) 100da: 17 03 00 00 auipc t1, 0 100de: 67 00 63 0e jalr zero, 230(t1)
00010074 <main>: 10074: 63 15 05 00 bne a0, zero, 10 #1007e <main+0xa> 10078: 03 a5 c1 8f lw a0, -1796(gp) #11490 <__bss_start> 1007c: 82 80 c.jr ra 1007e: 23 ae a1 8e sw a0, -1796(gp) #11490 <__bss_start> 10082: 7d 15 c.addi a0, -1 10084: c5 bf c.j -16 #10074 <main> 00010090 <_start>: 10090: 97 21 00 00 auipc gp, 2 10094: 93 81 41 ba addi gp, gp, -1116 #11c34 <_end+0x6e4> 10098: 17 15 00 00 auipc a0, 1 1009c: 13 05 85 49 addi a0, a0, 1176 #11530 <__bss_start> 100a0: 17 16 00 00 auipc a2, 1 100a4: 13 06 06 4b addi a2, a2, 1200 #11550 <_end> 100a8: 09 8e c.sub a2, a0 100aa: 81 45 c.li a1, 0 100ac: 97 00 00 00 auipc ra, 0 100b0: e7 80 20 1e jalr ra, 482(ra) #1028e <memset> 100b4: 17 05 00 00 auipc a0, 0 100b8: 13 05 85 13 addi a0, a0, 312 #101ec <__libc_fini_array> 100bc: 97 00 00 00 auipc ra, 0 100c0: e7 80 40 0f jalr ra, 244(ra) #101b0 <atexit> 100c4: 97 00 00 00 auipc ra, 0 100c8: e7 80 40 16 jalr ra, 356(ra) #10228 <__libc_init_array> 100cc: 02 45 c.lwsp a0, 0(sp) 100ce: 4c 00 c.addi4spn a1, sp, 4 100d0: 01 46 c.li a2, 0 100d2: 97 00 00 00 auipc ra, 0 100d6: e7 80 20 fa jalr ra, -94(ra) #10074 <main> 100da: 17 03 00 00 auipc t1, 0 100de: 67 00 63 0e jalr zero, 230(t1) #101c0 <exit>
Any other comments/suggestions on how to implement this are welcome.
We can just construct a new MCInstrAnalysis instance. See my main comment.