This is the initial set of changes for a new tool called llvm-mctoll that raises binaries back to LLVM bitcode. Currently there is support for raising Arm32 and x64 Linux ELF shared libraries and simple executables such as Dhrystone.
Here is a summary of features in varying states of completion:
- Function boundary identification. Analyzes the text section of an ELF input binary (executable or shared library) to identify function boundaries.
- CFG construction. Builds the CFG for a function and the corresponding MachineFunction representation along with the constituent MachineBlocks. The MachineFunction object is used to materialize a Function object by raising the instructions of each MachineBlock into BasicBlocks of the Function object.
- Instruction raising. Stack accesses are abstracted to alloca instructions. Various abstract instruction classes are defined - such as memory referencing instructions, floating point instructions, register move instructions, binary operator instructions, etc.
- Function prototype discovery. A MachineFunction is analyzed to create an abstract function prototype. The current implementation assumes that the binaries are 64-bit and are compiled from C sources. The function prototype discovery algorithm assumes C calling conventions and is limited to arguments passed in registers (more than 6 args, where the extra arguments are passed on the stack, is not implemented yet). Calls to variadic functions are discovered by analyzing the instructions. Linkage to external functions (such as to glibc) is handled by maintaining a table of known function signatures.
- Information from various sections of the ELF binary, such as the GOT, PLT, data sections and symbol table, is used to materialize machine-independent abstractions such as string constants, external call linkage, etc.
- There are tests that try to cover much of the major functionality for both Arm32 and x64.
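
For reviewers unfamiliar with the prototype discovery item above, here is a rough Python sketch of the general idea. It is not the code in this patch. It assumes the System V AMD64 convention (integer arguments in rdi, rsi, rdx, rcx, r8, r9), and the `discover_arg_count` helper and the `(reads, writes)` instruction encoding are hypothetical names made up for illustration. An argument register that is read before it is written in the function is treated as a live-in argument, and the argument count is taken from the highest such register.

```python
# System V AMD64 integer argument registers, in argument order (assumed ABI).
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


def discover_arg_count(instructions):
    """Estimate the number of integer register arguments of a function.

    `instructions` is a hypothetical straight-line list of (reads, writes)
    pairs, each a list of register names. Control flow is ignored here; a
    real implementation would need liveness analysis over the whole CFG.
    """
    defined = set()   # registers written so far
    live_in = set()   # argument registers read before any write
    for reads, writes in instructions:
        for reg in reads:
            if reg in ARG_REGS and reg not in defined:
                live_in.add(reg)
        defined.update(writes)
    if not live_in:
        return 0
    # Arguments are positional, so an unused lower register still counts.
    return max(ARG_REGS.index(reg) for reg in live_in) + 1


# Hypothetical entry block: rdi and rdx are read before being written.
entry_block = [
    (["rdi"], ["rax"]),         # mov rax, rdi
    (["rax", "rdx"], ["rax"]),  # add rax, rdx
    ([], ["rsi"]),              # mov rsi, 0
    (["rsi"], ["rcx"]),         # mov rcx, rsi
]
arg_count = discover_arg_count(entry_block)
```

In this example rsi is overwritten before it is read, so it is not a live-in, but because rdx (the third argument register) is live-in, the sketch reports three arguments with the second one unused.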