Introduction
LLVM supports multiple debug information formats (namely DWARF and CodeView) in different binary formats (e.g. ELF, PDB, Mach-O). Understanding the mappings between source code and debug information can be complex, and it is a problem we have commonly encountered when triaging debug information issues.
The output from tools such as llvm-dwarfdump or llvm-readobj closely mirrors the internal debug information format. In our experience, understanding that output requires a good knowledge of those formats, which limits who can triage and address such issues quickly. Even for experts, triaging an issue can take a lot of time and effort because of the inherent complexity.
llvm-dva
At Sony, we have been developing an LLVM-based debug information analysis tool which we have called llvm-dva (short for LLVM debug information visual analyzer), designed to visualize these mappings. It is based entirely on the existing LLVM libraries for debug info parsing, target support, etc. At this stage we believe that it has proven its worth internally, to the point where we would like to propose upstreaming it as part of the mainline LLVM project alongside existing tools such as llvm-dwarfdump.
llvm-dva is a command line tool that processes the debug info contained in a binary file and produces a format-agnostic "Logical View": a high-level semantic representation of the debug info, independent of the low-level format.
The logical view is composed of the traditional programming elements: scopes, types, symbols, and lines. These elements can display additional information, such as variable coverage factor, lexical block level, disassembly code, code ranges, etc.
The range of llvm-dva command line options makes it possible to create very rich logical views that include more low-level debug information: the disassembly code associated with the debug lines, the runtime location and coverage of variables, the internal offsets of the elements within the binary file, etc.
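To make the idea of a logical view more concrete, here is a minimal sketch of how a format-independent tree of scopes, types, symbols, and lines could be modelled and printed. It is an illustration only: the `Element` class, its fields, the `render` helper, and the output layout are all hypothetical and are not taken from llvm-dva.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Element:
    """One node of a hypothetical logical view (not llvm-dva's real type)."""
    kind: str                          # "scope", "type", "symbol" or "line"
    name: str
    coverage: Optional[float] = None   # assumed: percentage, symbols only
    children: List["Element"] = field(default_factory=list)

    def add(self, child: "Element") -> "Element":
        """Attach a child element and return it, so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, level: int = 0) -> Iterator[Tuple[int, "Element"]]:
        """Yield (lexical level, element) pairs in depth-first order."""
        yield level, self
        for child in self.children:
            yield from child.walk(level + 1)


def render(root: Element) -> str:
    """Print one element per line, indented by its lexical level."""
    lines = []
    for level, elem in root.walk():
        extra = ""
        if elem.coverage is not None:
            extra = f" (coverage {elem.coverage:.0f}%)"
        indent = "  " * level
        lines.append(f"[{level:03d}] {indent}{{{elem.kind}}} {elem.name}{extra}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A made-up compile unit with one function, one variable and one line.
    cu = Element("scope", "test.cpp")
    fn = cu.add(Element("scope", "foo"))
    fn.add(Element("symbol", "x", coverage=50.0))
    fn.add(Element("line", "4"))
    print(render(cu))
```

The point of the sketch is that nothing in the tree refers to DWARF or CodeView: a reader for either format would populate the same elements, and every later stage works on the tree alone.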
With llvm-dva, we aim to help answer the following questions:
- Which variables are dropped due to optimization?
- Why can't I stop at a particular line?
- Which lines are associated with a specific code range?
- Does the debug information represent the original source?
- What is the semantic difference between the debug info generated by different toolchain versions?
This interval tree is not efficient in either memory or performance, because every node holds a vector. An augmented binary search tree (a red-black tree) is superior. If there is an existing red-black tree implementation, augmenting it would also take less code (https://github.com/radareorg/radare2/pull/8381).
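As a rough illustration of the augmentation being suggested, the sketch below stores one interval per node plus a single extra integer, the largest end point in that node's subtree, which lets a query skip whole subtrees. It is a simplified sketch under stated assumptions: the names `Node`, `insert`, and `stab` are hypothetical, intervals are inclusive, and the tree is a plain unbalanced binary search tree. A red-black version would additionally recompute `max_hi` whenever a rotation changes a subtree.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    lo: int                      # interval start (inclusive)
    hi: int                      # interval end (inclusive)
    max_hi: int                  # augmentation: largest hi in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], lo: int, hi: int) -> Node:
    """Insert [lo, hi], keyed by (lo, hi); no rebalancing in this sketch."""
    if root is None:
        return Node(lo, hi, hi)
    if (lo, hi) < (root.lo, root.hi):
        root.left = insert(root.left, lo, hi)
    else:
        root.right = insert(root.right, lo, hi)
    # Keep the augmented field correct on the way back up.
    root.max_hi = max(root.max_hi, hi)
    return root


def stab(root: Optional[Node], point: int,
         out: Optional[List[Tuple[int, int]]] = None) -> List[Tuple[int, int]]:
    """Return every stored interval that contains `point`, in key order."""
    if out is None:
        out = []
    # Nothing in this subtree ends at or after `point`: prune it.
    if root is None or point > root.max_hi:
        return out
    stab(root.left, point, out)
    if root.lo <= point <= root.hi:
        out.append((root.lo, root.hi))
    # Every interval on the right starts at or after root.lo.
    if point >= root.lo:
        stab(root.right, point, out)
    return out


if __name__ == "__main__":
    tree = None
    for lo, hi in [(0x10, 0x1F), (0x18, 0x2F), (0x40, 0x4F), (0x00, 0x0F)]:
        tree = insert(tree, lo, hi)
    print([(hex(a), hex(b)) for a, b in stab(tree, 0x1A)])
```

The per-node overhead here is one integer instead of a vector, which is the memory argument made above. The query-time benefit depends on the tree staying balanced, which is why building on an existing red-black tree is attractive.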