This refactoring aims to speed up DWARFLinker by using parallel execution and reducing memory usage (when the object file contains multiple compilation units). The patch creates a new library, DWARFLinkerNext, which standard dsymutil calls when the --use-dlnext command-line option is given. This is the second version of the patch; it additionally merges types and produces a predictable result (the generated .dSYM file is the same between similar runs). The best configuration produces a .debug_info table that is 68% smaller (157 MB vs. 486 MB) and runs about twice as fast as the current upstream dsymutil.
The overall linking process looks like this:

parallel_for_each(ObjectFile) {
  for_each (Compile Unit) {
    1. load Clang modules.
  }

  parallel_for_each(Compile Unit) {
    1. load input DWARF for Compile Unit.
    2. report warnings for Clang modules.
    3. analyze live DIEs.
    4. analyze types if ODR deduplication is requested.
    5. clone DIEs (generate output DIEs and resulting DWARF tables).
       The result is in OutDebugInfoBytes, which is an ELF file
       containing the DWARF tables corresponding to the current compile unit.
    6. cleanup Input and Output DIEs.
  }

  deallocate loaded Object file.
}

if (ODR deduplication is requested)
  generate artificial compilation unit "Type Table"
  (it uses DIEs partially generated at the clone stage).

for_each (ObjectFile) {
  for_each (Compile Unit) {
    1. set offsets to Compile Units DWARF tables.
    2. sort offsets/attributes/patches to have a predictable result.
    3. patch size/offsets fields.
    4. generate index tables.
    5. move DWARF tables of compile units into the resulting file.
  }
}
That is, every compile unit is processed separately and visited only once
(except when inter-CU references exist), and the data it used is freed
after the compile unit is processed. The resulting file is glued together
from the generated debug tables corresponding to the separate compile units.
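
The two phases described above can be illustrated with a minimal sketch. It is not the patch's code: UnitOutput, cloneUnitsInParallel and glueUnits are hypothetical names, the "cloning" is a placeholder copy, and plain std::thread stands in for whatever thread pool the linker actually uses. It only shows why independent per-unit output buffers need no locking and why the sequential gluing step gives the same layout regardless of thread scheduling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the DWARF bytes generated for one compile unit.
struct UnitOutput {
  std::string Bytes;        // filled in during the parallel phase
  std::uint64_t Offset = 0; // assigned later, during the sequential phase
};

// Phase 1: every unit writes only to its own slot, so no locking is needed.
// The "cloning" is a placeholder that simply copies the input bytes.
std::vector<UnitOutput>
cloneUnitsInParallel(const std::vector<std::string> &Inputs,
                     unsigned NumThreads) {
  assert(NumThreads > 0 && "need at least one worker");
  std::vector<UnitOutput> Outputs(Inputs.size());
  std::vector<std::thread> Workers;
  for (unsigned T = 0; T < NumThreads; ++T)
    Workers.emplace_back([&, T] {
      // Static striping: worker T handles units T, T+N, T+2N, ...
      for (std::size_t I = T; I < Inputs.size(); I += NumThreads)
        Outputs[I].Bytes = Inputs[I];
    });
  for (std::thread &W : Workers)
    W.join();
  return Outputs;
}

// Phase 2: assign offsets in input order and concatenate the pieces.
// The order does not depend on which thread finished first.
std::string glueUnits(std::vector<UnitOutput> &Units) {
  std::string Result;
  for (UnitOutput &U : Units) {
    U.Offset = Result.size();
    Result += U.Bytes;
  }
  return Result;
}
```

Calling cloneUnitsInParallel on a list of inputs and then glueUnits on the result yields the inputs concatenated in their original order, with each unit's Offset set to its start position in the glued buffer.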
Handling inter-CU references: inter-CU references are hard to process
in a single pass. For example, if CU1 references CU100 and CU100 references
CU1, we cannot finish handling CU1 until we have finished CU100.
Thus we need either to load all CUs into memory or to load CUs several times.
This patch discovers dependencies during the first pass, so that dependent
compile units stay loaded in memory. Then, during the second pass, it handles
the inter-connected compile units.
This approach works well when the number of inter-CU references is low. It avoids loading all CUs into memory at once.
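
The dependency discovery can be modeled with a small sketch. This is a simplified assumption, not the patch's logic: unitsToKeepLoaded is a hypothetical helper, references are given as (from, to) pairs of unit indices, and any unit taking part in a cross-unit reference is kept for the second pass. The description does not spell out the exact rule the patch applies.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Returns the indices of units that take part in at least one inter-CU
// reference. In this simplified model those units stay loaded after the
// first pass; every other unit can be freed as soon as it is processed.
std::set<std::size_t> unitsToKeepLoaded(
    std::size_t NumUnits,
    const std::vector<std::pair<std::size_t, std::size_t>> &Refs) {
  std::set<std::size_t> Keep;
  for (const auto &Ref : Refs) {
    assert(Ref.first < NumUnits && Ref.second < NumUnits &&
           "reference to an unknown unit");
    if (Ref.first == Ref.second)
      continue; // intra-CU reference: resolvable within a single pass
    Keep.insert(Ref.first);  // referencing unit waits for its target
    Keep.insert(Ref.second); // referenced unit must remain available
  }
  return Keep;
}
```

With few cross-unit references the returned set stays small, which is the case where this scheme saves the most memory.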
Changes from the current implementation (these make DWARFLinkerNext binary incompatible with the current DWARFLinker):

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). An abbreviation table does not take a lot of space, so having separate abbreviation tables looks cheap. Later, this might be optimized a bit (by removing identical abbreviation tables).

b) .debug_frame. Already generated CIE records are not reused between object files.

c) ODR type deduplication uses a different approach from the current dsymutil. All types are moved into the artificial compilation unit. References to the types are changed so that they point to the types in the artificial compilation unit. Declarations and partial declarations of the same type are merged into a single definition/declaration located in the artificial compilation unit.
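
The merging described in c) can be pictured with a minimal sketch. It rests on loud assumptions: TypeTable and TypeEntry are hypothetical names, types are keyed by a plain name string, and a "partial declaration" is modeled as a subset of member names. The actual patch works on DIEs and is not reproduced here.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical merged record for one type in the artificial unit.
struct TypeEntry {
  bool IsDefinition = false;     // becomes true once any definition is seen
  std::set<std::string> Members; // union of members from all partial copies
};

// Simplified model of the artificial "Type Table" compilation unit.
class TypeTable {
public:
  // Merge one occurrence of a type, seen in some compile unit.
  void add(const std::string &Name, bool IsDefinition,
           const std::set<std::string> &Members) {
    assert(!Name.empty() && "this sketch keys types by name");
    TypeEntry &E = Types[Name];
    E.IsDefinition = E.IsDefinition || IsDefinition;
    E.Members.insert(Members.begin(), Members.end());
  }

  // std::map iterates in sorted key order, so the output order does not
  // depend on the order in which compile units reported their types.
  const std::map<std::string, TypeEntry> &types() const { return Types; }

private:
  std::map<std::string, TypeEntry> Types;
};
```

In this model, several occurrences of the same type collapse into one entry that is a definition if any occurrence was, and references from other units would point at that single entry.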
Some DWARF features are not yet supported (though support could be added): DWARF64, split DWARF, DWARF 5.
Performance results for this patch for the clang binary (Darwin, 24-core, 64G):
- clang binary, ODR deduplication is ON:
|---------------------------------------------------------------------|
|       |           dsymutil           |    dsymutil --use-dlnext     |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  | DWARF(*)|exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |   MB    |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   161   |   16.5   |   486   |   204   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   2   |   101   |   17.9   |   486   |   118   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   4   |   101   |   17.9   |   486   |   75    |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   8   |   101   |   17.9   |   486   |   51    |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   16  |   101   |   17.9   |   486   |   42    |   12.2   |   157   |
|---------------------------------------------------------------------|
(*) DWARF is the size of .debug_info section.
- clang binary, ODR deduplication is OFF (--no-odr):
|---------------------------------------------------------------------|
|       |           dsymutil           |    dsymutil --use-dlnext     |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |  DWARF  |exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |   MB    |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   227   |   16.3   |  1460   |   227   |   16.2   |  1470   |
|-------|------------------------------|------------------------------|
|   2   |   218   |   17.9   |  1460   |   132   |   16.2   |  1470   |
|-------|------------------------------|------------------------------|
|   4   |   218   |   17.9   |  1460   |   83    |   16.5   |  1470   |
|-------|------------------------------|------------------------------|
|   8   |   218   |   17.9   |  1460   |   58    |   16.5   |  1470   |
|-------|------------------------------|------------------------------|
|   16  |   218   |   17.9   |  1460   |   47    |   16.9   |  1470   |
|---------------------------------------------------------------------|
The effective CPU utilization ratio for this implementation is 52%.
There is still room for improvement. Run-time performance might be improved by up to 10-20%. Run-time memory usage might be decreased by up to 10%.
clang-tidy: warning: invalid case style for parameter 'unloadFunc' [readability-identifier-naming]