
[WIP][dsymutil][DWARFlinker] implement separate multi-thread processing for compile units.
Needs Review · Public

Authored by avl on Feb 4 2021, 7:11 AM.

Details

Summary

The idea of this refactoring is to speed up DWARFLinker by utilizing parallel execution
and minimizing memory usage (in the case where an object file contains multiple compile units).
This patch creates a new library, DWARFLinkerNext, which the standard dsymutil calls
when the --use-dlnext command line option is given.

The overall linking process looks like this:

parallel_for_each(ObjectFile) {
  for_each (Compile Unit) {
    1. load Clang modules.
  }

  parallel_for_each(Compile Unit) {
    1. load input DWARF for Compile Unit.
    2. report warnings for Clang modules.
    3. search for type declarations.
    4. analyze live DIEs.
    5. clone DIEs (generate output DIEs and resulting DWARF tables).
       The result is OutDebugInfoBytes, an ELF file containing
       the DWARF tables corresponding to the current compile unit.
    6. cleanup input and output DIEs.
  }

  deallocate loaded Object file.
}

for_each (ObjectFile) {
  for_each (Compile Unit) {
    1. set offsets for each Compile Unit's DWARF tables.
    2. patch size/offsets fields.
    3. generate index tables.
    4. move DWARF tables of compile units into the resulting file.
  }
}

i.e. every compile unit is processed separately and visited only once
(except when inter-CU references exist), and the data it used are freed
after the compile unit is processed. The resulting file is glued together
from the generated debug tables corresponding to the separate compile units.
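The two-pass structure above can be sketched very schematically in C++ (hypothetical CompileUnit/ObjectFile stand-ins, raw std::thread instead of LLVM's real parallel utilities; this is only an illustration of the control flow, not the patch's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real DWARFLinker data structures.
struct CompileUnit { bool cloned = false; };
struct ObjectFile { std::vector<CompileUnit> CUs; };

// Pass 1: each compile unit is loaded, analyzed, and cloned
// independently, so the units can be processed in parallel.
inline void cloneUnitsInParallel(ObjectFile &Obj) {
  std::vector<std::thread> Workers;
  for (CompileUnit &CU : Obj.CUs)
    Workers.emplace_back([&CU] {
      // load input DWARF, analyze live DIEs, clone DIEs...
      CU.cloned = true;
    });
  for (std::thread &T : Workers)
    T.join();
  // the loaded object file data can be deallocated here
}

// Pass 2 (serial): assign final offsets and glue the per-unit
// tables into the resulting file.
inline std::size_t layoutUnits(const ObjectFile &Obj) {
  std::size_t Emitted = 0;
  for (const CompileUnit &CU : Obj.CUs)
    if (CU.cloned)
      ++Emitted; // patch sizes/offsets, append tables...
  return Emitted;
}
```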

Handling inter-CU references: inter-CU references are hard to process
using only one pass. E.g., if CU1 references CU100 and CU100 references
CU1, we cannot finish handling CU1 until we have finished CU100.
Thus we need either to load all CUs into memory or to load CUs several times.

This patch discovers dependencies during the first pass, so that inter-dependent compile units stay loaded in memory. Then, during the second pass, it handles the inter-connected compile units.

That approach works well when the number of inter-CU references is low: it avoids loading all CUs into memory.
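A minimal sketch of this dependency-tracking scheme (hypothetical CU type and flags; the real patch discovers actual inter-CU reference dependencies while walking the DIEs):

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the two-pass handling of inter-CU references.
struct CU {
  bool hasInterCURef = false; // discovered during the first pass
  bool loaded = false;
  bool done = false;
};

// First pass: self-contained CUs are processed completely and freed;
// CUs participating in inter-CU references are kept loaded.
inline void firstPass(std::vector<CU> &Units) {
  for (CU &U : Units) {
    U.loaded = true;
    if (!U.hasInterCURef) {
      U.done = true;    // fully processed in a single visit
      U.loaded = false; // memory freed right away
    }
  }
}

// Second pass: finish the inter-connected CUs that stayed in memory.
inline void secondPass(std::vector<CU> &Units) {
  for (CU &U : Units)
    if (U.loaded) {
      U.done = true;
      U.loaded = false;
    }
}
```

Only the CUs that actually participate in inter-CU references pay the memory cost of staying loaded, which is why the approach works well when such references are rare.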

Changes from the current implementation (making DWARFLinkerNext binary incompatible with the current DWARFLinker):

a) No common abbreviation table. Each compile unit has
   its own abbreviation table. Generating a common abbreviation table slows down parallel execution
   (it is a resource accessed many times from many threads). The abbreviation table does not take much space,
   so having separate abbreviation tables looks cheap.
   Later, this might be optimized a bit (by merging equal abbreviation tables).
   
b) Removed the check for ambiguous type declarations.
   The existing check does not solve the problem in general (it does not work if the ambiguous
   types are in different compile units).

c) .debug_frame. Already-generated CIE records are not reused between object files.

d) In multi-thread mode there is no predictable order in which compile units/DIEs are processed.
   That leads to selecting "canonical" DIEs, or adding strings to the .debug_str table, in an unpredictable order.
   The output is semantically correct but contains non-matching references:
   
   DW_AT_name [DW_FORM_strp]     ( .debug_str[0x00000095] = "AnonC")   
   DW_AT_name [DW_FORM_strp]     ( .debug_str[0x0000007f] = "AnonC")
   
   DW_AT_type [DW_FORM_ref_addr] (0x000000000000014f "U")
   DW_AT_type [DW_FORM_ref_addr] (0x0000000000000170 "U")
   
e) .debug_names is not supported by this patch.
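The non-matching .debug_str offsets in (d) are inherent to insertion-order string pooling. A minimal sketch (illustrative only, not the actual DWARFLinker string pool) shows why two thread schedules yield different, equally valid, offsets for the same string:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Minimal .debug_str-style pool: each unique string is assigned the
// offset at which it was first appended, so offsets depend purely on
// insertion order.
class StringPool {
  std::map<std::string, std::size_t> Offsets;
  std::size_t Size = 0;

public:
  std::size_t add(const std::string &S) {
    auto It = Offsets.find(S);
    if (It != Offsets.end())
      return It->second; // already pooled: stable offset within this pool
    std::size_t Off = Size;
    Offsets.emplace(S, Off);
    Size += S.size() + 1; // account for the NUL terminator
    return Off;
  }
};
```

Within one run every reference to a given string resolves to the same offset, so the output is self-consistent; only byte-for-byte comparison between runs differs.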

Performance results of this patch for the clang binary:

|-------|------------------------------|------------------------------|
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |dSYM size|exec time|  memory  |dSYM size|
|       |   sec   |    GB    |    GB   |   sec   |    GB    |   GB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   144   |   15.2   |   1.2   |   165   |   15.4   |   1.2   |
|-------|------------------------------|------------------------------|
|   2   |    92   |   16.7   |   1.2   |    93   |   15.3   |   1.2   |
|-------|------------------------------|------------------------------|
|   4   |    92   |   16.7   |   1.2   |    59   |   15.4   |   1.2   |
|-------|------------------------------|------------------------------|
|   8   |    92   |   16.7   |   1.2   |    40   |   15.8   |   1.3   |
|-------|------------------------------|------------------------------|
|  16   |    92   |   16.7   |   1.2   |    35   |   15.8   |   1.3   |
|-------|------------------------------|------------------------------|

Performance results of this patch for the clang binary built with modules:

|-------|------------------------------|------------------------------|
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |dSYM size|exec time|  memory  |dSYM size|
|       |   sec   |    GB    |    GB   |   sec   |    GB    |   GB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |    61   |   8.3    |   1.1   |    71   |    9.2   |   1.1   |
|-------|------------------------------|------------------------------|
|   2   |    50   |   8.7    |   1.1   |    44   |    9.2   |   1.1   |
|-------|------------------------------|------------------------------|
|   4   |    50   |   8.7    |   1.1   |    32   |    9.7   |   1.1   |
|-------|------------------------------|------------------------------|
|   8   |    50   |   8.7    |   1.1   |    25   |    9.7   |   1.1   |
|-------|------------------------------|------------------------------|
|  16   |    50   |   8.7    |   1.1   |    23   |    9.7   |   1.1   |
|-------|------------------------------|------------------------------|

The effective CPU utilization ratio of this implementation is 43% (i.e., it could theoretically be improved ~2x).

Testing. The patch passes the current dsymutil lit test set (though it requires some GF updates):

--use-dlnext --num-threads 1

dsymutil/X86/accelerator.test - fails, since the debug_names table is not supported by the patch.
dsymutil/X86/dsym-companion.test - update GF (the offset to the abbreviation table is no longer 0).
dsymutil/X86/frame-1.test - update GF (CIE records are no longer reused between object files).
dsymutil/X86/generate-empty-CU.test - update GF.

--use-dlnext --num-threads 8

Besides the above failures, there are several more, which happen because of the unpredictable order of compile unit handling
(due to the nature of parallel execution).

tools/dsymutil/X86/odr-fwd-declaration.cpp
tools/dsymutil/X86/odr-member-functions.cpp
tools/dsymutil/X86/odr-uniquing.cpp

I marked this work as WIP since I would like to gather opinions on whether this refactoring is useful and to consult on implementation details.

Diff Detail

Unit Tests Failed

Time    Test
360 ms  x64 debian > libarcher.races::task-dependency.c
390 ms  x64 debian > libarcher.races::task-taskgroup-unrelated.c
330 ms  x64 debian > libarcher.races::task-taskwait-nested.c
350 ms  x64 debian > libarcher.races::task-two.c
240 ms  x64 debian > libarcher.task::task-barrier.c
View Full Test Results (13 Failed)

Event Timeline

avl created this revision.Feb 4 2021, 7:11 AM
avl requested review of this revision.Feb 4 2021, 7:11 AM
Herald added a project: Restricted Project. · View Herald Transcript
avl edited the summary of this revision. (Show Details)Feb 4 2021, 7:16 AM

Handling inter-CU references: inter-CU references are hard to process using only one pass. E.g., if CU1 references CU100 and CU100 references CU1, we cannot finish handling CU1 until we have finished CU100. Thus we need either to load all CUs into memory or to load CUs several times.

What kind of inter-CU references do you see? There really shouldn't be any unless they are all in the same .o file and were produced by an LTO pass. Are there any inter CU references that I am forgetting about?

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource accessed many times from many threads). The abbreviation table does not take much space, so having separate abbreviation tables looks cheap. Later, this might be optimized a bit (by merging equal abbreviation tables).

Can't the linking process create one large uniqued abbrev table as it links each .o file's DWARF into the final output? This info isn't that large, it's true, but it'd be nice to have just one table as long as we are optimizing the DWARF.

b) Removed the check for ambiguous type declarations. The existing check does not solve the problem in general (it does not work if the ambiguous types are in different compile units).

I am assuming you didn't pull out ODR type uniquing all together right? That is very important and one of the main reasons for using dsymutil. There are many things that can make this ODR type uniquing fail to work correctly, but those are bugs that should be fixed. One of the main things is if a class or struct has a forward declaration to another class inside of it. If this is the case the llvm dsymutil doesn't try and see if everyone has the same full class definition with a forward decl on it, it just punts and emits another copy of the type. I saw this when I would have a "FILE" type from the unix fopen(...) function. If multiple files had this type in a C++ source file, they wouldn't get uniqued because one of the members had a forward declaration. The reason it doesn't get uniqued is dsymutil worries someone might have a "FILE" type info in the DWARF that _does_ have a full definition for that forward declaration. A lot of stuff the compiler emits these days also causes many of the types to not be correctly removed even though they aren't used.

So there are many C++ compiler things that seem to be keeping things alive in the DWARF output by dsymutil these days that used to not be there before. When ODR works correctly, the first instance of a type gets emitted from the first object file that has it, and then all other .o files that have this same type remove it and have all CU references within changed to a DW_FORM_ref_addr inter-CU reference. We should make sure that this isn't happening to many clang types and causing us to not unique types.

When the type uniquing works correctly it really reduces the output DWARF size. Back at Apple the WebCore.framework DWARF debug info went from 1.8GB down to 450MB.

As an example of C++ compiler type bloat, you can see this if you do "#include <cstdio>". If you compile this with a recent clang:

#include <stdio.h>

int main() {
    printf("hello\n");
    return 0;
}

The DWARF looks very small:

0x0000000b: DW_TAG_compile_unit
              DW_AT_producer	("Apple clang version 12.0.0 (clang-1200.0.32.29)")
              DW_AT_language	(DW_LANG_C_plus_plus_11)
              DW_AT_name	("main.cpp")
              DW_AT_LLVM_sysroot	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk")
              DW_AT_APPLE_sdk	("MacOSX.sdk")
              DW_AT_stmt_list	(0x00000000)
              DW_AT_comp_dir	("/Users/gclayton/Documents/src/args")
              DW_AT_low_pc	(0x0000000100003f60)
              DW_AT_high_pc	(0x0000000100003f8a)

0x00000032:   DW_TAG_subprogram
                DW_AT_low_pc	(0x0000000100003f60)
                DW_AT_high_pc	(0x0000000100003f8a)
                DW_AT_frame_base	(DW_OP_reg6 RBP)
                DW_AT_name	("main")
                DW_AT_decl_file	("/Users/gclayton/Documents/src/args/main.cpp")
                DW_AT_decl_line	(4)
                DW_AT_type	(0x000000000000004b "int")
                DW_AT_external	(true)

0x0000004b:   DW_TAG_base_type
                DW_AT_name	("int")
                DW_AT_encoding	(DW_ATE_signed)
                DW_AT_byte_size	(0x04)

0x00000052:   NULL

But if you include the "cstdio" file instead:

#include <cstdio>

int main() {
    printf("hello\n");
    return 0;
}

This ends up having tons of extra type stuff that really bloats the DWARF. The C++ header ends up with full definitions for every function in stdio.h and data types (like "FILE"!) for functions that are not used in the .o file. In this case the C++ code adds DW_TAG_imported_declaration for all these types, and that ends up keeping all of these extra types alive since something refers to them, leaving way too much stuff in the final dSYM. We can see the DW_TAG_imported_declaration below:

0x00000032:   DW_TAG_namespace
                DW_AT_name	("std")

0x00000037:     DW_TAG_namespace
                  DW_AT_name	("__1")
                  DW_AT_export_symbols	(true)

0x0000003c:       DW_TAG_imported_declaration
                    DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cstdio")
                    DW_AT_decl_line	(109)
                    DW_AT_import	(0x0000000000000195)

And later we see:

0x00000193:       NULL

0x00000194:     NULL

0x00000195:   DW_TAG_typedef
                DW_AT_type	(0x00000000000001a0 "__sFILE")
                DW_AT_name	("FILE")
                DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                DW_AT_decl_line	(157)

0x000001a0:   DW_TAG_structure_type
                DW_AT_calling_convention	(DW_CC_pass_by_value)
                DW_AT_name	("__sFILE")
                DW_AT_byte_size	(0x98)
                DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                DW_AT_decl_line	(126)

0x000001a9:     DW_TAG_member
                  DW_AT_name	("_p")
                  DW_AT_type	(0x000000000000029a "unsigned char*")
                  DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                  DW_AT_decl_line	(127)
                  DW_AT_data_member_location	(0x00)
....

c) .debug_frame. Already generated CIE records are not reused between object files

Can't we unique these when the .o file gets linked on the main thread?

d) In multi-thread mode there is no predictable order of how compile units/dies are processed. That leads to selecting "canonical" dies, or adding strings into the debug_str table in some unpredictable order. It is semantically correct but creates not-matching references:

DW_AT_name [DW_FORM_strp]     ( .debug_str[0x00000095] = "AnonC")   
DW_AT_name [DW_FORM_strp]     ( .debug_str[0x0000007f] = "AnonC")

DW_AT_type [DW_FORM_ref_addr] (0x000000000000014f "U")
DW_AT_type [DW_FORM_ref_addr] (0x0000000000000170 "U")

Yes, this will happen, and it is OK as long as "--num-threads 1" allows consistent builds. Any clients that require dSYM files to be identical can use this slower method.

e) .debug_names is not supported by this patch.

Any reason why this can't be done? I believe that dsymutil just generates the .debug_names (or Apple accelerator tables) after you have written all of the DWARF out by traversing it no? If this doesn't happen, it can be added with the --update option to dsymutil, but again, this takes time.

One of the main benefits of running dsymutil is the accelerator tables that speed up debugging. If you are solely looking for a way to reduce link times, then this succeeds. But if there are no accelerator tables, each time you debug this the debugger is going to have to manually index all of the DWARF and cause long delays each time you debug something. You might as well just not run the dsymutil and leave the DWARF in the .o files as that is the workflow that is designed to be used during the edit/compile/debug workflow.

I haven't looked at the code itself yet, just wanted to get some more overall info on the approach.

Snipping a fragment here:

A lot of stuff the compiler emits these days also causes many of the types to not be correctly removed even though they aren't used.
So there are many C++ compiler things that seem to be keeping things alive in the DWARF output by dsymutil these days that used to not be there before.

Depends on your definition of "used", I guess, and I don't think it's
much related to dsymutil. At least the example you give is
specifically about imported_declarations - I implemented those (~7
years ago: https://github.com/llvm/llvm-project/commit/bd4837665b2ee1e77dc9337e11ee2aed52a00a93)
& yeah, they're not great. Specifically the compiler doesn't track
if they are used for name lookup, so we don't have a way to only emit
those that were used - and even if we did emit only the used ones,
they aren't used in the DWARF sense (if you have a "using namespace
x;" in code - references to names in namespace 'x' don't refer to that
using declaration in the DWARF (unlike a typedef, or a using
declaration "using x = y;")) so you wouldn't be able to strategically
drop them in dsymutil. (eg: if a certain using directive was only used
by one function body, but that function was optimized away or not
linked - there's no chain of liveness between that function and the
using directive - no way to know which using directives should be kept
around).

& even using declarations for functions (unlike for types) - again, no
chain of connection. So if you call a function via a using declaration
("void f1(); using f2 = f1; void f3() { f2(); }") there's nothing in
the DWARF that says f2 is used by f3 - with call_site debug info we
could describe the call-site, but it wouldn't cover all references to
f2 (indirect calls, for instance).

So, yeah, compiler improvements to track the used-ness of using
declarations and directives would be nice. Oh, but even if we knew
they were used, we'd have to know /where/ they were used from (eg: if
they were used in an inline function that was never called, that
wouldn't be much use to us/we wouldn't want to emit them in the
resulting DWARF).

But even if we did have that information, unfortunately I'm not sure
it'd help dsymutil greatly - because currently those entities don't
always form chains of ownership within the DWARF, so you just sort of
have to keep carrying them.

Though they'd still benefit from all the usual type deduplication
stuff - so this is probably orthogonal to the discussion.

  • Dave
avl added a comment.Feb 5 2021, 7:38 AM

Handling inter-CU references: inter-CU references are hard to process using only one pass. E.g., if CU1 references CU100 and CU100 references CU1, we cannot finish handling CU1 until we have finished CU100. Thus we need either to load all CUs into memory or to load CUs several times.

What kind of inter-CU references do you see? There really shouldn't be any unless they are all in the same .o file and were produced by an LTO pass. Are there any inter CU references that I am forgetting about?

yes. That is the case: they are all in the same .o file and were produced by an LTO pass.

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource accessed many times from many threads). The abbreviation table does not take much space, so having separate abbreviation tables looks cheap. Later, this might be optimized a bit (by merging equal abbreviation tables).

Can't the linking process create one large uniqued abbrev table as it links each .o file's DWARF into the final output? This info isn't that large, it's true, but it'd be nice to have just one table as long as we are optimizing the DWARF.

If we try to generate it while cloning DIEs, it leads to slow execution because threads are waiting for each other
(all threads need access to the global abbreviation table for every DIE, so execution time is spent on waiting).

If we try to implement it as a post-processing step (when all compile units are already processed), it would also require too much additional time. For each DIE, we would need to add its abbreviations to the global abbreviation table, which might change the abbreviation code. Since the abbreviation code is ULEB128-encoded, the size of the DIE might change; that means we would need to update all reference attributes. It would take a significant amount of time. Technically it might be done, but it does not look practical.

Probably, it might easily be done for the "--num-threads 1" case only.
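The ULEB128 concern is easy to see with a size helper (illustrative only): abbreviation codes 1-127 encode in one byte while 128 and above need two, so renumbering abbreviations after merging tables can change DIE sizes and thereby invalidate every already-computed offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes needed to encode Value as ULEB128 (7 payload bits
// per byte), the encoding DWARF uses for abbreviation codes.
inline std::size_t uleb128Size(std::uint64_t Value) {
  std::size_t Bytes = 0;
  do {
    Value >>= 7;
    ++Bytes;
  } while (Value != 0);
  return Bytes;
}
```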

b) Removed the check for ambiguous type declarations. The existing check does not solve the problem in general (it does not work if the ambiguous types are in different compile units).

I am assuming you didn't pull out ODR type uniquing all together right? That is very important and one of the main reasons for using dsymutil.

Right. This patch does mostly the same ODR deduplication as the current dsymutil.
There are small differences:

  1. In multi-thread mode, some canonical DIEs might not be seen by some threads.

    If requests to set a canonical DIE come from several threads at the same time, some threads might receive the information that there is no canonical DIE yet (though there actually is one; it is about to be set by another thread). As a result, some types would not be deduplicated. But the number of such cases is small; for clang, there is no noticeable difference in the size of the binary.
  2. Ambiguous types. That is what is currently done by this code: https://github.com/llvm/llvm-project/blob/main/llvm/lib/DWARFLinker/DWARFLinkerDeclContext.cpp#L19

    I.e., when dsymutil cannot distinguish a function type because of parameterized arguments, it marks the declaration context as invalid. The current workaround does not solve the problem fully. A similar approach might be implemented (with the same limitations) in this patch. But, probably, we might want to think of a better solution.
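The benign race in point 1 can be sketched as a first-writer-wins atomic registration (hypothetical types; the patch's actual synchronization may differ):

```cpp
#include <atomic>
#include <cassert>

// First-writer-wins registration of a canonical DIE for a declaration
// context. If two threads race, exactly one DIE becomes canonical; a
// thread that queries before the winner's store simply clones its own
// copy of the type - semantically correct output, one missed
// deduplication.
struct DeclContext {
  std::atomic<const void *> Canonical{nullptr};

  const void *getOrSetCanonical(const void *Die) {
    const void *Expected = nullptr;
    if (Canonical.compare_exchange_strong(Expected, Die))
      return Die;    // we installed the canonical DIE
    return Expected; // another thread's DIE won the race
  }
};
```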

There are many things that can make this ODR type uniquing fail to work correctly, but those are bugs that should be fixed. One of the main things is if a class or struct has a forward declaration to another class inside of it. If this is the case the llvm dsymutil doesn't try and see if everyone has the same full class definition with a forward decl on it, it just punts and emits another copy of the type. I saw this when I would have a "FILE" type from the unix fopen(...) function. If multiple files had this type in a C++ source file, they wouldn't get uniqued because one of the members had a forward declaration. The reason it doesn't get uniqued is dsymutil worries someone might have a "FILE" type info in the DWARF that _does_ have a full definition for that forward declaration. A lot of stuff the compiler emits these days also causes many of the types to not be correctly removed even though they aren't used.

So there are many C++ compiler things that seem to be keeping things alive in the DWARF output by dsymutil these days that used to not be there before. When ODR works correctly, the first instance of a type gets emitted from the first object file that has it, and then all other .o files that have this same type remove it and have all CU references within changed to a DW_FORM_ref_addr inter-CU reference. We should make sure that this isn't happening to many clang types and causing us to not unique types.

When the type uniquing works correctly it really reduces the output DWARF size. Back at Apple the WebCore.framework DWARF debug info went from 1.8GB down to 450MB.

As an example of C++ compiler type bloat, you can see this if you do "#include <cstdio>". If you compile this with a recent clang:

#include <stdio.h>

int main() {
    printf("hello\n");
    return 0;
}

The DWARF looks very small:

0x0000000b: DW_TAG_compile_unit
              DW_AT_producer	("Apple clang version 12.0.0 (clang-1200.0.32.29)")
              DW_AT_language	(DW_LANG_C_plus_plus_11)
              DW_AT_name	("main.cpp")
              DW_AT_LLVM_sysroot	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk")
              DW_AT_APPLE_sdk	("MacOSX.sdk")
              DW_AT_stmt_list	(0x00000000)
              DW_AT_comp_dir	("/Users/gclayton/Documents/src/args")
              DW_AT_low_pc	(0x0000000100003f60)
              DW_AT_high_pc	(0x0000000100003f8a)

0x00000032:   DW_TAG_subprogram
                DW_AT_low_pc	(0x0000000100003f60)
                DW_AT_high_pc	(0x0000000100003f8a)
                DW_AT_frame_base	(DW_OP_reg6 RBP)
                DW_AT_name	("main")
                DW_AT_decl_file	("/Users/gclayton/Documents/src/args/main.cpp")
                DW_AT_decl_line	(4)
                DW_AT_type	(0x000000000000004b "int")
                DW_AT_external	(true)

0x0000004b:   DW_TAG_base_type
                DW_AT_name	("int")
                DW_AT_encoding	(DW_ATE_signed)
                DW_AT_byte_size	(0x04)

0x00000052:   NULL

But if you include the "cstdio" file instead:

#include <cstdio>

int main() {
    printf("hello\n");
    return 0;
}

This ends up having tons of extra type stuff that really bloats the DWARF. The C++ header ends up with full definitions for every function in stdio.h and data types (like "FILE"!) for functions that are not used in the .o file. In this case the C++ code adds DW_TAG_imported_declaration for all these types, and that ends up keeping all of these extra types alive since something refers to them, leaving way too much stuff in the final dSYM. We can see the DW_TAG_imported_declaration below:

0x00000032:   DW_TAG_namespace
                DW_AT_name	("std")

0x00000037:     DW_TAG_namespace
                  DW_AT_name	("__1")
                  DW_AT_export_symbols	(true)

0x0000003c:       DW_TAG_imported_declaration
                    DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cstdio")
                    DW_AT_decl_line	(109)
                    DW_AT_import	(0x0000000000000195)

And later we see:

0x00000193:       NULL

0x00000194:     NULL

0x00000195:   DW_TAG_typedef
                DW_AT_type	(0x00000000000001a0 "__sFILE")
                DW_AT_name	("FILE")
                DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                DW_AT_decl_line	(157)

0x000001a0:   DW_TAG_structure_type
                DW_AT_calling_convention	(DW_CC_pass_by_value)
                DW_AT_name	("__sFILE")
                DW_AT_byte_size	(0x98)
                DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                DW_AT_decl_line	(126)

0x000001a9:     DW_TAG_member
                  DW_AT_name	("_p")
                  DW_AT_type	(0x000000000000029a "unsigned char*")
                  DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                  DW_AT_decl_line	(127)
                  DW_AT_data_member_location	(0x00)
....

Regarding the problem of imported declarations: this patch does the same as the current dsymutil.
A solution for that problem would be nice to have, though it is out of the scope of this patch.

c) .debug_frame. Already generated CIE records are not reused between object files

Can't we unique these when the .o file gets linked on the main thread?

We can, but then we would need to re-parse the generated debug_frame.
If it does not take a lot of time, we might try it.

e) .debug_names is not supported by this patch.

Any reason why this can't be done? I believe that dsymutil just generates the .debug_names (or Apple accelerator tables) after you have written all of the DWARF out by traversing it no? If this doesn't happen, it can be added with the --update option to dsymutil, but again, this takes time.

There is no fundamental reason; I think it can be supported.
I just need more time to study that code and understand how to implement it properly.
I will update this patch with the implementation later.

avl added a comment.Feb 22 2021, 9:55 AM

@JDevlieghere @aprantl @echristo @dblaikie @clayborg

What do you think of this patch? Is it OK in general, and is it worth dividing into smaller patches and integrating?

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

I believe lld's multithreading still produces deterministic output, for instance.

avl added a comment.Feb 22 2021, 11:12 AM

Yep, this implementation is non-deterministic.
I tried to preserve the current ODR algorithm (which relies on concrete type sequences).

I thought about the following improvement (though not for this patch),
which should be deterministic even in multi-threaded mode:

Implement artificial type table with type merging:

Source DWARF:

0x0000000b: DW_TAG_compile_unit 
              DW_AT_name = "module1.cpp"
                            
0x0000002e: DW_TAG_structure_type 
              DW_AT_name = "S"

              NULL

0x0000008c:   DW_TAG_subprogram 

0x000000aa:     DW_TAG_variable  
                  DW_AT_name = "s"
                  DW_AT_type (0x0000002e "S")

            NULL

0x0000012c: DW_TAG_compile_unit 
              DW_AT_name "module2.cpp"

0x0000014f:   DW_TAG_structure_type 
                DW_AT_name = "S"

0x00000157:     DW_TAG_structure_type
                  DW_AT_name = "Nested"

0x00000160: DW_TAG_subprogram 

0x00000168:   DW_TAG_variable  
                   DW_AT_name = "n"
                   DW_AT_type (0x00000157 "Nested")

            NULL

New DWARF:

0x0000000b: DW_TAG_compile_unit
              DW_AT_name = "__type_table"

0x0000002e:   DW_TAG_structure_type 
                DW_AT_name = "S"

0x00000036:     DW_TAG_structure_type
                DW_AT_name = "Nested"

              NULL

0x0000012c: DW_TAG_compile_unit
              DW_AT_name = "module1.cpp"

0x0000014c:   DW_TAG_subprogram 

0x00000150:     DW_TAG_variable  
                  DW_AT_name = "s"
                  DW_AT_type (0x0000002e "S")

            NULL

0x0000018c: DW_TAG_compile_unit
              DW_AT_name "module2.cpp"

0x000001bc:   DW_TAG_subprogram 

0x000001c0:     DW_TAG_variable  
                  DW_AT_name = "n"
                  DW_AT_type (0x00000036 "Nested")

            NULL

In this example, the "New DWARF" is the result of an ODR uniquing algorithm different from the current one.
The current ODR algorithm is: remember the first occurrence of a type and point all following references (to the same type) at that first-met type.
The new ODR algorithm might be: move all types into a separate artificial compile unit containing a type table.
In this case the type table can always be created in the same form, regardless of the type processing order.

As an additional bonus, type merging might save a noticeable amount of space.
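As a rough illustration of why such a type table can be deterministic, here is a minimal standalone C++ sketch (the names `buildTypeTable` and its data model are hypothetical, not part of this patch): collecting the unique type names and then sorting them yields the same emission order no matter in which order the worker threads discovered the types.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the artificial "__type_table" CU: collect the
// names of the types found by the worker threads (in arbitrary order),
// then sort and deduplicate, so the emission order -- and therefore
// every DIE offset inside the type table -- does not depend on the
// thread interleaving that discovered the types.
std::vector<std::string> buildTypeTable(std::vector<std::string> Found) {
  std::sort(Found.begin(), Found.end());
  Found.erase(std::unique(Found.begin(), Found.end()), Found.end());
  return Found; // sorted by name, one entry per unique type
}
```

The same idea extends to sorting by any stable key (mangled name, declaration file, etc.); name ordering is just the simplest choice.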

clayborg added a comment.EditedFeb 22 2021, 11:50 AM
In D96035#2579263, @avl wrote:

@JDevlieghere @aprantl @echristo @dblaikie @clayborg

What do you think of this patch? Is it OK in general, and is it worth dividing into smaller patches and integrating?

The original dsymutil (pre llvm version) that I wrote was deterministic, and so is the current one. That being said, I am not too worried about this because the quality of the resulting debug info is the main goal here.

My main concerns are:

  • The final .debug_str should only have one copy of each string. It should be possible to create individual string tables and merge them in the last loop you have that fixes up offsets, right?
  • I am unclear on how ODR uniquing works here without cross-talk between threads slowing everything down. Since we currently create new DW_FORM_ref_addr references to other already streamed-out CUs, how does that work in this patch?
  • I am unclear on what state the inter-CU references are in with this patch when they are in the same .o file from LTO. Do they work? Are you loading all CUs into memory? I would be OK with loading all CUs from the same file at one time.

I do like your idea of creating a new compile unit for types in order to make ODR better, as I have thought of this as well. A few extra ideas on this:

  • it might be best to just create type units on the fly instead of making fake compile units?
  • if we don't want to create/use type units then:
    • create a new DW_TAG_compile_unit whose source path is the DW_AT_decl_file for the type in question
    • Bonus points if we can add all types from the same DW_AT_decl_file to the same DW_TAG_compile_unit
    • Some class type declarations that use compiler-created default constructors, copy constructors, or assignment operators might omit these member functions in various compile units, depending on whether they were used by that compile unit. If we can augment the uniqued type definition to include these, that would be a huge win. Right now, when ODR finds type "class A" with no compiler-created constructor, it will re-emit a full type definition because DWARF needs to be able to reference it _if_ a previous copy did not include the needed method.

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

avl added a comment.Feb 23 2021, 11:53 AM

The final .debug_str should only have one copy of each string. It should be possible to create individual string tables and merge them in the last loop you have that fixes up offsets, right?

Right. The final .debug_str has only one copy of each string.
The implementation is a bit different, though:

While cloning a compile unit, DWARFLinker remembers the pointer to the string and a reference to the string attribute:

// create delayed task for setting proper offset value for the string
// attribute.
CU.noteOffsetData(It, *String);

After cloning is complete, DWARFLinker puts the remembered strings into the common StringPool, and the attributes are updated:

void DWARFLinker::assignDelayedOffsets(CompileUnit &CU) {
  std::unique_lock<std::mutex> Guard(StringPoolMutex);

  for (const CompileUnit::OffsetsData &Data : CU.getStringsOffsetData())
    setStringsOffset(Data.AttrToPatch, Data.String);
}

This way, the common resource TheOffsetsStringPool is locked only once per compile unit.
After all compile units are processed, the .debug_str section is generated from TheOffsetsStringPool,
which has only one copy of each string.
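A minimal standalone sketch of this "note now, patch later" batching pattern (all names here, such as `StringPool` and `assignDelayedOffsets`, are simplified stand-ins for the patch's machinery, not the actual interfaces):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Simplified model of .debug_str generation. While cloning, each CU
// records (attribute placeholder, string) pairs without any locking;
// afterwards the shared pool lock is taken exactly once per CU to
// intern the strings and patch the placeholders.
struct StringPool {
  std::mutex Mu;
  std::map<std::string, uint64_t> Offsets; // string -> .debug_str offset
  uint64_t NextOffset = 0;

  // Must be called with Mu held.
  uint64_t getOrInsert(const std::string &S) {
    auto It = Offsets.find(S);
    if (It != Offsets.end())
      return It->second; // duplicate string: reuse the existing offset
    uint64_t Off = NextOffset;
    Offsets.emplace(S, Off);
    NextOffset += S.size() + 1; // account for the NUL terminator
    return Off;
  }
};

struct PendingStringAttr {
  uint64_t *AttrToPatch; // offset field written during cloning
  std::string String;
};

// One lock acquisition per compile unit, not per string attribute.
void assignDelayedOffsets(StringPool &Pool,
                          std::vector<PendingStringAttr> &Pending) {
  std::unique_lock<std::mutex> Guard(Pool.Mu);
  for (PendingStringAttr &P : Pending)
    *P.AttrToPatch = Pool.getOrInsert(P.String);
}
```

The key property is that lock contention scales with the number of CUs rather than the number of string attributes.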

I am unclear on how ODR uniquing works here without cross-talk between threads slowing everything down. Since we currently create new DW_FORM_ref_addr references to other already streamed-out CUs, how does that work in this patch?

ODR uniquing does cross-talk between threads. It is implemented this way:

During the declaration-context analysis stage, DWARFLinker remembers the properties of each declaration context of the compile unit:

// In the multi-threaded case it is ineffective to lock and use
// the common resource ODRContexts.getChildDeclContext() for
// every type declaration context. Instead, we remember the
// DeclarationContextKey, and create the DeclarationContext and
// update the Info.Ctxt field later. That way we lock
// ODRContexts.getChildDeclContext() only once for each compile
// unit.

CU.noteContextData(Current.ContextKey, &Info);

Later, after all types are analyzed, the corresponding type declaration contexts are created in the common
ODRContexts pool.

void DWARFLinker::assignDelayedDeclContexts(CompileUnit &CU) {
  std::unique_lock<std::mutex> Guard(ContextsMutex);

  for (const CompileUnit::DeclContextKeyData &Data : CU.getDeclContextData())
    setDeclContext(Data.Key, Data.Info);
}

At that stage, declaration contexts (which are already in the ODRContexts pool) might be reused.

I am unclear on what state the inter-CU references are in with this patch when they are in the same .o file from LTO. Do they work? Are you loading all CUs into memory? I would be OK with loading all CUs from the same file at one time.

Inter-CU references are supported. This patch tries to load and then unload independent compile units, keeping loaded at one time only the inter-connected compile units.
Suppose we have the following dependencies:

CU1 references CU2, CU2 references CU4, CU4 references CU3, CU5(does not reference other CUs)

  1. First pass: CU1 loaded. It is seen that it needs CU2, so CU1 is left in memory. CU2 loaded. It is seen that it needs CU4, so CU2 is left in memory. CU3 loaded, processed, unloaded. CU4 loaded. It is seen that it needs CU3, so CU4 is left in memory. CU5 loaded, processed, unloaded.
  2. Second pass: CU1 processed. CU2 processed. CU3 loaded, processed. CU4 processed.

The reason why I try to avoid loading all CUs from the same file at one time is this utility: https://reviews.llvm.org/D86539.
An implementation that loads all CUs from the same file at one time would require a lot of memory when it optimizes a single binary (not a set of object files).
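The load/keep/reload decisions from the CU1..CU5 example can be sketched as follows. This is a hypothetical, simplified model (`planPasses`, `PassPlan`, and the integer CU ids are illustration only; the patch discovers dependencies while cloning rather than in a separate planning step):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// A CU that participates in an inter-CU reference stays loaded for the
// second pass; a referenced CU that was already processed and freed
// must be reloaded in the second pass.
struct PassPlan {
  std::set<int> KeptLoaded; // still in memory after pass 1
  std::set<int> Reloaded;   // must be loaded again in pass 2
};

// Order:  the sequence in which CUs are visited in pass 1.
// Deps:   Deps[CU] lists the CUs that CU references.
PassPlan planPasses(const std::vector<int> &Order,
                    const std::map<int, std::set<int>> &Deps) {
  PassPlan Plan;
  std::set<int> Unloaded, NeededLater;
  for (int CU : Order) {
    bool HasRefs = Deps.count(CU) && !Deps.at(CU).empty();
    if (HasRefs)
      for (int Target : Deps.at(CU)) {
        if (Unloaded.count(Target))
          Plan.Reloaded.insert(Target); // already freed: reload in pass 2
        else
          NeededLater.insert(Target);   // keep it when we reach it
      }
    if (HasRefs || NeededLater.count(CU))
      Plan.KeptLoaded.insert(CU);       // participates in inter-CU refs
    else
      Unloaded.insert(CU);              // self-contained: process and free
  }
  return Plan;
}
```

Running this on the example above (CU1→CU2, CU2→CU4, CU4→CU3, CU5 standalone) keeps CU1, CU2, and CU4 loaded and marks CU3 for reloading, matching the two-pass walkthrough.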

it might be best to just create type units on the fly instead of making fake compile units?

Type units could be created, but they require additional space, so we might lose some of the size improvement.

create a new DW_TAG_compile_unit whose source path is the DW_AT_decl_file for the type in question

I thought about a single artificial compile unit for all types, but yes, we might implement it this way.

Some type class declarations that use compiler default created constructors, copy constructor, assignment operators might omit these member functions in various different compile units if they were or were not used by that compile unit. If we can augment the uniqued type definition to include these, that would be a huge win.

Yes. That is the idea behind type merging: it might save a lot of space.

avl added a comment.Feb 23 2021, 11:59 AM

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic with --num-threads=1, and is NOT deterministic in the multi-threaded case?

Might changing the ODR uniquing algorithm be a solution then? (That is what I described in a previous message: do not reuse existing type declarations, but generate an artificial compile unit containing the types.) Such an artificial compile unit could be generated in a deterministic way.

In D96035#2582780, @avl wrote:

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic with --num-threads=1, and is NOT deterministic in the multi-threaded case?

Yes :)

avl added a comment.Feb 23 2021, 12:11 PM
In D96035#2582780, @avl wrote:

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic with --num-threads=1, and is NOT deterministic in the multi-threaded case?

Yes :)

I see. I tried to preserve the existing idea of ODR uniquing in this patch. To make it multi-threaded and deterministic, could we consider changing the current ODR uniquing algorithm?

In D96035#2582780, @avl wrote:

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic with --num-threads=1, and is NOT deterministic in the multi-threaded case?

Might changing the ODR uniquing algorithm be a solution then? (That is what I described in a previous message: do not reuse existing type declarations, but generate an artificial compile unit containing the types.) Such an artificial compile unit could be generated in a deterministic way.

I'm not sure why that would necessarily be better/faster - it'd still require two passes, right? One to collect the types, then another to emit the unit with the types and the units referencing that?

If it requires two passes, what about one pass that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones? That could then potentially produce the same kind of (& possibly exactly the same) output as is the case today, rather than introducing a new CU?

avl added a comment.Feb 23 2021, 2:00 PM

I'm not sure why that would necessarily be better/faster - it'd still require two passes, right? One to collect the types, then another to emit the unit with the types and the units referencing that?

If it requires two passes, what about one pass that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones? That could then potentially produce the same kind of (& possibly exactly the same) output as is the case today, rather than introducing a new CU?

I think I did not understand the idea. The current processing looks similar to the above description.
For a single compile unit, we do a declaration-context analysis step that decides which of the type definitions to keep in the unit that defined them, and which to remove/turn into a reference to the kept ones. Later we emit the body of the compile unit based on the results of that declaration-context analysis step.

When we decide which of the type definitions to keep in a unit, we use the ODR uniquing algorithm, which sets the first-met type definition as canonical and uses it later for type references from other compile units.

But we do not have a fixed order of compile units; they are processed in parallel. If both CU1 and CU2 have the same type definition, then depending on the actual order, the canonical type definition might end up in CU1 or in CU2. How could we avoid that non-determinism using an additional pass?

Speaking of the solution with an artificial CU, it has several advantages over the current ODR algorithm:

  1. Since DWARFLinker generates the artificial CU, it can generate it in a deterministic way. Regardless of the actual order in which types are processed, we can always generate them in the same form (sorted by name, for example).
  2. The current ODR algorithm makes CU processing depend on a common resource, the ODR declaration context, which slows down parallel execution (threads wait for each other). If we generated a separate artificial CU for types, then processing of the current CU could be done without waiting for the ODR declaration context, which might increase the effective CPU utilization ratio.
  3. Since we generate the types, it is possible to implement type merging, which might save a significant amount of space.
In D96035#2583116, @avl wrote:

I'm not sure why that would necessarily be better/faster - it'd still require two passes, right? One to collect the types, then another to emit the unit with the types and the units referencing that?

If it requires two passes, what about one pass that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones? That could then potentially produce the same kind of (& possibly exactly the same) output as is the case today, rather than introducing a new CU?

I think I did not understand the idea. The current processing looks similar to the above description.
For a single compile unit, we do a declaration-context analysis step that decides which of the type definitions to keep in the unit that defined them, and which to remove/turn into a reference to the kept ones. Later we emit the body of the compile unit based on the results of that declaration-context analysis step.

When we decide which of the type definitions to keep in a unit, we use the ODR uniquing algorithm, which sets the first-met type definition as canonical and uses it later for type references from other compile units.

But we do not have a fixed order of compile units; they are processed in parallel. If both CU1 and CU2 have the same type definition, then depending on the actual order, the canonical type definition might end up in CU1 or in CU2. How could we avoid that non-determinism using an additional pass?

Comparing with your proposal - avoiding nondeterminism by sorting the types by name in the new type-holding CU - we could do something similar, but instead sort the CUs a type appears in to pick which of those existing locations should be the canonical home (or, indeed, we could sort by the original linear visitation order).

eg: a multithreaded walk over each CU finds each type (this applies to both your proposal and what I'm suggesting here, I think) - then, rather than sorting by name and putting them in a new CU, sort by the "index" of the CU the type appeared in (where the index is the order the CUs would've been visited in by the current linear algorithm), then place/preserve the canonical type in the CU with the lowest index where the type appears?

Then the second pass goes and emits the CUs, consulting this list of type homes to determine whether this type should be emitted in the CU, or reference a copy emitted elsewhere.
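This "lowest CU index wins" selection can be sketched in a few lines of standalone C++ (`pickTypeHomes` and its data model are hypothetical, for illustration only):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of deterministic "type home" selection: worker
// threads report (type name, CU index) pairs in whatever order they run;
// the canonical home for each type is the lowest CU index it appears in,
// i.e. the CU the current linear algorithm would have visited first.
std::map<std::string, unsigned>
pickTypeHomes(const std::vector<std::pair<std::string, unsigned>> &Seen) {
  std::map<std::string, unsigned> Home;
  for (const auto &Entry : Seen) {
    auto It = Home.find(Entry.first);
    if (It == Home.end() || Entry.second < It->second)
      Home[Entry.first] = Entry.second; // keep the smallest CU index
  }
  return Home;
}
```

Because the minimum over a set does not depend on traversal order, the resulting map is the same regardless of thread scheduling, which is the determinism property being discussed.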

I think there might be merit to the approach you're suggesting too - but I'm less sure it's necessary to significantly alter the output scheme to achieve the benefits of parallelism.

(as for type merging - that might also be possible with the scheme I'm proposing - if we're rewriting DIEs anyway, seems plausible we could add new child DIEs to canonical type type DIEs)

One idea is for each type we want to unique, each CU as it is being parsed on its own thread creates a type map that maps "type compile unit path" (which is DW_AT_decl_file of that type) to a list of uniqued types with decl context info. This would allow all types with the same DW_AT_decl_file to be stored in the same type compile unit. As we emit the debug info for a normal compile unit, any type references are changed to be DW_AT_ref_addr references with values that will need to be fixed up. After all normal CUs have been emitted, combine all of the CU specific type maps together, and now emit the type compile units (carefully to include all needed children in each type to make the most complete version of the type possible and emit them) in a sorted order to allow determinism. Then we fix up all DW_AT_ref_addr references in the normal CUs to point to the one type definition from the type compile units.

Sorry if this is restating what has already been said/implied!

avl added a comment.Feb 24 2021, 5:04 AM

Comparing with your proposal - avoiding nondeterminism by sorting the types by name in the new type-holding CU - we could do something similar, but instead sort the CUs a type appears in to pick which of those existing locations should be the canonical home (or, indeed, we could sort by the original linear visitation order).

eg: a multithreaded walk over each CU finds each type (this applies to both your proposal and what I'm suggesting here, I think) - then, rather than sorting by name and putting them in a new CU, sort by the "index" of the CU the type appeared in (where the index is the order the CUs would've been visited in by the current linear algorithm), then place/preserve the canonical type in the CU with the lowest index where the type appears?

Then the second pass goes and emits the CUs, consulting this list of type homes to determine whether this type should be emitted in the CU, or reference a copy emitted elsewhere.

My understanding is that this approach assumes all CUs are loaded into memory, and it does an extra pass, i.e.:

  1. The first pass enumerates, in a multithreaded manner, all object files and all compile units, and creates an indexed map (the list of type homes). (As a result, all CUs from all object files are loaded into memory at the same time. The indexed map is also in memory.)
  2. The second pass enumerates, in a multithreaded manner, all object files and all compile units, and emits the bodies (consulting the list of type homes to determine whether a type should be emitted in the CU or reference a copy emitted elsewhere).
  3. Patch sizes/offsets/references after the individual CU bodies are glued into the resulting file.

The scheme implemented in this patch, which might also be used with an additional compile unit keeping the types, visits each CU only once, after which it can be unloaded from memory:

  1. The first pass enumerates, in a multithreaded manner, all object files and all compile units. Each CU is loaded, analyzed for types (a list of attributes referencing types is created), and emitted; the types are moved into the artificial CU; then the CU is unloaded.
  2. Emit the artificial CU. (After the first pass is finished, all CUs are unloaded from memory except the artificial one.)
  3. Patch sizes/offsets/references after the individual CU bodies are glued into the resulting file. (At this stage, type references from the already emitted CU bodies are patched to the proper offsets inside the artificial CU.)

This scheme does not need two passes over the compile units, and it does not need to load all CUs into memory at the same time.

(as for type merging - that might also be possible with the scheme I'm proposing - if we're rewriting DIEs anyway, seems plausible we could add new child DIEs to canonical type type DIEs)

That would be one more reason to keep CUs in memory: when we rewrite a canonical DIE in some CU, we would need to gather all children from all other CUs.

The artificial CU (from the scheme with a separate CU keeping the types) would contain the merged type, so it does not require keeping the source CUs.

avl added a comment.Feb 24 2021, 5:37 AM

One idea is for each type we want to unique, each CU as it is being parsed on its own thread creates a type map that maps "type compile unit path" (which is DW_AT_decl_file of that type) to a list of uniqued types with decl context info. This would allow all types with the same DW_AT_decl_file to be stored in the same type compile unit. As we emit the debug info for a normal compile unit, any type references are changed to be DW_AT_ref_addr references with values that will need to be fixed up. After all normal CUs have been emitted, combine all of the CU specific type maps together, and now emit the type compile units (carefully to include all needed children in each type to make the most complete version of the type possible and emit them) in a sorted order to allow determinism. Then we fix up all DW_AT_ref_addr references in the normal CUs to point to the one type definition from the type compile units.

Sorry if this is restating what has already been said/implied!

Yep. That is pretty much the same scenario that I talked about.

The difference is that you propose to create type units (DW_TAG_type_unit) for the types, while my suggestion is to use an artificial compile unit (which would save the space needed by type unit headers).

This would allow all types with the same DW_AT_decl_file to be stored in the same type compile unit.

Do you mean DW_TAG_type_unit or DW_TAG_compile_unit here?

It looks like we cannot put several types into the same type unit:

"A type unit entry owns debugging information entries that
represent the definition of a single type, plus additional debugging information
entries that may be necessary to include as part of the definition of the type."

Why would it be good to group types on a DW_AT_decl_file basis?

friss added a comment.Feb 24 2021, 8:21 AM
In D96035#2584686, @avl wrote:

The scheme implemented in this patch and which might be done with additional compile unit keeping types - visit CU only once and then it might be unloaded from the memory:

  1. the first pass enumerates in a multithreaded manner all object files, all compile units. Each CU is loaded, analyzed for types(there would be created a list of attribute referencing types), emitted. types moved into the artificial CU. unloaded.

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).
Conceptually, I think the idea of moving types to a "central" location is valid, it seems semantically equivalent to the kind of debug info we would get with -gmodules. Another kind of approach would be to gather enough information to create a dependency graph of CUs and then orchestrate parallel processing according to this. Even if this would limit the parallelism at the beginning of a link, I think it's very hard to intuitively figure out whether it would perform worse than the approach above.

avl added a comment.Feb 24 2021, 9:40 AM

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).

We might sort the types by name inside that artificial CU. In that case it would always be emitted in the same order.

In D96035#2585324, @avl wrote:

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).

We might sort the types by name inside that artificial CU. In that case it would always be emitted in the same order.

I don't think finding how to sort them is the complicated part. In my mind the issue is keeping them around until you can sort them as I'm not sure DIEs can exist not attached to a CU. Also, it's not just the one type, but the transitive closure of all its dependencies that you need to consider which makes things (much) more complicated. As I said before, I haven't worked on this in a long time, so feel free to shoot my concerns down!

In D96035#2585324, @avl wrote:

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).

We might sort the types by name inside that artificial CU. In that case it would always be emitted in the same order.

I don't think finding how to sort them is the complicated part. In my mind the issue is keeping them around until you can sort them as I'm not sure DIEs can exist not attached to a CU. Also, it's not just the one type, but the transitive closure of all its dependencies that you need to consider which makes things (much) more complicated. As I said before, I haven't worked on this in a long time, so feel free to shoot my concerns down!

I agree with Fred in that it might be too expensive to keep all of the CUs around so that we can later copy type data out of them at the end into uniqued compile units.

For testing DWARF, a while back I created a DWARF generator set of classes:

llvm/unittests/DebugInfo/DWARF/DwarfGenerator.h
llvm/unittests/DebugInfo/DWARF/DwarfGenerator.cpp

This allows the creation of stand alone DWARF using a new set of DWARF classes:

llvm::dwarfgen::DIE
llvm::dwarfgen::CompileUnit
llvm::dwarfgen::Generator

It might allow us to copy the types into a llvm::dwarfgen::Generator instance and might ease the ability to add and remove DIEs from parent DIEs, like adding new methods to classes and slowly adding new types to a compile unit DIE. Check the test files in llvm/unittests/DebugInfo/DWARF that include the DwarfGenerator.h file to see examples of tests that create DWARF using this generator. The nice thing is that all references to other DIEs are actual pointers to llvm::dwarfgen::DIE objects, so you can insert and remove DIEs as needed within a llvm::dwarfgen::CompileUnit, even across llvm::dwarfgen::CompileUnit objects if needed, and when you generate the DWARF, it all gets finalized at that point. So maybe we can use the DWARF generator for all types. When each compile unit is done being parsed on its thread, it can enter a type-copying phase where it makes sure all of the uniqued types it needs are copied into a single smart class that uses the DWARF generator. Then we can unload the compile units from the .o files right away when we are done.

In D96035#2584686, @avl wrote:

Comparing your proposal - avoiding nondeterminism by sorting the types by name in the new type-holding CU, we could do something similar, but instead sorting the CUs a type appears in to pick which of those existing locations to be the canonical home. (or, indeed, could sort by the original linear visitation order)
eg: a multithreaded walk over each CU finds each type (this applies to both your proposal and what I'm suggesting here, I think) - then, rather than sorting by name and putting them in a new CU, sort by the "index" of the CU the type appeared in (where the index is the order the CUs would've been visited in by the current linear algorithm), then place/preserve the canonical type in the CU with the lowest index where the type appears?
Then the second pass goes and emits the CUs, consulting this list of type homes to determine whether this type should be emitted in the CU, or reference a copy emitted elsewhere.

My understanding is that this approach assumes all CUs are loaded into memory, and it does an extra pass, i.e.:

Sounds like your proposal would require that too - or require reloading the CUs twice? I think either strategy (keeping them loaded, or reloading them) could be used for either of our proposed directions (creating a separate unit, or using the existing units)?

  1. the first pass enumerates, in a multithreaded manner, all object files and all compile units, and creates an indexed map (the list of type homes). (As a result, all CUs from all object files are loaded into memory at the same time; the indexed map is also in memory.)
  1. the second pass enumerates, in a multithreaded manner, all object files and all compile units, and emits CU bodies (consulting the list of type homes to determine whether a type should be emitted in the CU or reference a copy emitted elsewhere).
  1. patch sizes/offsets/references after the individual CU bodies are glued into the resulting file.

The scheme implemented in this patch, which might be extended with an additional compile unit keeping the types, visits each CU only once, after which it can be unloaded from memory:

  1. the first pass enumerates, in a multithreaded manner, all object files and all compile units. Each CU is loaded, analyzed for types (creating a list of attributes referencing types), emitted, its types moved into the artificial CU, and then unloaded.

"types moved into the artificial CU" means what exactly? Copying the type DIE tree into some intermediate, in-memory representation?

Ah, yes, though keeping them in memory may be expensive - it might be cheaper to rewalk the units/know the offsets to the DIEs and reload them in some way.

  1. emit the artificial CU (after the first pass finishes, all CUs are unloaded from memory except the artificial one).
  1. patch sizes/offsets/references after the individual CU bodies are glued into the resulting file. (At this stage, type references from already emitted CU bodies are patched to the proper offsets inside the artificial CU.)

This scheme does not need two passes, and it does not need to load all CUs into memory at the same time.

(as for type merging - that might also be possible with the scheme I'm proposing - if we're rewriting DIEs anyway, it seems plausible we could add new child DIEs to canonical type DIEs)

That would be one more reason to keep CUs in memory: when we rewrite a canonical DIE in some CU, we would need to gather all its children from all the other CUs.

The artificial CU (from the scheme with a separate CU keeping types) would contain the merged types, so it does not require keeping the source CUs.

Yes, if we're merging types it becomes somewhat more complicated - again, we could do two passes (without necessarily keeping all CUs in memory): quickly skim the CUs and, for each type, record the type name and member list - then merge those lists, keeping the type record for the lowest-indexed CU and a merged member list that mentions which offset/CU each member comes from. This is essentially the same data you'd have to keep in memory (either fully in memory, or as some kind of skeleton data that refers to the CUs/input file itself, so it can be reparsed as needed) for the "emit a separate unit full of types at the end" approach.

And in any case, you couldn't emit the units in parallel in the first pass, because you wouldn't know which offsets to write them to, right? (because the size of units will be changing during this process)

So I'm not sure where the parallelism comes into your scheme (and even in my scheme it'd be a bit non-trivial - I guess you'd have to record all the DIE references in each unit that might become cross-unit references (so you know you'll have to change their encoded size), and all the types (so you know whether that'll get bigger (if members are merged in) or smaller (if the type is removed in favor of referencing a type in another unit))) - not sure there's a huge difference in performance/complexity between the two, perhaps.

avl added a comment.Feb 25 2021, 1:34 AM

I agree with Fred that it might be too expensive to keep all of the CUs around so that we can later copy type data out of them at the end into uniqued compile units.

It might allow us to copy the types into an llvm::dwarfgen::Generator instance and might ease adding and removing DIEs on parent DIEs, like adding new methods to classes and gradually adding new types to a compile unit DIE.

When each compile unit is done being parsed on its thread, it can enter a type-copying phase where it makes sure all of the uniqued types it needs are copied into a single smart class that uses the DWARF generator. Then we can unload the compile units from the .o files right away when we are done.

Agreed. That is exactly the solution I am talking about. I have not yet examined whether llvm::dwarfgen::Generator is suitable for that, but I would be happy to use it if it is.

avl added a comment.Feb 25 2021, 2:27 AM

Sounds like your proposal would require that too - or require reloading the CUs twice? I think either strategy (keeping them loaded, or reloading them) could be used for either of our proposed directions (creating a separate unit, or using the existing units)?

No. My solution does not require keeping all CUs in memory or reloading them twice (for the purpose of ODR type deduplication).

It loads a CU, analyzes its types, creates a list of type references, and removes the types; then it emits the DWARF for that CU and unloads it.

It does not need to load that CU again or keep the input DWARF of that CU in memory.
It only needs to fix up the remembered type references in the generated DWARF after all CUs are processed and the artificial CU is built.
It keeps all the types in memory, but that requires less space than all the CUs.

"types moved into the artificial CU" means what exactly? Copying them type DIE tree into some intermediate, in-memory representation?

yes.

Ah, yes, though keeping them in memory may be expensive - it might be cheaper to rewalk the units/know the offsets to the DIEs and reload them in some way.

I think it would not be cheaper to load a CU from disk again, but we can experiment and select the more effective solution; i.e., the solution with an artificial CU gives us a choice.

If we implemented a solution that loads all CUs into memory or reloads them from disk, then we would not have that choice.

Yes, if we're merging types it becomes somewhat more complicated - again, could do a two pass (without necessarily keeping all CUs in memory) - go quickly skim the CUs, for each type, record the type name and member list - then merge those lists, keeping the type record for the lowest indexed CU, and a merged member list that mentions which offset/CU each member comes from. This is essentially the same data you'd have to keep in memory (either fully in memory, or some kind of skeleton data that refers to the CUs/input file itself, so it can be reparsed as needed) for the "emit a separate unit full of types at the end" approach.
And in any case, you couldn't emit the units in parallel in the first pass, because you wouldn't know which offsets to write them to, right? (because the size of units will be changing during this process)

Not exactly. This prototype uses exactly the one-pass scheme (which loads/parses/handles/unloads each CU only once). It processes units in parallel, generating the resulting DWARF and the list of references that should be patched. After all CUs are processed and the final sizes are known, an additional pass writes the correct offsets. That additional pass is _much_ cheaper than loading/parsing/handling/unloading the CUs again.

So I'm not sure where the parallelism comes into your scheme (& even in my scheme it'd be a bit non-trivial - I guess you'd have to record all the DIE references in each unit that might become cross-unit references (so you know you'll have to change their encoded size) and all the types (so you know whether that'll get bigger (if members are merged in) or smaller (if the type is removed in favor of referencing a type in another unit) - not sure there's a huge difference in performance/complexity between the two, perhaps.

Right. That scheme (remember cross-unit references and patch them later) is the one used by this patch (except that it does not do type merging and does not have a separate CU keeping types). This scheme allows processing CUs separately, in parallel, and gives us more than a 2x performance improvement. It might be improved further (theoretically, if the inter-thread dependencies were decreased).

In D96035#2587154, @avl wrote:

Sounds like your proposal would require that too - or require reloading the CUs twice? I think either strategy (keeping them loaded, or reloading them) could be used for either of our proposed directions (creating a separate unit, or using the existing units)?

No. My solution does not require keeping all CUs in memory or reloading them twice (for the purpose of ODR type deduplication).

It loads a CU, analyzes its types, creates a list of type references, and removes the types; then it emits the DWARF for that CU and unloads it.

It does not need to load that CU again or keep the input DWARF of that CU in memory.
It only needs to fix up the remembered type references in the generated DWARF after all CUs are processed and the artificial CU is built.
It keeps all the types in memory, but that requires less space than all the CUs.

Yeah, having to keep the types in memory is the bit I'm getting at - but yes, it's not all CUs. If they're being unloaded/reloaded though, it might be less clear whether it's so costly to preserve the existing output behavior. Though I don't personally have a problem with creating a synthetic/arbitrary CU to put all the types in either. But other folks (Apple folks with a vested interest in the dsymutil behavior) might disagree.

"types moved into the artificial CU" means what exactly? Copying them type DIE tree into some intermediate, in-memory representation?

yes.

Ah, yes, though keeping them in memory may be expensive - it might be cheaper to rewalk the units/know the offsets to the DIEs and reload them in some way.

I think it would not be cheaper to load a CU from disk again, but we can experiment and select the more effective solution; i.e., the solution with an artificial CU gives us a choice.

If we implemented a solution that loads all CUs into memory or reloads them from disk, then we would not have that choice.

Yes, if we're merging types it becomes somewhat more complicated - again, could do a two pass (without necessarily keeping all CUs in memory) - go quickly skim the CUs, for each type, record the type name and member list - then merge those lists, keeping the type record for the lowest indexed CU, and a merged member list that mentions which offset/CU each member comes from. This is essentially the same data you'd have to keep in memory (either fully in memory, or some kind of skeleton data that refers to the CUs/input file itself, so it can be reparsed as needed) for the "emit a separate unit full of types at the end" approach.
And in any case, you couldn't emit the units in parallel in the first pass, because you wouldn't know which offsets to write them to, right? (because the size of units will be changing during this process)

Not exactly. This prototype uses exactly the one-pass scheme (which loads/parses/handles/unloads each CU only once). It processes units in parallel, generating the resulting DWARF

By "generating resulting DWARF" I guess that has to be buffered in memory? (how could two CUs be written to the final output file at the same time? You wouldn't know where to write to because you wouldn't know how big the previous unit would be if you haven't finished processing it)

avl added a comment.Mar 5 2021, 1:40 PM

Yeah, having to keep the types in memory is the bit I'm getting at - but yes, it's not all CUs. If they're being unloaded/reloaded though, it might be less clear whether it's so costly to preserve the existing output behavior. Though I don't personally have a problem with creating a synthetic/arbitrary CU to put all the types in either. But other folks (Apple folks with a vested interest in the dsymutil behavior) might disagree.

I see, thanks. I am inclined to implement that behavior (creating a synthetic/arbitrary CU to put all the types in), since I believe that would be the most effective way
(practice might show whether it is). It would be good to know whether that solution is acceptable before starting the implementation. @JDevlieghere @aprantl @friss Would you find an implementation that moves types into a separate synthetic/arbitrary CU useful and worth integrating into upstream dsymutil (if it showed good performance/size results once implemented)?

By "generating resulting DWARF" I guess that has to be buffered in memory? (how could two CUs be written to the final output file at the same time? You wouldn't know where to write to because you wouldn't know how big the previous unit would be if you haven't finished processing it)

Right. The DWARF for each CU is buffered in memory, and after all CUs are processed, the buffers are glued together into the final resulting DWARF file. That is exactly how this implementation works.

By "generating resulting DWARF" I guess that has to be buffered in memory? (how could two CUs be written to the final output file at the same time? You wouldn't know where to write to because you wouldn't know how big the previous unit would be if you haven't finished processing it)

Right. The DWARF for each CU is buffered in memory, and after all CUs are processed, the buffers are glued together into the final resulting DWARF file. That is exactly how this implementation works.

That seems quite expensive :s

"how this implementation works" - the patch in this review? The existing checked-in implementation? (apologies as I'm not especially familiar with the implementation of dsymutil as it is today)

avl added a comment.Mar 5 2021, 2:19 PM

That seems quite expensive :s

Yep, it requires additional space, though it brings benefits:

  • it allows separate, parallel processing (this patch shows more than a 2x performance improvement).
  • plain DWARF requires less space than the parsed/generated representation (DWARFDie/DIE).

"how this implementation works" - the patch in this review? The existing checked-in implementation? (apologies as I'm not especially familiar with the implementation of dsymutil as it is today)

Yes, the patch in this review.