This is an archive of the discontinued LLVM Phabricator instance.

[dsymutil][DWARFlinker] implement separate multi-thread processing for compile units.
Accepted · Public

Authored by avl on Feb 4 2021, 7:11 AM.

Details

Summary

The idea of this refactoring is to speed up DWARFLinker by utilizing parallel execution and minimizing memory usage (in the case where the object file contains multiple compilation units). This patch creates a new library, DWARFLinkerNext, which is called from standard dsymutil using the --use-dlnext command-line option. This is the second version of the patch; it additionally does type merging and produces a predictable result (the generated .dSYM file is the same between similar runs). The best configuration produces a .debug_info table that is 68% smaller and works two times faster than the current upstream dsymutil.

The overall linking process looks like this:

parallel_for_each(ObjectFile) {
  for_each (Compile Unit) {
    1. load Clang modules.
  }

  parallel_for_each(Compile Unit) {
    1. load input DWARF for Compile Unit.
    2. report warnings for Clang modules.
    3. analyze live DIEs.
    4. analyze types if ODR deduplication is requested.
    5. clone DIEs (generate output DIEs and resulting DWARF tables).
       The result is in OutDebugInfoBytes, which is an ELF file containing
       DWARF tables corresponding to the current compile unit.
    6. cleanup input and output DIEs.
  }

  deallocate loaded Object file.
}

if (ODR deduplication is requested)
  generate an artificial compilation unit "Type Table" (it uses DIEs partially generated at the clone stage).

for_each (ObjectFile) {
  for_each (Compile Unit) {
    1. set offsets to Compile Units DWARF tables.
    2. sort offsets/attributes/patches to have a predictable result.
    3. patch size/offsets fields.
    4. generate index tables.
    5. move DWARF tables of compile units into the resulting file.
  }
}

i.e. every compile unit is processed separately and visited only once
(except when inter-CU references exist), and its data is freed
after the compile unit is processed. The resulting file is glued together
from the generated debug tables corresponding to the separate compile units.
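
As a rough illustration of the first loop above (not the patch's actual code; CompileUnitHandle, processCU, releaseCU, and finishCU are made-up names), the per-CU stage could be driven with llvm::parallelForEach from llvm/Support/Parallel.h:

#include "llvm/Support/Parallel.h"
#include <vector>

// Hypothetical stand-in for the patch's real compile unit class.
struct CompileUnitHandle {
  bool HasInterCUReferences = false; // discovered while cloning DIEs
};

void processCU(CompileUnitHandle &CU) { /* load DWARF, analyze liveness, clone DIEs */ }
void releaseCU(CompileUnitHandle &CU) { /* free input and output DIEs */ }
void finishCU(CompileUnitHandle &CU)  { /* resolve cross-CU references, then free */ }

void linkObjectFile(std::vector<CompileUnitHandle> &CUs) {
  // First pass: every compile unit is handled independently on the thread pool.
  llvm::parallelForEach(CUs, [](CompileUnitHandle &CU) {
    processCU(CU);
    if (!CU.HasInterCUReferences)
      releaseCU(CU); // independent CUs are freed immediately
  });
  // Second pass (sequential here): the inter-connected CUs that were kept
  // loaded can now be finished, since the units they reference are available.
  for (CompileUnitHandle &CU : CUs)
    if (CU.HasInterCUReferences)
      finishCU(CU);
}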

Handling inter-CU references: inter-CU references are hard to process
using only one pass. E.g. if CU1 references CU100 and CU100 references
CU1, we cannot finish handling CU1 until we have finished CU100.
Thus we either need to load all CUs into memory, or load CUs several times.
This patch discovers dependencies during the first pass, so that dependent compile
units stay loaded in memory. Then, during the second pass, it handles the
inter-connected compile units.

That approach works well when the number of inter-CU references is low. It avoids loading all CUs into memory at once.

Changes from the current implementation (making DWARFLinkerNext binary incompatible with the current DWARFLinker):

a) No common abbreviation table. Each compile unit has
   its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). The abbreviation table does not take a lot of space,
   so it looks cheap to have separate abbreviation tables.
   Later, it might be optimized a bit by removing equal abbreviation tables (see the sketch after this list).
   
b) .debug_frame: already generated CIE records are not reused between object files.

c) ODR type deduplication works using a different approach than the current dsymutil. All types are moved into an artificial compilation unit. References to the types are changed so that they point to the types in the artificial compilation unit. Declarations and partial declarations of the same type are merged into a single definition/declaration located in the artificial compilation unit.
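
Regarding item (a) above, the follow-up optimization of removing equal abbreviation tables could be as simple as keying already-emitted per-CU tables by their raw bytes. A minimal sketch with assumed names (not part of the patch):

#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: reuse an identical, already-emitted per-CU
// abbreviation table instead of writing a duplicate copy to .debug_abbrev.
class AbbrevTableDeduper {
  std::map<std::string, uint64_t> OffsetByContents; // raw table bytes -> offset
public:
  // Returns the .debug_abbrev offset the unit header should point at and
  // reports whether the table still needs to be emitted.
  uint64_t getOrEmit(const std::string &TableBytes, uint64_t NextFreeOffset,
                     bool &NeedsEmission) {
    auto It = OffsetByContents.find(TableBytes);
    if (It != OffsetByContents.end()) {
      NeedsEmission = false; // identical table is already in the output
      return It->second;
    }
    NeedsEmission = true;
    OffsetByContents.emplace(TableBytes, NextFreeOffset);
    return NextFreeOffset;
  }
};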

Some DWARF features are not supported yet (though they could be): DWARF64, split DWARF, DWARF 5.

Performance results for this patch for the clang binary (Darwin, 24 cores, 64 GB):

  1. clang binary, ODR deduplication is ON:
|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  | DWARF(*)|exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   161   |   16.5   |   486   |   204   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   2   |   101   |   17.9   |   486   |   118   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   4   |   101   |   17.9   |   486   |    75   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   8   |   101   |   17.9   |   486   |    51   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|  16   |   101   |   17.9   |   486   |    42   |   12.2   |   157   |
|---------------------------------------------------------------------|

(*) DWARF is the size of .debug_info section.

  2. clang binary, ODR deduplication is OFF (--no-odr):
|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |  DWARF  |exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   227   |   16.3   |   1460  |   227   |   16.2   |   1470  |
|-------|------------------------------|------------------------------|
|   2   |   218   |   17.9   |   1460  |   132   |   16.2   |   1470  |
|-------|------------------------------|------------------------------|
|   4   |   218   |   17.9   |   1460  |    83   |   16.5   |   1470  |
|-------|------------------------------|------------------------------|
|   8   |   218   |   17.9   |   1460  |    58   |   16.5   |   1470  |
|-------|------------------------------|------------------------------|
|  16   |   218   |   17.9   |   1460  |    47   |   16.9   |   1470  |
|---------------------------------------------------------------------|

Effective CPU utilization ratio for that implementation is 52%.

There is still room for improvement: run-time performance might be improved by a further 10-20%, and run-time memory usage might be decreased by up to 10%.

Diff Detail

Event Timeline

avl created this revision.Feb 4 2021, 7:11 AM
avl requested review of this revision.Feb 4 2021, 7:11 AM
Herald added a project: Restricted Project.
avl edited the summary of this revision. (Show Details)Feb 4 2021, 7:16 AM

Handling inter-CU references: inter-CU references are hard to process using only one pass. E.g. if CU1 references CU100 and CU100 references CU1, we cannot finish handling CU1 until we have finished CU100. Thus we either need to load all CUs into memory, or load CUs several times.

What kind of inter-CU references do you see? There really shouldn't be any unless they are all in the same .o file and were produced by an LTO pass. Are there any inter CU references that I am forgetting about?

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). The abbreviation table does not take a lot of space, so it looks cheap to have separate abbreviation tables. Later, it might be optimized a bit (by removing equal abbreviation tables).

Can't the linking process create one large uniqued abbrev table as it links each .o file's DWARF into the final output? This info isn't that large, it is true, but it'd be nice to have just one table as long as we are optimizing the DWARF.

b) Removed check for ambiguous type declarations. The existing check does not solve the problem in general (it does not work if there are ambiguous types in different compile units).

I am assuming you didn't pull out ODR type uniquing all together right? That is very important and one of the main reasons for using dsymutil. There are many things that can make this ODR type uniquing fail to work correctly, but those are bugs that should be fixed. One of the main things is if a class or struct has a forward declaration to another class inside of it. If this is the case the llvm dsymutil doesn't try and see if everyone has the same full class definition with a forward decl on it, it just punts and emits another copy of the type. I saw this when I would have a "FILE" type from the unix fopen(...) function. If multiple files had this type in a C++ source file, they wouldn't get uniqued because one of the members had a forward declaration. The reason it doesn't get uniqued is dsymutil worries someone might have a "FILE" type info in the DWARF that _does_ have a full definition for that forward declaration. A lot of stuff the compiler emits these days also causes many of the types to not be correctly removed even though they aren't used.

So there are many C++ compiler things that seem to be keeping things alive in the DWARF output by dsymutil these days that used to not be there before. When ODR works correctly, the first instance of a type gets emitted from the first object file that has it, and then all other .o files that have this same type remove it and have all CU references within changed to a DW_FORM_ref_addr inter-CU reference. We should make sure that this isn't happening to many clang types and causing us to not unique types.

When the type uniquing works correctly it really reduces the output DWARF size. Back at Apple the WebCore.framework DWARF debug info went from 1.8GB down to 450MB.

As an example of C++ compiler type bloat, you can see this if you do "#include <cstdio>". If you compile this with a recent clang:

#include <stdio.h>

int main() {
    printf("hello\n");
    return 0;
}

The DWARF looks very small:

0x0000000b: DW_TAG_compile_unit
              DW_AT_producer	("Apple clang version 12.0.0 (clang-1200.0.32.29)")
              DW_AT_language	(DW_LANG_C_plus_plus_11)
              DW_AT_name	("main.cpp")
              DW_AT_LLVM_sysroot	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk")
              DW_AT_APPLE_sdk	("MacOSX.sdk")
              DW_AT_stmt_list	(0x00000000)
              DW_AT_comp_dir	("/Users/gclayton/Documents/src/args")
              DW_AT_low_pc	(0x0000000100003f60)
              DW_AT_high_pc	(0x0000000100003f8a)

0x00000032:   DW_TAG_subprogram
                DW_AT_low_pc	(0x0000000100003f60)
                DW_AT_high_pc	(0x0000000100003f8a)
                DW_AT_frame_base	(DW_OP_reg6 RBP)
                DW_AT_name	("main")
                DW_AT_decl_file	("/Users/gclayton/Documents/src/args/main.cpp")
                DW_AT_decl_line	(4)
                DW_AT_type	(0x000000000000004b "int")
                DW_AT_external	(true)

0x0000004b:   DW_TAG_base_type
                DW_AT_name	("int")
                DW_AT_encoding	(DW_ATE_signed)
                DW_AT_byte_size	(0x04)

0x00000052:   NULL

But if you include the "cstdio" file instead:

#include <cstdio>

int main() {
    printf("hello\n");
    return 0;
}

This ends up having tons of extra type stuff that really bloats the DWARF. The C++ header ends up with full definitions for every function in stdio.h and data types (like "FILE"!) for functions that are not used in the .o file. In this case the C++ code adds DW_TAG_imported_declaration entries for all these types, and that ends up keeping all of these extra types alive, since something refers to them, and way too much stuff ends up in the final dSYM. We can see the DW_TAG_imported_declaration below:

0x00000032:   DW_TAG_namespace
                DW_AT_name	("std")

0x00000037:     DW_TAG_namespace
                  DW_AT_name	("__1")
                  DW_AT_export_symbols	(true)

0x0000003c:       DW_TAG_imported_declaration
                    DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cstdio")
                    DW_AT_decl_line	(109)
                    DW_AT_import	(0x0000000000000195)

And later we see:

0x00000193:       NULL

0x00000194:     NULL

0x00000195:   DW_TAG_typedef
                DW_AT_type	(0x00000000000001a0 "__sFILE")
                DW_AT_name	("FILE")
                DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                DW_AT_decl_line	(157)

0x000001a0:   DW_TAG_structure_type
                DW_AT_calling_convention	(DW_CC_pass_by_value)
                DW_AT_name	("__sFILE")
                DW_AT_byte_size	(0x98)
                DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                DW_AT_decl_line	(126)

0x000001a9:     DW_TAG_member
                  DW_AT_name	("_p")
                  DW_AT_type	(0x000000000000029a "unsigned char*")
                  DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/_stdio.h")
                  DW_AT_decl_line	(127)
                  DW_AT_data_member_location	(0x00)
....

c) .debug_frame. Already generated CIE records are not reused between object files

Can't we unique these when the .o file gets linked on the main thread?

d) In multi-thread mode there is no predictable order in which compile units/DIEs are processed. That leads to selecting "canonical" DIEs, or adding strings into the debug_str table, in some unpredictable order. It is semantically correct but creates non-matching references:

DW_AT_name [DW_FORM_strp]     ( .debug_str[0x00000095] = "AnonC")   
DW_AT_name [DW_FORM_strp]     ( .debug_str[0x0000007f] = "AnonC")

DW_AT_type [DW_FORM_ref_addr] (0x000000000000014f "U")
DW_AT_type [DW_FORM_ref_addr] (0x0000000000000170 "U")

Yes, this will happen, and it is OK as long as "--num-threads 1" allows consistent builds. Any clients that require dSYM files to be the same can use this slower method.

e) .debug_names is not supported by this patch.

Any reason why this can't be done? I believe that dsymutil just generates the .debug_names (or Apple accelerator tables) after you have written all of the DWARF out by traversing it no? If this doesn't happen, it can be added with the --update option to dsymutil, but again, this takes time.

One of the main benefits of running dsymutil is the accelerator tables that speed up debugging. If you are solely looking for a way to reduce link times, then this succeeds. But if there are no accelerator tables, each time you debug this the debugger is going to have to manually index all of the DWARF and cause long delays each time you debug something. You might as well just not run the dsymutil and leave the DWARF in the .o files as that is the workflow that is designed to be used during the edit/compile/debug workflow.

I haven't looked at the code itself yet, just wanted to get some more overall info on the approach.

Snipping a fragment here:

A lot of stuff the compiler emits these days also causes many of the types to not be correctly removed even though they aren't used.
So there are many C++ compiler things that seem to be keeping things alive in the DWARF output by dsymutil these days that used to not be there before.

Depends on your definition of "used" I guess, and I don't think it's
much related to dsymutil. At least the example you give is
specifically about imported_declarations - I implemented those (... 7
years ago https://github.com/llvm/llvm-project/commit/bd4837665b2ee1e77dc9337e11ee2aed52a00a93
) & yeah, they're not great. Specifically the compiler doesn't track
if they are used for name lookup, so we don't have a way to only emit
those that were used - and even if we did emit only the used ones,
they aren't used in the DWARF sense (if you have a "using namespace
x;" in code - references to names in namespace 'x' don't refer to that
using declaration in the DWARF (unlike a typedef, or a using
declaration "using x = y;")) so you wouldn't be able to strategically
drop them in dsymutil. (eg: if a certain using directive was only used
by one function body, but that function was optimized away or not
linked - there's no chain of liveness between that function and the
using directive - no way to know which using directives should be kept
around).

& even using declarations for functions (unlike for types) - again, no
chain of connection. So if you call a function via a using declaration
("void f1(); using f2 = f1; void f3() { f2(); }") there's nothing in
the DWARF that says f2 is used by f3 - with call_site debug info we
could describe the call-site, but it wouldn't cover all references to
f2 (indirect calls, for instance).

So, yeah, compiler improvements to track the used-ness of using
declarations and directives would be nice. Oh, but even if we knew
they were used, we'd have to know /where/ they were used from (eg: if
they were used in an inline function that was never called, that
wouldn't be much use to us/we wouldn't want to emit them in the
resulting DWARF).

But even if we did have that information, unfortunately I'm not sure
it'd help dsymutil greatly - because currently those entities don't
always form chains of ownership within the DWARF, so you just sort of
have to keep carrying them.

Though they'd still benefit from all the usual type deduplication
stuff - so this is probably orthogonal to the discussion.

  • Dave
avl added a comment.Feb 5 2021, 7:38 AM

Handling inter-CU references: inter-CU references are hard to process using only one pass. E.g. if CU1 references CU100 and CU100 references CU1, we cannot finish handling CU1 until we have finished CU100. Thus we either need to load all CUs into memory, or load CUs several times.

What kind of inter-CU references do you see? There really shouldn't be any unless they are all in the same .o file and were produced by an LTO pass. Are there any inter CU references that I am forgetting about?

yes. That is the case: they are all in the same .o file and were produced by an LTO pass.

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). The abbreviation table does not take a lot of space, so it looks cheap to have separate abbreviation tables. Later, it might be optimized a bit (by removing equal abbreviation tables).

Can't the linking process create one large uniqued abbrev table as it links each .o file's DWARF into the final output? This info isn't that large, it is true, but it'd be nice to have just one table as long as we are optimizing the DWARF.

If we try to generate it while cloning DIEs, it leads to slow execution because threads are waiting for each other
(all threads need access to the global abbreviation table for every DIE, so execution time is spent on waiting).

If we try to implement it as a post-processing step (when all compile units are processed), it would also require too much additional time. For each DIE we need to add its abbreviation into the global abbreviation table. This might change the abbreviation code, and since the abbreviation code is a ULEB128, the size of the DIE might change. That means we need to update all reference attributes, which would take a significant amount of time. Technically it might be done, but it does not look practical.

Probably, it might easily be done for the "--num-threads 1" case only.
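
To illustrate why the post-processing variant is costly (generic ULEB128 arithmetic, not code from this patch): once a merged global table renumbers an abbreviation code across a 7-bit boundary, every DIE that uses it changes size, which invalidates every following offset in the unit.

#include <cassert>
#include <cstdint>

// Number of bytes needed to encode Value as ULEB128 (7 payload bits per byte).
static unsigned getULEB128Size(uint64_t Value) {
  unsigned Size = 0;
  do {
    Value >>= 7;
    ++Size;
  } while (Value != 0);
  return Size;
}

int main() {
  // A per-CU abbreviation code of 100 is encoded in one byte...
  assert(getULEB128Size(100) == 1);
  // ...but if merging renumbers it to 200, each DIE using that abbreviation
  // grows by one byte, so all later DIE offsets and reference attributes in
  // the unit would have to be re-patched.
  assert(getULEB128Size(200) == 2);
  return 0;
}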

b) Removed check for ambiguous type declarations. The existing check does not solve the problem in general (it does not work if there are ambiguous types in different compile units).

I am assuming you didn't pull out ODR type uniquing all together right? That is very important and one of the main reasons for using dsymutil.

Right. This patch does mostly the same ODR deduplication as the current dsymutil.
There are small differences:

  1. In multi-thread mode, some canonical DIEs might not be seen by some threads.

    If requests for setting a canonical DIE come from several threads at the same time, some threads might receive the information that there is no canonical DIE yet (though there actually is one; it would be set by another thread). As a result, some types would not be de-duplicated. But the number of such cases is small; for clang, there is no noticeable difference in the size of the binary.
  2. Ambiguous types. That is what is currently done by this code: https://github.com/llvm/llvm-project/blob/main/llvm/lib/DWARFLinker/DWARFLinkerDeclContext.cpp#L19

    i.e. when dsymutil cannot distinguish a function type because of parameterized arguments, it marks the declaration context as invalid. The current workaround does not solve the problem fully. A similar approach might be implemented (with the same limitations) by this patch, but we probably want to think of a better solution.

There are many things that can make this ODR type uniquing fail to work correctly, but those are bugs that should be fixed. One of the main things is if a class or struct has a forward declaration to another class inside of it. If this is the case the llvm dsymutil doesn't try and see if everyone has the same full class definition with a forward decl on it, it just punts and emits another copy of the type. I saw this when I would have a "FILE" type from the unix fopen(...) function. If multiple files had this type in a C++ source file, they wouldn't get uniqued because one of the members had a forward declaration. The reason it doesn't get uniqued is dsymutil worries someone might have a "FILE" type info in the DWARF that _does_ have a full definition for that forward declaration. A lot of stuff the compiler emits these days also causes many of the types to not be correctly removed even though they aren't used.

So there are many C++ compiler things that seem to be keeping things alive in the DWARF output by dsymutil these days that used to not be there before. When ODR works correctly, the first instance of a type gets emitted from the first object file that has it, and then all other .o files that have this same type remove it and have all CU references within changed to a DW_FORM_ref_addr inter-CU reference. We should make sure that this isn't happening to many clang types and causing us to not unique types.

When the type uniquing works correctly it really reduces the output DWARF size. Back at Apple the WebCore.framework DWARF debug info went from 1.8GB down to 450MB.

As an example of C++ compiler type bloat, you can see this if you do "#include <cstdio>" instead of "#include <stdio.h>": the C++ header ends up with full definitions for every function in stdio.h and data types (like "FILE"!) for functions that are not used in the .o file. The C++ code adds DW_TAG_imported_declaration entries for all these types, which keeps all of these extra types alive and ends up with way too much stuff in the final dSYM (see the DWARF dumps quoted in full earlier in this thread).

Regarding the problem of imported declarations: this patch does the same as the current dsymutil.
A solution for that problem would be nice to have, though it is out of the scope of this patch.

c) .debug_frame. Already generated CIE records are not reused between object files

Can't we unique these when the .o file gets linked on the main thread?

We can, but then we would need to re-parse the generated .debug_frame.
If that would not take a lot of time, we might try it.
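
If CIE reuse across object files were added, one plausible shape (a sketch under assumed names, not what this patch does) is to key already-emitted CIEs by their raw bytes so FDEs from different object files can share one CIE:

#include <cstdint>
#include <map>
#include <string>

// Hypothetical pool that uniques CIE records in the output .debug_frame by
// their exact byte contents.
class CIEPool {
  std::map<std::string, uint64_t> OffsetByBytes; // CIE bytes -> output offset
  std::string OutDebugFrame;                     // output .debug_frame so far
public:
  // Returns the output offset of an equivalent CIE, emitting it on first use.
  // FDEs appended afterwards must have their CIE pointer patched to this value.
  uint64_t getOrEmitCIE(const std::string &CIEBytes) {
    auto It = OffsetByBytes.find(CIEBytes);
    if (It != OffsetByBytes.end())
      return It->second;
    uint64_t Offset = OutDebugFrame.size();
    OutDebugFrame += CIEBytes;
    OffsetByBytes.emplace(CIEBytes, Offset);
    return Offset;
  }
};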

e) .debug_names is not supported by this patch.

Any reason why this can't be done? I believe that dsymutil just generates the .debug_names (or Apple accelerator tables) after you have written all of the DWARF out by traversing it no? If this doesn't happen, it can be added with the --update option to dsymutil, but again, this takes time.

There is no fundamental reason; I think it can be supported.
I just need more time to check that code and understand how to implement it properly.
I will update this patch with an implementation later.

avl added a comment.Feb 22 2021, 9:55 AM

@JDevlieghere @aprantl @echristo @dblaikie @clayborg

What do you think of this patch? Is it OK in general, and is it worth dividing it into smaller patches and integrating them?

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

I believe lld's multithreading still produces deterministic output, for instance.

avl added a comment.Feb 22 2021, 11:12 AM

Yep, this implementation is non-deterministic.
I tried to preserve the current ODR algorithm (which relies on the concrete order in which types are encountered).

I thought about the following improvement (though not for this patch),
which should be deterministic even in multi-thread mode:

Implement artificial type table with type merging:

Source DWARF:

0x0000000b: DW_TAG_compile_unit 
              DW_AT_name = "module1.cpp"
                            
0x0000002e: DW_TAG_structure_type 
              DW_AT_name = "S"

              NULL

0x0000008c:   DW_TAG_subprogram 

0x000000aa:     DW_TAG_variable  
                  DW_AT_name = "s"
                  DW_AT_type (0x0000002e "S")

            NULL

0x0000012c: DW_TAG_compile_unit 
              DW_AT_name "module2.cpp"

0x0000014f:   DW_TAG_structure_type 
                DW_AT_name = "S"

0x00000157:     DW_TAG_structure_type
                  DW_AT_name = "Nested"

0x00000160: DW_TAG_subprogram 

0x00000168:   DW_TAG_variable  
                   DW_AT_name = "n"
                   DW_AT_type (0x00000157 "Nested")

            NULL

New DWARF:

0x0000000b: DW_TAG_compile_unit
              DW_AT_name = "__type_table"

0x0000002e:   DW_TAG_structure_type 
                DW_AT_name = "S"

0x00000036:     DW_TAG_structure_type
                DW_AT_name = "Nested"

              NULL

0x0000012c: DW_TAG_compile_unit
              DW_AT_name = "module1.cpp"

0x0000014c:   DW_TAG_subprogram 

0x00000150:     DW_TAG_variable  
                  DW_AT_name = "s"
                  DW_AT_type (0x0000002e "S")

            NULL

0x0000018c: DW_TAG_compile_unit
              DW_AT_name "module2.cpp"

0x000001bc:   DW_TAG_subprogram 

0x000001c0:     DW_TAG_variable  
                  DW_AT_name = "n"
                  DW_AT_type (0x00000036 "Nested")

            NULL

In this example, "New DWARF" is the result of an ODR uniquing algorithm different from the current one.
The current ODR algorithm is: remember the first occurrence of a type and point all following references (to the same type) at that first-met type.
The new ODR algorithm might be: move all types into a separate artificial compile unit containing the type table.
In this case the type table can always be created in the same form, regardless of the type processing order.

As an additional bonus, type merging might save a noticeable amount of space.
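
A minimal sketch of the determinism argument (hypothetical names, not the patch's data structures): if the artificial unit's children are emitted in an order derived from the types themselves, the output no longer depends on which thread reached a type first.

#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record for a type gathered into the artificial "type table" CU.
struct GatheredType {
  std::string QualifiedName; // e.g. "S" or "S::Nested"
  // ... cloned DIE data would live here ...
};

// The emission order is a pure function of the gathered set, so the same
// input object files always produce the same artificial CU, regardless of
// the order in which worker threads discovered the types.
void sortForDeterministicEmission(std::vector<GatheredType> &Types) {
  std::sort(Types.begin(), Types.end(),
            [](const GatheredType &A, const GatheredType &B) {
              return A.QualifiedName < B.QualifiedName;
            });
}

(Distinct types that happen to share a name would of course need a secondary sort key, e.g. the declaration context.)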

clayborg added a comment.EditedFeb 22 2021, 11:50 AM
In D96035#2579263, @avl wrote:

@JDevlieghere @aprantl @echristo @dblaikie @clayborg

What do you think of this patch? Is it OK in general and Is it worth to divide it into smaller patches and integrate?

The original dsymutil (pre llvm version) that I wrote was deterministic, and so is the current one. That being said, I am not too worried about this because the quality of the resulting debug info is the main goal here.

My main concerns are:

  • The final .debug_str should only have one copy of a string. It should be possible to create individual string tables and merge them in the last loop you have that fixes up offsets right?
  • I am unclear on how ODR uniquing works here without cross talking between threads and slowing everything down. Since they currently create new DW_FORM_ref_addr that reference other already streamed out CUs, how does that work in this patch?
  • I am unclear on what state the inter CU references are in in this patch when they are in the same .o file from LTO. Do they work? Are you loading all CUs into memory? I would be ok with loading all CUs from the same file at one time.

I do like your idea of creating a new compile unit for types in order to make ODR better, as I have thought of this too. A few extra ideas on this:

  • it might be best to just create type units on the fly instead of making fake compile units?
  • if we don't want to create/use type units then:
    • create a new DW_TAG_compile_unit whose source path is the DW_AT_decl_file for the type in question
    • Bonus points if we can add all types from the same DW_AT_decl_file to the same DW_TAG_compile_unit
    • Some type class declarations that use compiler default created constructors, copy constructor, assignment operators might omit these member functions in various different compile units if they were or were not used by that compile unit. If we can augment the uniqued type definition to include these, that would be a huge win. Right now when ODR finds type "class A" with no compiler created constructor, it will re-emit a full type definition because DWARF needs to be able to reference it _if_ a previous copy did not include the needed method

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

avl added a comment.Feb 23 2021, 11:53 AM

The final .debug_str should only have one copy of a string. It should be possible to create individual string tables and merge them in the last loop you have that fixes up offsets right?

Right. The final .debug_str has only one copy of a string.
The implementation is a bit different though:

While cloning a compile unit, DWARFLinker remembers a pointer to the string and a reference to the string attribute:

// create delayed task for setting proper offset value for the string
// attribute.
CU.noteOffsetData(It, *String);

After cloning is complete, DWARFLinker puts the remembered strings into the common StringPool, and the attribute is updated:

void DWARFLinker::assignDelayedOffsets(CompileUnit &CU) {
  std::unique_lock<std::mutex> Guard(StringPoolMutex);

  for (const CompileUnit::OffsetsData &Data : CU.getStringsOffsetData())
    setStringsOffset(Data.AttrToPatch, Data.String);
}

This way, the common resource TheOffsetsStringPool is locked only once per compile unit.
After all compile units are processed, .debug_str is generated from TheOffsetsStringPool,
which has only one copy of each string.

I am unclear on how ODR uniquing works here without cross talking between threads and slowing everything down. Since they currently create new DW_FORM_ref_addr that reference other already streamed out CUs, how does that work in this patch?

ODR uniquing does cross-talk between threads. It is implemented this way:

During the declaration context analysis stage, DWARFLinker remembers the properties of each declaration context of the compile unit:

// In the multi-threaded case it is inefficient to lock and use
// the common resource ODRContexts.getChildDeclContext() for
// every type declaration context. Instead, we remember the
// DeclarationContextKey and create the DeclarationContext and
// update the Info.Ctxt field later. That way we lock
// ODRContexts.getChildDeclContext() only once for each compile
// unit.

CU.noteContextData(Current.ContextKey, &Info);

Later, after all types are analyzed, corresponding type declaration contexts are created in the common
ODRContexts pool.

void DWARFLinker::assignDelayedDeclContexts(CompileUnit &CU) {
  std::unique_lock<std::mutex> Guard(ContextsMutex);

  for (const CompileUnit::DeclContextKeyData &Data : CU.getDeclContextData())
    setDeclContext(Data.Key, Data.Info);
}

At that stage, declaration contexts (which are already in the ODRContexts pool) might be reused.

I am unclear on what state the inter CU references are in in this patch when they are in the same .o file from LTO. Do they work? Are you loading all CUs into memory? I would be ok with loading all CUs from the same file at one time.

Inter-CU references are supported. This patch tries to load/unload independent compile units and to keep loaded at one time only inter-connected compile units (a rough sketch of that bookkeeping follows the list below).
Suppose we have the following dependencies:

CU1 references CU2, CU2 references CU4, CU4 references CU3, CU5(does not reference other CUs)

  1. First pass: CU1 is loaded; it is seen that it needs CU2, so CU1 is left in memory. CU2 is loaded; it is seen that it needs CU4, so CU2 is left in memory. CU3 is loaded, processed, unloaded. CU4 is loaded; it is seen that it needs CU3, so CU4 is left in memory. CU5 is loaded, processed, unloaded.
  2. Second pass: CU1 is processed. CU2 is processed. CU3 is loaded and processed. CU4 is processed.
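
A rough sketch of the bookkeeping mentioned above (names are illustrative, not the patch's API): the first pass marks which CUs must be revisited, and everything else is freed as soon as it has been processed.

#include <cstddef>
#include <vector>

// Hypothetical per-CU state for the two-pass scheme described above.
struct CUState {
  bool Loaded = true;
  bool KeepLoaded = false;         // referencing CU: cannot be finished in pass 1
  bool ReloadInSecondPass = false; // referenced CU: may be unloaded and re-read
};

// Called when, while processing CU `From`, a reference into CU `To` is found.
void noteInterCUReference(std::vector<CUState> &CUs, size_t From, size_t To) {
  CUs[From].KeepLoaded = true;       // in the example: CU1, CU2 and CU4 stay loaded
  CUs[To].ReloadInSecondPass = true; // in the example: CU3 is re-read in pass 2
}

// End of the first pass: anything not involved in inter-CU references has
// already been processed and its data can be dropped.
void unloadIndependentCUs(std::vector<CUState> &CUs) {
  for (CUState &CU : CUs)
    if (!CU.KeepLoaded)
      CU.Loaded = false; // its input and output DIEs are freed here
}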

The reason I try to avoid loading all CUs from the same file at one time is this utility: https://reviews.llvm.org/D86539.
An implementation that loads all CUs from the same file at one time would require a lot of memory when it optimizes a single binary (not a set of object files).

it might be best to just create type units on the fly instead of making fake compile units?

Type units might be created, but they require additional space. We might lose some size improvement.

create a new DW_TAG_compile_unit whose source path is the DW_AT_decl_file for the type in question

I thought about a single artificial compile unit for all types, but yes, we might implement it this way.

Some type class declarations that use compiler default created constructors, copy constructor, assignment operators might omit these member functions in various different compile units if they were or were not used by that compile unit. If we can augment the uniqued type definition to include these, that would be a huge win.

Yes, that is the idea of type merging; it might save a lot of space.

avl added a comment.Feb 23 2021, 11:59 AM

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic in --num-threads=1, and is Not deterministic in multi-thread case?

Might changing the ODR uniquing algorithm be a solution then? (That is what I described in a previous message: do not reuse existing type declarations, but generate an artificial compile unit containing the types.) Such an artificial compile unit containing the types might be generated in a deterministic way.

In D96035#2582780, @avl wrote:

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic in --num-threads=1, and is Not deterministic in multi-thread case?

Yes :)

avl added a comment.Feb 23 2021, 12:11 PM
In D96035#2582780, @avl wrote:

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic in --num-threads=1, and is Not deterministic in multi-thread case?

Yes :)

I see. I tried to preserve the existing idea of ODR uniquing in this patch. To make it work in a multi-threaded way and to have deterministic behavior, could we consider changing the current ODR uniquing algorithm?

In D96035#2582780, @avl wrote:

A non-deterministic behavior is probably not acceptable - it'd certainly be no good at Google if we ever wanted to use this for DWARF linking. Not sure how the Apple folks feel about it for classic dsymutil.

Yes, this would be a non-starter.

even if it IS deterministic in --num-threads=1, and is Not deterministic in multi-thread case?

Might changing the ODR uniquing algorithm be a solution then? (That is what I described in a previous message: do not reuse existing type declarations, but generate an artificial compile unit containing the types.) Such an artificial compile unit containing the types might be generated in a deterministic way.

I'm not sure why that would necessarily be better/faster - it'd still require two passes, right? One to collect the types, then another to emit the unit with the types and the units referencing that?

If it requires two passes, what about one pass that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones? That could then potentially produce the same kind of (& possibly exactly the same) output as is the case today, rather than introducing a new CU?

avl added a comment.Feb 23 2021, 2:00 PM

I'm not sure why that would necessarily be better/faster - it'd still require two passes, right? One to collect the types, then another to emit the unit with the types and the units referencing that?

If it requires two passes, what about one pass that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones? That could then potentially produce the same kind of (& possibly exactly the same) output as is the case today, rather than introducing a new CU?

I think I did not understand the idea. Current processing looks similar to the above description.
For the single compile unit, we do the declaration context analysis step that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones. Later we emit the body of the compile unit based on the results of the declaration context analysis step.

When we decide which of the type definitions to keep in the unit we use ODR uniquing algorithm which sets the first met type definition as canonical and uses it later for type references in other compile units.

But we do not have a fixed order of compile units, they are processed in parallel. If both CU1 and CU2 have the same type definition, depending on the real order, canonical type definition might be set for CU1 or CU2. How could we avoid that non-determinism using additional pass?

Speaking of the solution with an artificial CU, it has several advantages over the current ODR algorithm:

  1. Since DWARFLinker generates the artificial CU, it might generate it in a deterministic way. Regardless of the real order in which types are processed, we might always generate them in the same form (sorted by name or otherwise).
  2. The current ODR algorithm makes CU processing depend on a common resource, the ODR declaration context. That slows down parallel execution (threads wait for each other). If we generate a separate artificial CU for the types, processing of the current CU might be done without waiting on the ODR declaration context, which might increase the effective CPU utilization ratio.
  3. Since we generate the types, it is possible to implement type merging, which might save a significant amount of space.
In D96035#2583116, @avl wrote:

I'm not sure why that would necessarily be better/faster - it'd still require two passes, right? One to collect the types, then another to emit the unit with the types and the units referencing that?

If it requires two passes, what about one pass that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones? That could then potentially produce the same kind of (& possibly exactly the same) output as is the case today, rather than introducing a new CU?

I think I did not understand the idea. Current processing looks similar to the above description.
For the single compile unit, we do the declaration context analysis step that decides which of the type definitions to keep in the unit that defined them, and which to remove/make a reference to the kept ones. Later we emit the body of the compile unit based on the results of the declaration context analysis step.

When we decide which of the type definitions to keep in the unit we use ODR uniquing algorithm which sets the first met type definition as canonical and uses it later for type references in other compile units.

But we do not have a fixed order of compile units, they are processed in parallel. If both CU1 and CU2 have the same type definition, depending on the real order, canonical type definition might be set for CU1 or CU2. How could we avoid that non-determinism using additional pass?

Comparing your proposal - avoiding nondeterminism by sorting the types by name in the new type-holding CU, we could do something similar, but instead sorting the CUs a type appears in to pick which of those existing locations to be the canonical home. (or, indeed, could sort by the original linear visitation order)

eg: multithreaded walk over each CU and find each type (this applies to both your proposal and what I'm suggesting here, I think) - then, rather than sorting by name and putting them in a new CU, sort by the "index" of the CU the type appeared in (where the index is the order the CUs would've been visited in in the current linear algorithm) then place/preserve the canonical type in the CU with the lowest index where the type appears?

Then the second pass goes and emits the CUs, consulting this list of type homes to determine whether this type should be emitted in the CU, or reference a copy emitted elsewhere.
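
For concreteness, a sketch of such a "type home" table (hypothetical code, not part of this patch): the home of a type is the smallest index, in the original linear visitation order, of a CU that defines it. Taking a minimum is order-independent, so the table comes out the same no matter how the parallel scan is scheduled.

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// Type identity is simplified to a fully qualified name here; a real
// implementation would more likely key on the declaration context.
using TypeHomeMap = std::map<std::string, uint32_t>;

void noteTypeDefinition(TypeHomeMap &Homes, const std::string &TypeName,
                        uint32_t CUIndex) {
  auto It = Homes.find(TypeName);
  if (It == Homes.end())
    Homes.emplace(TypeName, CUIndex);
  else
    It->second = std::min(It->second, CUIndex);
}

// Per-thread maps are merged with the same min-reduction, so the result does
// not depend on thread scheduling. In the second pass a CU emits the full
// definition only when its index equals the recorded home; otherwise it
// emits a DW_FORM_ref_addr to the copy in the home CU.
void mergeHomes(TypeHomeMap &Into, const TypeHomeMap &From) {
  for (const auto &Entry : From)
    noteTypeDefinition(Into, Entry.first, Entry.second);
}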

I think there might be merit to the approach you're suggesting too - but I'm less sure it's necessary to significantly alter the output scheme to achieve the benefits of parallelism.

(as for type merging - that might also be possible with the scheme I'm proposing - if we're rewriting DIEs anyway, seems plausible we could add new child DIEs to canonical type DIEs)

One idea: for each type we want to unique, each CU, as it is being parsed on its own thread, creates a type map that maps the "type compile unit path" (which is the DW_AT_decl_file of that type) to a list of uniqued types with decl context info. This would allow all types with the same DW_AT_decl_file to be stored in the same type compile unit. As we emit the debug info for a normal compile unit, any type references are changed to be DW_FORM_ref_addr references with values that will need to be fixed up. After all normal CUs have been emitted, combine all of the CU-specific type maps together, and then emit the type compile units (taking care to include all needed children in each type to make the most complete version of the type possible) in a sorted order to allow determinism. Then we fix up all DW_FORM_ref_addr references in the normal CUs to point to the one type definition from the type compile units.
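
A sketch of the data structure that suggestion implies (assumed names and fields, purely illustrative):

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-CU type map: keyed by the type's DW_AT_decl_file so that
// all types declared in one header land in the same artificial type compile
// unit, with enough context recorded to unique and later copy each type.
struct UniquedType {
  std::string DeclContext;  // e.g. "std::__1"
  std::string Name;         // e.g. "FILE"
  uint64_t InputDieOffset;  // where to copy the definition from
};

using PerCUTypeMap =
    std::map<std::string /*DW_AT_decl_file*/, std::vector<UniquedType>>;

// After all normal CUs are emitted: merge the per-CU maps, emit one type
// compile unit per decl-file key in sorted order (for determinism), then
// resolve the recorded DW_FORM_ref_addr fixups to the final type offsets.
void mergeTypeMaps(PerCUTypeMap &Into, const PerCUTypeMap &From) {
  for (const auto &Entry : From) {
    auto &Dest = Into[Entry.first];
    Dest.insert(Dest.end(), Entry.second.begin(), Entry.second.end());
  }
}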

Sorry if this is restating what has already been said/implied!

avl added a comment.Feb 24 2021, 5:04 AM

Comparing your proposal - avoiding nondeterminism by sorting the types by name in the new type-holding CU, we could do something similar, but instead sorting the CUs a type appears in to pick which of those existing locations to be the canonical home. (or, indeed, could sort by the original linear visitation order)

eg: multithreaded walk over each CU and find each type (this applies to both your proposal and what I'm suggesting here, I think) - then, rather than sorting by name and putting them in a new CU, sort by the "index" of the CU the type appeared in (where the index is the order the CUs would've been visited in in the current linear algorithm) then place/preserve the canonical type in the CU with the lowest index where the type appears?

Then the second pass goes and emits the CUs, consulting this list of type homes to determine whether this type should be emitted in the CU, or reference a copy emitted elsewhere.

My understanding is that this way assumes all CUs should be loaded into memory, and it does an extra pass, i.e.:

  1. The first pass enumerates, in a multithreaded manner, all object files and all compile units, and creates an indexed map (the list of type homes). (As a result, all CUs from all object files are loaded into memory at the same time. The indexed map is also in memory.)
  2. The second pass enumerates, in a multithreaded manner, all object files and all compile units, and emits the bodies (consulting the list of type homes to determine whether a type should be emitted in the CU or reference a copy emitted elsewhere).
  3. Patch sizes/offsets/references after the individual CU bodies are glued into the resulting file.

The scheme implemented in this patch (and which might also be done with an additional compile unit keeping the types) visits each CU only once, and then it might be unloaded from memory:

  1. The first pass enumerates, in a multithreaded manner, all object files and all compile units. Each CU is loaded, analyzed for types (a list of attributes referencing types is created), emitted; types moved into the artificial CU; unloaded.
  2. Emit the artificial CU (after the first pass is finished, all CUs are unloaded from memory except the artificial one).
  3. Patch sizes/offsets/references after the individual CU bodies are glued into the resulting file. (At this stage, type references from already emitted CU bodies are patched to proper offsets inside the artificial CU.)

this scheme does not need to have two passes and it does not need to load all CUs into the memory at the same time.

(as for type merging - that might also be possible with the scheme I'm proposing - if we're rewriting DIEs anyway, seems plausible we could add new child DIEs to canonical type DIEs)

That would be one more reason to keep CUs in memory. When we rewrite a canonical DIE in some CU we would need to gather all children from all other CUs.

The artificial CU (from the scheme with a separate CU keeping the types) would have the merged types, so it does not require keeping the source CUs.

avl added a comment.Feb 24 2021, 5:37 AM

One idea is for each type we want to unique, each CU as it is being parsed on its own thread creates a type map that maps "type compile unit path" (which is DW_AT_decl_file of that type) to a list of uniqued types with decl context info. This would allow all types with the same DW_AT_decl_file to be stored in the same type compile unit. As we emit the debug info for a normal compile unit, any type references are changed to be DW_AT_ref_addr references with values that will need to be fixed up. After all normal CUs have been emitted, combine all of the CU specific type maps together, and now emit the type compile units (carefully to include all needed children in each type to make the most complete version of the type possible and emit them) in a sorted order to allow determinism. Then we fix up all DW_AT_ref_addr references in the normal CUs to point to the one type definition from the type compile units.

Sorry if this is restating what has already been said/implied!

Yep, that is pretty much the same scenario I talked about.

The difference is that you propose to create type units (DW_TAG_type_unit) for the types, while my suggestion is to use an artificial compile unit (which would save the space needed by type unit headers).

This would allow all types with the same DW_AT_decl_file to be stored in the same type compile unit.

Do you mean DW_TAG_type_unit or DW_TAG_compile_unit here?

It looks like we could not put several types into the same type unit:

"A type unit entry owns debugging information entries that
represent the definition of a single type, plus additional debugging information
entries that may be necessary to include as part of the definition of the type."

Why would it be good to group types on a DW_AT_decl_file basis?

friss added a comment.Feb 24 2021, 8:21 AM
In D96035#2584686, @avl wrote:

The scheme implemented in this patch and which might be done with additional compile unit keeping types - visit CU only once and then it might be unloaded from the memory:

  1. the first pass enumerates in a multithreaded manner all object files, all compile units. Each CU is loaded, analyzed for types(there would be created a list of attribute referencing types), emitted. types moved into the artificial CU. unloaded.

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).
Conceptually, I think the idea of moving types to a "central" location is valid; it seems semantically equivalent to the kind of debug info we would get with -gmodules. Another kind of approach would be to gather enough information to create a dependency graph of CUs and then orchestrate parallel processing according to it. Even if this would limit the parallelism at the beginning of a link, I think it's very hard to intuitively figure out whether it would perform worse than the approach above.
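
A sketch of what such dependency-driven orchestration could look like (illustrative only; nothing here is in the patch): a CU becomes ready once every CU it references has been processed, in the spirit of a topological order.

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical CU dependency graph node.
struct CUNode {
  std::vector<size_t> Dependents; // CUs that reference this CU
  unsigned PendingDeps = 0;       // CUs this CU still waits for
};

void scheduleCUs(std::vector<CUNode> &Graph,
                 const std::function<void(size_t)> &ProcessCU) {
  std::queue<size_t> Ready;
  for (size_t I = 0; I < Graph.size(); ++I)
    if (Graph[I].PendingDeps == 0)
      Ready.push(I);
  while (!Ready.empty()) {
    size_t I = Ready.front();
    Ready.pop();
    ProcessCU(I); // in a real linker this would be handed to a worker thread
    for (size_t Dep : Graph[I].Dependents)
      if (--Graph[Dep].PendingDeps == 0)
        Ready.push(Dep);
  }
}

Mutually referencing units (like the CU1/CU100 example earlier in the thread) never become ready under this scheme and would still need the keep-loaded/second-pass treatment.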

avl added a comment.Feb 24 2021, 9:40 AM

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).

We might sort types on name basis inside that artificial CU. In this case it would always be emitted in the same order.

In D96035#2585324, @avl wrote:

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).

We might sort types on name basis inside that artificial CU. In this case it would always be emitted in the same order.

I don't think finding how to sort them is the complicated part. In my mind the issue is keeping them around until you can sort them as I'm not sure DIEs can exist not attached to a CU. Also, it's not just the one type, but the transitive closure of all its dependencies that you need to consider which makes things (much) more complicated. As I said before, I haven't worked on this in a long time, so feel free to shoot my concerns down!

In D96035#2585324, @avl wrote:

How would you guarantee that the artificial CU is always emitted in the same order? It seems like you need to stash the types somewhere, then create a deterministic ordering and then emit it, but I don't remember whether that's doable with LLVM's DIEs (It's been a while since I touched this code).

We might sort types on name basis inside that artificial CU. In this case it would always be emitted in the same order.

I don't think finding how to sort them is the complicated part. In my mind the issue is keeping them around until you can sort them as I'm not sure DIEs can exist not attached to a CU. Also, it's not just the one type, but the transitive closure of all its dependencies that you need to consider which makes things (much) more complicated. As I said before, I haven't worked on this in a long time, so feel free to shoot my concerns down!

I agree with Fred in that it might be too expensive to keep all of the CUs around so that we can later copy type data out of them at the end into uniqued compile units.

For testing DWARF, a while back I created a DWARF generator set of classes:

llvm/unittests/DebugInfo/DWARF/DwarfGenerator.h
llvm/unittests/DebugInfo/DWARF/DwarfGenerator.cpp

This allows the creation of stand alone DWARF using a new set of DWARF classes:

llvm::dwarfgen::DIE
llvm::dwarfgen::CompileUnit
llvm::dwarfgen::Generator

It might allow us to copy the types into a llvm::dwarfgen::Generator instance and might ease the ability to add and remove DIEs to parent DIEs, like adding new methods to classes and slowly adding new types to a compile unit DIE. Check the test files in llvm/unittests/DebugInfo/DWARF that include the DwarfGenerator.h file to see examples of tests that create DWARF using this generator. The nice thing is all references to other DIEs are actual pointers to llvm::dwarfgen::DIE objects, so you can insert and remove DIEs as needed to anything within a llvm::dwarfgen::CompileUnit, even across llvm::dwarfgen::CompileUnit objects if needed, and when you generate the DWARF, it all gets finalized at that point. So maybe we can use the DWARF generator for all types. When each compile unit is done being parsed on its thread, it can enter a type copying phase where it makes sure all of the uniqued types it needs are copied into a single smart class that uses the DWARF generator. Then we can unload the compile units from the .o files right away when we are done.

In D96035#2584686, @avl wrote:

Comparing your proposal - avoiding nondeterminism by sorting the types by name in the new type-holding CU, we could do something similar, but instead sorting the CUs a type appears in to pick which of those existing locations to be the canonical home. (or, indeed, could sort by the original linear visitation order)
eg: multithreaded walk over each CU and find each type (this applies to both your proposal and what I'm suggesting here, I think) - then, rather than sorting by name and putting them in a new CU, sort by the "index" of the CU the type appeared in (where the index is the order the CUs would've been visited in under the current linear algorithm) then place/preserve the canonical type in the CU with the lowest index where the type appears?
Then the second pass goes and emits the CUs, consulting this list of type homes to determine whether this type should be emitted in the CU, or reference a copy emitted elsewhere.

My understanding is that this approach assumes all CUs should be loaded into memory and it does an extra pass, i.e.:

Sounds like your proposal would require that too - or require reloading the CUs twice? I think either strategy (keeping them loaded, or reloading them) could be used for either of our proposed directions (creating a separate unit, or using the existing units)?

  1. The first pass enumerates in a multithreaded manner all object files and all compile units, and creates an indexed map (the list of type homes). (As a result, all CUs from all object files are loaded into memory at the same time. The indexed map is also in memory.)
  2. The second pass enumerates in a multithreaded manner all object files and all compile units, and emits bodies (consulting the list of type homes to determine whether a type should be emitted in the CU, or reference a copy emitted elsewhere).
  3. Patch sizes/offsets/references after individual CU bodies are glued into the resulting file.

The scheme implemented in this patch (and which might be extended with an additional compile unit keeping types) visits each CU only once, and then it might be unloaded from memory:

  1. The first pass enumerates in a multithreaded manner all object files and all compile units. Each CU is loaded, analyzed for types (a list of attributes referencing types is created), and emitted; the types are moved into the artificial CU; then the CU is unloaded.

"types moved into the artificial CU" means what exactly? Copying them type DIE tree into some intermediate, in-memory representation?

Ah, yes, though keeping them in memory may be expensive - it might be cheaper to rewalk the units/know the offsets to the DIEs and reload them in some way.

  2. Emit the artificial CU (after the first pass is finished, all CUs are unloaded from memory except the artificial one).
  3. Patch sizes/offsets/references after individual CU bodies are glued into the resulting file. (At this stage, type references from already emitted CU bodies are patched to proper offsets inside the artificial CU.)

This scheme does not need two passes and it does not need to load all CUs into memory at the same time.

(as for type merging - that might also be possible with the scheme I'm proposing - if we're rewriting DIEs anyway, seems plausible we could add new child DIEs to canonical type DIEs)

That would be one more reason to keep CUs in memory. When we rewrite a canonical DIE in some CU we would need to gather all children from all other CUs.

The artificial CU (from the scheme with a separate CU keeping types) would have the merged type, so it does not require keeping the source CU.

Yes, if we're merging types it becomes somewhat more complicated - again, could do a two pass (without necessarily keeping all CUs in memory) - go quickly skim the CUs, for each type, record the type name and member list - then merge those lists, keeping the type record for the lowest indexed CU, and a merged member list that mentions which offset/CU each member comes from. This is essentially the same data you'd have to keep in memory (either fully in memory, or some kind of skeleton data that refers to the CUs/input file itself, so it can be reparsed as needed) for the "emit a separate unit full of types at the end" approach.

And in any case, you couldn't emit the units in parallel in the first pass, because you wouldn't know which offsets to write them to, right? (because the size of units will be changing during this process)

So I'm not sure where the parallelism comes into your scheme (& even in my scheme it'd be a bit non-trivial - I guess you'd have to record all the DIE references in each unit that might become cross-unit references (so you know you'll have to change their encoded size) and all the types (so you know whether that'll get bigger (if members are merged in) or smaller (if the type is removed in favor of referencing a type in another unit))) - not sure there's a huge difference in performance/complexity between the two, perhaps.

avl added a comment.Feb 25 2021, 1:34 AM

I agree with Fred in that it might be too expensive to keep all of the CUs around so that we can later copy type data out of them at the end into uniqued compile units.

It might allow us to copy the types into a llvm::dwarfgen::Generator instance and might ease the ability to add and remove DIEs to parent DIEs, like adding new methods to classes and slowly adding new types to a compile unit DIE.

When each compile unit is done being parsed on its thread, it can enter a type copying phase where it makes sure all of the uniqued types it needs are copied into a single smart class that uses the DWARF generator. Then we can unload the compile units from the .o files right away when we are done.

Agreed. That is exactly the solution I am talking about. I have not examined whether llvm::dwarfgen::Generator is suitable for that yet, but I would be happy to use it if it is.

avl added a comment.Feb 25 2021, 2:27 AM

Sounds like your proposal would require that too - or require reloading the CUs twice? I think either strategy (keeping them loaded, or reloading them) could be used for either of our proposed directions (creating a separate unit, or using the existing units)?

No. My solution does not require keeping all CUs in memory or reloading them twice (for the purpose of ODR type deduplication).

It loads a CU, analyzes types, creates a list of type references, removes the types, emits DWARF for that CU, and unloads the CU.

It does not need to load that CU again or keep the input DWARF of that CU in memory.
It only needs to fix up the remembered type references in the generated DWARF after all CUs are processed and the artificial CU is built.
It keeps all types in memory, but that requires less space than all CUs.

"types moved into the artificial CU" means what exactly? Copying them type DIE tree into some intermediate, in-memory representation?

yes.

Ah, yes, though keeping them in memory may be expensive - it might be cheaper to rewalk the units/know the offsets to the DIEs and reload them in some way.

I think it would not be cheaper to load the CU from disk again. But we can do an experiment and select the more effective solution; i.e. the solution with an artificial CU allows us to have a choice.

If we implemented a solution which loads all CUs into memory or reloads them from disk, then we would not have that choice.

Yes, if we're merging types it becomes somewhat more complicated - again, could do a two pass (without necessarily keeping all CUs in memory) - go quickly skim the CUs, for each type, record the type name and member list - then merge those lists, keeping the type record for the lowest indexed CU, and a merged member list that mentions which offset/CU each member comes from. This is essentially the same data you'd have to keep in memory (either fully in memory, or some kind of skeleton data that refers to the CUs/input file itself, so it can be reparsed as needed) for the "emit a separate unit full of types at the end" approach.
And in any case, you couldn't emit the units in parallel in the first pass, because you wouldn't know which offsets to write them to, right? (because the size of units will be changing during this process)

Not exactly right. This prototype uses exactly the one-pass scheme (which loads/parses/handles/unloads each CU only once). It processes units in parallel and generates the resulting DWARF plus the list of references which should be patched. After all CUs are processed and final sizes are known, it runs an additional pass which writes the correct offsets. That additional pass is _much_ cheaper than loading/parsing/handling/unloading the CUs again.

So I'm not sure where the parallelism comes into your scheme (& even in my scheme it'd be a bit non-trivial - I guess you'd have to record all the DIE references in each unit that might become cross-unit references (so you know you'll have to change their encoded size) and all the types (so you know whether that'll get bigger (if members are merged in) or smaller (if the type is removed in favor of referencing a type in another unit))) - not sure there's a huge difference in performance/complexity between the two, perhaps.

Right. That scheme (remember cross-unit references and patch them later) is the one used by this patch (except that it does not do types merging and does not have a separate CU keeping types). This scheme allows processing CUs separately, in parallel. It gives us more than a 2x performance improvement. It also might be improved further (theoretically, if the inter-thread dependencies were decreased).
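
To make the flow concrete, here is a self-contained toy model of the "buffer each CU, remember cross-unit references, glue and patch at the end" scheme; the Patch struct, the byte layout, and the sizes below are invented for illustration and are not the data structures from this patch:

// Toy model: emit per-"unit" byte buffers in parallel, remember where the
// cross-unit references live, then compute final offsets and patch them
// once all unit sizes are known. All details are illustrative only.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <mutex>
#include <thread>
#include <vector>

struct Patch { size_t UnitIdx, OffsetInUnit, TargetUnitIdx; };

int main() {
  std::vector<std::vector<uint8_t>> UnitBytes(4);
  std::vector<Patch> Patches;
  std::mutex PatchMutex;

  std::vector<std::thread> Workers;
  for (size_t I = 0; I < UnitBytes.size(); ++I)
    Workers.emplace_back([&, I] {
      UnitBytes[I].assign(16 + I * 8, uint8_t(I)); // "emit" this unit's body.
      // Reserve a 4-byte slot for a reference whose final value is unknown
      // until every unit's size is known.
      UnitBytes[I].insert(UnitBytes[I].end(), 4, 0);
      std::lock_guard<std::mutex> Lock(PatchMutex);
      Patches.push_back({I, UnitBytes[I].size() - 4, (I + 1) % 4});
    });
  for (auto &W : Workers)
    W.join();

  // Cheap final pass: compute where each unit starts, then patch references.
  std::vector<size_t> UnitStart(UnitBytes.size(), 0);
  for (size_t I = 1; I < UnitBytes.size(); ++I)
    UnitStart[I] = UnitStart[I - 1] + UnitBytes[I - 1].size();
  for (const Patch &P : Patches) {
    uint32_t Target = uint32_t(UnitStart[P.TargetUnitIdx]);
    std::memcpy(UnitBytes[P.UnitIdx].data() + P.OffsetInUnit, &Target, 4);
  }

  // Glue the buffered units into the final output.
  std::vector<uint8_t> Final;
  for (auto &B : UnitBytes)
    Final.insert(Final.end(), B.begin(), B.end());
  std::printf("final size: %zu bytes\n", Final.size());
}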

In D96035#2587154, @avl wrote:

Sounds like your proposal would require that too - or require reloading the CUs twice? I think either strategy (keeping them loaded, or reloading them) could be used for either of our proposed directions (creating a separate unit, or using the existing units)?

No. My solution does not require keeping all CUs in memory or reloading them twice (for the purpose of ODR type deduplication).

It loads a CU, analyzes types, creates a list of type references, removes the types, emits DWARF for that CU, and unloads the CU.

It does not need to load that CU again or keep the input DWARF of that CU in memory.
It only needs to fix up the remembered type references in the generated DWARF after all CUs are processed and the artificial CU is built.
It keeps all types in memory, but that requires less space than all CUs.

Yeah, having to keep the types in memory is the bit I'm getting at - but yes, it's not all CUs. If they're being unloaded/reloaded though, it might be less clear whether it's so costly to preserve the existing output behavior. Though I don't personally have a problem with creating a synthetic/arbitrary CU to put all the types in either. But other folks (Apple folks with a vested interest in the dsymutil behavior) might disagree.

"types moved into the artificial CU" means what exactly? Copying them type DIE tree into some intermediate, in-memory representation?

yes.

Ah, yes, though keeping them in memory may be expensive - it might be cheaper to rewalk the units/know the offsets to the DIEs and reload them in some way.

I think it would not be cheaper to load the CU from disk again. But we can do an experiment and select the more effective solution; i.e. the solution with an artificial CU allows us to have a choice.

If we implemented a solution which loads all CUs into memory or reloads them from disk, then we would not have that choice.

Yes, if we're merging types it becomes somewhat more complicated - again, could do a two pass (without necessarily keeping all CUs in memory) - go quickly skim the CUs, for each type, record the type name and member list - then merge those lists, keeping the type record for the lowest indexed CU, and a merged member list that mentions which offset/CU each member comes from. This is essentially the same data you'd have to keep in memory (either fully in memory, or some kind of skeleton data that refers to the CUs/input file itself, so it can be reparsed as needed) for the "emit a separate unit full of types at the end" approach.
And in any case, you couldn't emit the units in parallel in the first pass, because you wouldn't know which offsets to write them to, right? (because the size of units will be changing during this process)

Not exactly right. This prototype uses exactly the one-pass scheme (which loads/parses/handles/unloads each CU only once). It processes units in parallel and generates the resulting DWARF

By "generating resulting DWARF" I guess that has to be buffered in memory? (how could two CUs be written to the final output file at the same time? You wouldn't know where to write to because you wouldn't know how big the previous unit would be if you haven't finished processing it)

avl added a comment.Mar 5 2021, 1:40 PM

Yeah, having to keep the types in memory is the bit I'm getting at - but yes, it's not all CUs. If they're being unloaded/reloaded though, it might be less clear whether it's so costly to preserve the existing output behavior. Though I don't personally have a problem with creating a synthetic/arbitrary CU to put all the types in either. But other folks (Apple folks with a vested interest in the dsymutil behavior) might disagree.

I see, thanks. I tend to implement such a behavior (creating a synthetic/arbitrary CU to put all the types in) since I believe that would be the most effective way
(practice might show whether it is). It would be good to know whether that solution is acceptable or unacceptable before starting the implementation. @JDevlieghere @aprantl @friss Would you find an "implementation moving types into a separate synthetic/arbitrary CU" useful and worth integrating into upstream dsymutil (if it shows good performance/size results once implemented)?

By "generating resulting DWARF" I guess that has to be buffered in memory? (how could two CUs be written to the final output file at the same time? You wouldn't know where to write to because you wouldn't know how big the previous unit would be if you haven't finished processing it)

Right. The DWARF for each CU is buffered in memory, and after all CUs are processed they are glued together into the final resulting DWARF file. That is exactly how this implementation works.

By "generating resulting DWARF" I guess that has to be buffered in memory? (how could two CUs be written to the final output file at the same time? You wouldn't know where to write to because you wouldn't know how big the previous unit would be if you haven't finished processing it)

Right. The DWARF for each CU is buffered in memory, and after all CUs are processed they are glued together into the final resulting DWARF file. That is exactly how this implementation works.

That seems quite expensive :s

"how this implementation works" - the patch in this review? The existing checked-in implementation? (apologies as I'm not especially familiar with the implementation of dsymutil as it is today)

avl added a comment.Mar 5 2021, 2:19 PM

That seems quite expensive :s

Yep, it requires additional space, though it provides benefits:

  • it allows separate parallel processing (this patch shows more than a 2x performance improvement).
  • plain DWARF requires less space than parsed/generated DIEs (DWARFDie/DIE).

"how this implementation works" - the patch in this review? The existing checked-in implementation? (apologies as I'm not especially familiar with the implementation of dsymutil as it is today)

yes, the patch in this review.

avl updated this revision to Diff 364773.Aug 6 2021, 6:22 AM

Changes from the previous version of the patch:

  1. It produces predictable results when working in multi-thread mode.
  2. It does types merging. Types merging reduces the size of the .debug_info table by 40%.
avl edited the summary of this revision. (Show Details)Aug 6 2021, 6:24 AM
avl added a comment.Aug 6 2021, 7:06 AM

@JDevlieghere @aprantl @dblaikie @echristo @clayborg @friss

Would you mind taking a look at this patch, please? Changes from the first version of the patch:

  1. It produces predictable results when working in multi-thread mode.
  2. It does types merging. Types merging reduces the size of the .debug_info table by 40%.
avl added a comment.Aug 6 2021, 7:17 AM

Example of dwarf output with type merging(created from odr-uniquing.cpp dsymutil lit test) : https://reviews.llvm.org/P8270

avl added a comment.Aug 6 2021, 3:11 PM

@dblaikie I copied your message from the other review.

(I'm not especially invested in DWARFLinker, since it's mostly for MachO dsymutil linking (though generalizing it to work for dwp could be interesting, might be an alternative solution (rather than going to more like gold-dwp) to addressing some overhead/scalability issues in llvm-dwp))

The type merging approach sounds OK to me, broadly speaking.
The overhead at low parallelism seems unfortunate - can the algorithm be made more amenable to single threaded performance as well? (is there some design tradeoff that could happen at low parallelism that'd keep mostly the same approach? Or if we want low-parallelism performance are we going to end up maintaining the two different approaches entirely?)
What's the explanation for the difference/improvement in total output size with type uniquing enabled? I guess existing dsymutil is only structural, not odr-based uniquing or something?

avl added a comment.Aug 6 2021, 3:18 PM

(I'm not especially invested in DWARFLinker, since it's mostly for MachO dsymutil linking (though generalizing it to work for dwp could be interesting, might be an alternative solution (rather than going to more like gold-dwp) to addressing some overhead/scalability issues in llvm-dwp))

The type merging approach sounds OK to me, broadly speaking.
The overhead at low parallelism seems unfortunate - can the algorithm be made more amenable to single threaded performance as well?

Yes, it can. I think the single-thread performance might be improved to at least 1.5x of upstream dsymutil, maybe better.

(is there some design tradeoff that could happen at low parallelism that'd keep mostly the same approach? Or if we want low-parallelism performance are we going to end up maintaining the two different approaches entirely?)

Types merging plus deterministic output requires doing more work than the current dsymutil solution. That is the reason why the 1-2 thread version works slower. But there is a set of things which might improve performance.

If deterministic output is the requirement then we would not be able to support two different approaches since they would be binary incompatible.

What's the explanation for the difference/improvement in total output size with type uniquing enabled? I guess existing dsymutil is only structural, not odr-based uniquing or something?

The upstream dsymutil solution cannot deduplicate partially defined types. Some class declarations might omit member functions in various compile units. Upstream dsymutil does not change them and leaves them in place. The types-merging solution is able to remove all such copies and keep only one class declaration containing all member functions. That saves space.
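
A small illustration of what "partially defined types" means here (file and function names are made up for the example; each CU's DWARF typically describes only the members it instantiates):

cat s.h

struct S {
  template <typename T> T foo () {
    return T();
  }
};

cat tu1.cpp

// The DWARF for S in this CU usually contains only foo<int>.
#include "s.h"

int useInt (void) {
  return S().foo<int>();
}

cat tu2.cpp

// The DWARF for S in this CU usually contains only foo<double>.
#include "s.h"

double useDouble (void) {
  return S().foo<double>();
}

After merging, a single definition of S carries both foo<int> and foo<double>, and both CUs reference that one definition.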

In D96035#2932140, @avl wrote:

(I'm not especially invested in DWARFLinker, since it's mostly for MachO dsymutil linking (though generalizing it to work for dwp could be interesting, might be an alternative solution (rather than going to more like gold-dwp) to addressing some overhead/scalability issues in llvm-dwp))

The type merging approach sounds OK to me, broadly speaking.
The overhead at low parallelism seems unfortunate - can the algorithm be made more amenable to single threaded performance as well?

Yes, it can. I think the single-thread performance might be improved to at least 1.5x of upstream dsymutil, maybe better.

(is there some design tradeoff that could happen at low parallelism that'd keep mostly the same approach? Or if we want low-parallelism performance are we going to end up maintaining the two different approaches entirely?)

Types merging plus deterministic output requires doing more work than the current dsymutil solution. That is the reason why the 1-2 thread version works slower. But there is a set of things which might improve performance.

Fair enough - maybe there's room for the overhead to be low enough that it's worth the extra benefits of smaller output, such that it's not worth keeping the old implementation (ie: its time/space tradeoff isn't so interesting to be worth keeping). That'll be up to the Apple folks, pretty much.

If deterministic output is the requirement then we would not be able to support two different approaches since they would be binary incompatible.

So long as the different behaviors aren't chosen organically (based on number of cores available, etc), but instead by a flag - that would be OK. The goal is to be deterministic based on inputs/command line - but different command lines can produce different behaviors.

What's the explanation for the difference/improvement in total output size with type uniquing enabled? I guess existing dsymutil is only structural, not odr-based uniquing or something?

The upstream dsymutil solution cannot deduplicate partially defined types. Some class declarations might omit member functions in various compile units. Upstream dsymutil does not change them and leaves them in place. The types-merging solution is able to remove all such copies and keep only one class declaration containing all member functions. That saves space.

Yeah, figured something like that - thanks for explaining!

avl added a comment.Sep 6 2021, 6:52 AM

@JDevlieghere @aprantl @clayborg @friss @echristo

Folks, What is your opinion on this patch? Would it be useful to integrate?

The patch implements two features:

  1. Types deduplication by merging. It merges all partial type definitions/declarations into a single description to avoid type duplication. This approach produces a .debug_info table that is 50% smaller.
  2. Parallel execution. It allows speeding up execution if the hardware has available threads.

Current size/performance results are(compared with llvm upstream dsymutil):

  • .debug_info table is 40% smaller.
  • single-threaded mode works 1.7x slower.
  • multi-threaded mode works up to 2x faster.

There is a number of things which might improve the result:

  1. Deduplicate types defined in DW_TAG_subprograms with the DW_AT_specification attribute.
  2. (probably) Deduplicate abstract inline instances.
  3. Use another memory pool for DIEs. Instead of BumpPtrAllocator, use a pre-allocated pool. It would reduce data fragmentation.
  4. The current version of the patch uses two different memory pools (because of interface dependencies) with duplicated data: MTSafeStringTable and NonRelocatableStringpool. Using only one kind of pool would help to reduce memory usage and to improve performance numbers.
  5. Improve DIE navigation - https://reviews.llvm.org/D102634#inline-971772.
  6. Generate index tables in parallel.
  7. Avoid gluing generated tables into a single file. Use the tables generated for compilation units as the output. It would allow avoiding data copying. Consumers might write compilation unit tables directly, so we do not need to create an intermediate output file.

So, I think we might have (after the above is done) single-threaded performance 1.3-1.5x slower than current dsymutil and multi-threaded performance up to 3x faster.

In D96035#2985299, @avl wrote:

Yeah, especially @aprantl & @JDevlieghere

Folks, What is your opinion on this patch? Would it be useful to integrate?

The patch implements two features:

  1. Types deduplication by merging. It merges all partial type definitions/declarations into a single description to avoid type duplication. This approach produces a .debug_info table that is 50% smaller.
  2. Parallel execution. It allows speeding up execution if the hardware has available threads.

Current size/performance results are(compared with llvm upstream dsymutil):

  • .debug_info table is 40% smaller.

Out of curiosity - what's the total size of the clang dsym file with/without this patch/feature?

  • single-threaded mode works 1.7x slower.
  • multi-threaded mode works up to 2x faster.

With how many threads? (& what sort of setup are the speed measurements done with - writing to (& maybe reading from) a fast SSD or temporary/RAM filesystem?)

There is a number of things which might improve the result:

  1. Deduplicate types defined in DW_TAG_subprograms with the DW_AT_specification attribute.
  2. (probably) Deduplicate abstract inline instances.
  3. Use another memory pool for DIEs. Instead of BumpPtrAllocator, use a pre-allocated pool. It would reduce data fragmentation.
  4. The current version of the patch uses two different memory pools (because of interface dependencies) with duplicated data: MTSafeStringTable and NonRelocatableStringpool. Using only one kind of pool would help to reduce memory usage and to improve performance numbers.
  5. Improve DIE navigation - https://reviews.llvm.org/D102634#inline-971772.
  6. Generate index tables in parallel.
  7. Avoid gluing generated tables into a single file. Use the tables generated for compilation units as the output. It would allow avoiding data copying. Consumers might write compilation unit tables directly, so we do not need to create an intermediate output file.

So, I think we might have (after the above is done) single-threaded performance 1.3-1.5x slower than current dsymutil and multi-threaded performance up to 3x faster.

It may be necessary to lower the overhead further, though I'm not sure.

Alternatively: perhaps the new architecture could present a different tradeoff: what's the performance of parallelism (on a fast storage) without any deduplication? Can it run faster but produce a larger output (or is the output such a bottleneck that that isn't the case)?

avl added a comment.Sep 7 2021, 3:16 AM
Current size/performance results are(compared with llvm upstream dsymutil):

    .debug_info table is 40% smaller.

Out of curiosity - what's the total size of the clang dsym file with/without this patch/feature?

dsymutil --use-dlnext clang
ls -lh clang.dSYM/Contents/Resources/DWARF/clang
758M

dsymutil clang
ls -lh clang.dSYM/Contents/Resources/DWARF/clang
955M

i.e. the total size of the clang dSYM file becomes ~20% smaller with the --use-dlnext option.

single-threaded mode works 1.7x slower.
multi-threaded mode works up to 2x faster.

With how many threads? (& what sort of setup are the speed measurements done with - writing to (& maybe reading from) a fast SSD or temporary/RAM filesystem?)

Measurements were done on a Darwin 24-core 64 GB system using a regular disk (not an SSD or a temporary/RAM filesystem). The ~2x improvement is for 16 cores.

|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  | DWARF(*)|exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   155   |   15.8   |   465   |   269   |   16.1   |   273   |
|-------|------------------------------|------------------------------|
|   2   |    99   |   17.5   |   465   |   154   |   16.1   |   273   |
|-------|------------------------------|------------------------------|
|   4   |    99   |   17.5   |   465   |    96   |   16.5   |   273   |
|-------|------------------------------|------------------------------|
|   8   |    99   |   17.5   |   465   |    65   |   16.5   |   273   |
|-------|------------------------------|------------------------------|
|  16   |    99   |   17.5   |   465   |    52   |   16.5   |   273   |
|---------------------------------------------------------------------|

Alternatively: perhaps the new architecture could present a different tradeoff: what's the performance of parallelism (on a fast storage) without any deduplication? Can it run faster but produce a larger output (or is the output such a bottleneck that that isn't the case)?

For the case when type deduplication is not done, the performance numbers (for regular storage) look better (compared with the no-ODR case, the performance improvement is seen starting from 2 cores):

|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |  DWARF  |exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   224   |   15.9   |   1400  |   250   |   19.5   |   1400  |
|-------|------------------------------|------------------------------|
|   2   |   214   |   17.7   |   1400  |   144   |   19.5   |   1400  |
|-------|------------------------------|------------------------------|
|   4   |   214   |   17.7   |   1400  |    90   |   19.5   |   1400  |
|-------|------------------------------|------------------------------|
|   8   |   214   |   17.7   |   1400  |    62   |    20    |   1400  |
|-------|------------------------------|------------------------------|
|  16   |   214   |   17.7   |   1400  |    51   |    20    |   1400  |
|---------------------------------------------------------------------|

Output is not a bottleneck here. The execution time directly depends on the amount of source DWARF which should be analyzed/cloned (the upstream dsymutil in ODR deduplication mode skips analyzing/cloning for some DIEs, thus it works faster; when it does not skip types, as in the no-ODR case, it has similar single-thread performance).

Anyway, there is an additional set of things which might improve performance/memory requirements/output size for all modes (single-thread/multi-thread/ODR/no-ODR).

In D96035#2985299, @avl wrote:

@JDevlieghere @aprantl @clayborg @friss @echristo

Folks, What is your opinion on this patch? Would it be useful to integrate?

It is very compelling! I do think it would be useful to integrate if there are perf and size wins as you are finding.

The patch implements two features:

  1. Types deduplication by merging. It merges all partial type definitions/declarations into a single description to avoid type duplication. This approach produces a .debug_info table that is 50% smaller.

Very nice! I will apply this patch and do some testing. This can help eliminate issues we have in LLDB when you have many copies of a type, and any compiler generated methods (constructors, copy constructors, destructors, etc) are only defined in some of the copies of the type. Were you able to find all such methods and merge them in? I am guessing you didn't end up using the DWARFGenerator for this?

  2. Parallel execution. It allows speeding up execution if the hardware has available threads.

We have this in normal dsymutil right now right? You have just made it faster? Or made it work with the new approach you have in this patch?

Current size/performance results are(compared with llvm upstream dsymutil):

  • .debug_info table is 40% smaller.

That is great! Anything we can do to reduce the dups in DWARF even more than what we already had in dsymutil is great. I know there were a bunch of things that could cause types to be emitted multiple times in the current dsymutil that were not allowing us to reduce the DWARF size as much as we would have liked to.

  • single-threaded mode works 1.7x slower.
  • multi-threaded mode works up to 2x faster.

Nice.

There is a number of things which might improve the result:

  1. Deduplicate types defined in DW_TAG_subprograms with the DW_AT_specification attribute.
  2. (probably) Deduplicate abstract inline instances.
  3. Use another memory pool for DIEs. Instead of BumpPtrAllocator, use a pre-allocated pool. It would reduce data fragmentation.
  4. The current version of the patch uses two different memory pools (because of interface dependencies) with duplicated data: MTSafeStringTable and NonRelocatableStringpool. Using only one kind of pool would help to reduce memory usage and to improve performance numbers.
  5. Improve DIE navigation - https://reviews.llvm.org/D102634#inline-971772.
  6. Generate index tables in parallel.
  7. Avoid gluing generated tables into a single file. Use the tables generated for compilation units as the output. It would allow avoiding data copying. Consumers might write compilation unit tables directly, so we do not need to create an intermediate output file.

So, I think we might have (after the above is done) single-threaded performance 1.3-1.5x slower than current dsymutil and multi-threaded performance up to 3x faster.

You say 3x here and 2x above. If it is faster, then all is good. By default dsymutil runs with multiple threads, so the single threaded performance doesn't matter to me.

avl added a comment.Sep 8 2021, 12:01 AM
The patch implements two features:

    Types deduplication by merging. It merges all partial type definitions/declarations into a single description to avoid type duplication. This approach produces a .debug_info table that is 50% smaller.

Very nice! I will apply this patch and do some testing. This can help eliminate issues we have in LLDB when you have many copies of a type, and any compiler generated methods (constructors, copy constructors, destructors, etc) are only defined in some of the copies of the type. Were you able to find all such methods and merge them in?

Yep, all such methods should be merged into the single type description (as long as the method name is properly identified - it should either have an explicit name (DW_AT_name, DW_AT_linkage_name), or implicit unique identifiers (DW_AT_decl_file, DW_AT_decl_line)).
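
Roughly, the identity used for merging can be thought of as a key like the following; MethodKey is a made-up illustration, not the data structure used in the patch:

// Hypothetical sketch of a merge key for member functions: prefer the
// explicit name, fall back to the declaration coordinates.
#include <string>
#include <tuple>

struct MethodKey {
  std::string LinkageOrName; // DW_AT_linkage_name, or DW_AT_name if present.
  std::string DeclFile;      // DW_AT_decl_file, used when no explicit name...
  unsigned DeclLine = 0;     // ...together with DW_AT_decl_line.

  bool operator<(const MethodKey &O) const {
    return std::tie(LinkageOrName, DeclFile, DeclLine) <
           std::tie(O.LinkageOrName, O.DeclFile, O.DeclLine);
  }
};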

I am guessing you didn't end up using the DWARFGenerator for this?

I did not use DWARFGenerator because it heavily depends on AsmPrinter and MC/* classes. It might be reused, though, if the "DIE tree building" part were separated from the "writing output" part.
The implementation of the "DIE tree building" part in this patch uses the same idea as DWARFGenerator.

Another thing is that DWARFGenerator is not ready to work in multi-thread mode (i.e. to build the DIE tree in parallel).

DWARFGenerator might be refactored so that the above problems are resolved. In that case this patch would be able to use DWARFGenerator.

Parallel execution. It allows speed-up execution if the hardware has available threads.

We have this in normal dsymutil right now right? You have just made it faster? Or made it work with the new approach you have in this patch?

This patch uses the new approach. Current upstream dsymutil is able to utilize only two threads: it processes compile units sequentially, but is able to analyze one compile unit and clone the previous compile unit in parallel.

This patch is able to utilize all available threads. It is able to process each compile unit separately and in parallel.
Thus, it might perform better when available threads exist because more threads are used.
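
A toy contrast of the two threading models (analyze() and clone() below are placeholders, not the real DWARFLinker entry points):

// Upstream model: roughly two threads - clone CU i-1 while analyzing CU i.
// New model: every CU is analyzed and cloned on its own task, so all
// available threads can be used. Placeholder types/functions only.
#include <future>
#include <vector>

struct CU {};
static void analyze(CU &) {}
static void clone(CU &) {}

void upstreamPipeline(std::vector<CU> &CUs) {
  std::future<void> PrevClone;
  for (CU &C : CUs) {
    analyze(C); // overlaps with the clone of the previous CU.
    if (PrevClone.valid())
      PrevClone.wait();
    PrevClone = std::async(std::launch::async, [&C] { clone(C); });
  }
  if (PrevClone.valid())
    PrevClone.wait();
}

void perCUParallel(std::vector<CU> &CUs) {
  std::vector<std::future<void>> Tasks;
  for (CU &C : CUs)
    Tasks.push_back(std::async(std::launch::async, [&C] {
      analyze(C); // each CU is processed end-to-end independently.
      clone(C);
    }));
  for (auto &T : Tasks)
    T.wait();
}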

You say 3x here and 2x above. If it is faster, then all is good. By default dsymutil runs with multiple threads, so the single threaded performance doesn't matter to me.

2x is the performance result for this patch with 16 threads.
3x is my estimate for this patch plus the set of improvements (mentioned above) with 16 threads.

avl added a comment.Sep 8 2021, 12:19 AM

Yep, all such methods should be merged into the single type description (as long as the method name is properly identified - it should either have an explicit name (DW_AT_name, DW_AT_linkage_name), or implicit unique identifiers (DW_AT_decl_file, DW_AT_decl_line)).

There currently exists a restriction: this patch does not deduplicate types from DW_TAG_subprograms with the DW_AT_specification attribute. It could be done and would improve the size/performance numbers.

In D96035#2931137, @avl wrote:

@JDevlieghere @aprantl @dblaikie @echristo @clayborg @friss

Would you mind taking a look at this patch, please? Changes from the first version of the patch:

Disclaimer: I haven't had a chance to look at the code itself yet, but I have a few high level questions.

  1. It produces predictable results when working in multi-thread mode.

This means that the generated DWARF is identical regardless of the number of threads, right?

  2. It does types merging. Types merging reduces the size of the .debug_info table by 40%.

Very exciting!

My main concern at this point is how you are planning to qualify this change. I see you copied the existing tests (which I assume was necessary because you had to modify them to match the newly generated output) which is a great start. The LLDB test suite runs most tests with a dSYM and another good test would be to run it with the --use-dlnext flag. However, both of these are still pretty artificial test cases. Anecdotally, we often find bugs in large projects that don't show up in smaller ones, and semantically comparing the DWARF by hand is basically impossible. When we re-implemented dsymutil on top of llvm, we compared the hash of the generated DWARF to make sure it was identical (bugs in the original implementation included). I'm curious what your thoughts are on this.

avl added a comment.Sep 9 2021, 3:14 AM
It produces predictable results when working in multi-thread mode.

This means that the generated DWARF is identical regardless of the number of threads, right?

Right. Generated DWARF is identical regardless of the number of threads.

My main concern at this point is how you are planning to qualify this change. I see you copied the existing tests (which I assume was necessary because you had to modify them to match the newly generated output) which is a great start. The LLDB test suite runs most tests with a dSYM and another good test would be to run it with the --use-dlnext flag. However, both of these are still pretty artificial test cases. Anecdotally, we often find bugs in large projects that don't show up in smaller ones, and semantically comparing the DWARF by hand is basically impossible. When we re-implemented dsymutil on top of llvm, we compared the hash of the generated DWARF to make sure it was identical (bugs in the original implementation included). I'm curious what your thoughts are on this.

Yeah, the approach of comparing the output with the original implementation is impossible in this case since the output has changed radically.

I have a thought of implementing a set of tests using debugger scripts, something similar to this - https://reviews.llvm.org/D74169#change-ZwZOgdhWYOcz - so that such tests cover all/most aspects of the DWARF specification. These tests would not depend on the concrete dsymutil binary format, but would show whether the debug information is correctly seen by the debug info consumer (lldb).

Speaking of bugs found in large projects that don't show up in smaller ones: I think the only way is to create a wider testing suite, i.e. if the current test suite misses some cases then we need to add them (either upfront or after the fact). Another possibility is to make llvm-dwarfdump --verify able to catch more bugs.

The LLDB test suite runs most tests with a dSYM and another good test would be to run it with the --use-dlnext flag.

Did not do this yet, but will check.

avl updated this revision to Diff 372071.Sep 11 2021, 4:02 AM

rebased.

avl updated this revision to Diff 373325.Sep 17 2021, 1:39 PM

rebased. fixed a couple of minor issues.

avl updated this revision to Diff 395922.Dec 22 2021, 12:59 PM
  1. added deduplication of types defined inside subprograms.
  2. implemented several performance optimizations.
  3. supported "-Xclang -gsimple-template-names=simple".
  4. The debug types dump was implemented for llvm-dwarfdump.
avl retitled this revision from [WIP][dsymutil][DWARFlinker] implement separate multi-thread processing for compile units. to [dsymutil][DWARFlinker] implement separate multi-thread processing for compile units..Dec 22 2021, 1:09 PM
avl edited the summary of this revision. (Show Details)
avl added a comment.Dec 22 2021, 1:28 PM

@JDevlieghere @aprantl @dblaikie @clayborg @friss @echristo

Please consider this new version of the patch. It has several improvements and I think it is in a ready state now.

Changes from the previous version:

  1. It deduplicates types defined inside subprograms. This change allows reducing the size of the resulting DWARF even more: the .debug_info table is now 67% smaller.
  2. Several performance optimizations were implemented. It is 2.1x faster than upstream dsymutil in multi-thread ODR mode (16 threads). It is 1.4x slower in single-thread ODR mode. For non-ODR mode, it is 4.5x faster in multi-thread mode (16 threads) and has the same performance in single-thread mode. See comparison tables 1 and 2.
  3. It is compatible with "-Xclang -gsimple-template-names=simple". The previous version recognized types by their names (utilizing the fact that names contain template parameters). This version takes into account DW_TAG_template_type_parameter and DW_TAG_template_value_parameter. It allows taking advantage of "-Xclang -gsimple-template-names=simple": a smaller .debug_str section and a performance improvement (due to smaller strings). Table 3 compares output results for debug info built with "-Xclang -debug-forward-template-params", processed by current upstream dsymutil, and debug info built with "-Xclang -debug-forward-template-params -Xclang -gsimple-template-names=simple", processed by the new version from this patch. The overall binary is 37% smaller.
  4. Because of point 3, it is now a requirement to build debug info with "-Xclang -debug-forward-template-params". Without specifying "-Xclang -debug-forward-template-params" the output would be correct, but it would be non-deterministic.
  5. The debug types dump was implemented for llvm-dwarfdump (--odr-compat-dump), which allows comparing results generated by the current upstream dsymutil and the new version from this patch. Dumps created for debug info produced by dsymutil with and without the "--use-dlnext" option can be compared and should match.

Testing:

  1. It passed llvm/test/tools/dsymutil lit tests.
  2. It passed check-llvm, check-clang, check-lldb.
  3. It passed llvm-dwarfdump --verify for all llvm&clang binaries.
  4. It passed llvm-dwarfdump --odr-compat-dump for all llvm&clang binaries:
dsymutil clang 
llvm-dwarfdump --odr-compat-dump clang | md5
27e2af4662bf44adeccf158072e6ebd7
dsymutil --use-dlnext clang
llvm-dwarfdump --odr-compat-dump clang | md5
27e2af4662bf44adeccf158072e6ebd7
  5. It produces deterministic output:
dsymutil --use-dlnext --num-threads 1 bugpoint
md5 bugpoint.dSYM/Contents/Resources/DWARF/bugpoint
a1b113c6998a6ca23084e7a9bc929184
dsymutil --use-dlnext --num-threads 2 bugpoint
md5 bugpoint.dSYM/Contents/Resources/DWARF/bugpoint
a1b113c6998a6ca23084e7a9bc929184
dsymutil --use-dlnext --num-threads 3 bugpoint
md5 bugpoint.dSYM/Contents/Resources/DWARF/bugpoint
a1b113c6998a6ca23084e7a9bc929184
dsymutil --use-dlnext --num-threads 7 bugpoint
md5 bugpoint.dSYM/Contents/Resources/DWARF/bugpoint
a1b113c6998a6ca23084e7a9bc929184

There still exists room for improvement. The single-thread/multi-thread performance might be improved, and run-time memory requirements might be decreased. I think it would be better to do these additional improvements after the main part is integrated.

Performance results for this patch for the clang binary(Darwin 24-core 64G):

Table 1. clang binary, ODR deduplication is ON:

|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  | DWARF(*)|exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   159   |   16.5   |   485   |   220   |   12.6   |   157   |
|-------|------------------------------|------------------------------|
|   2   |    99   |   17.8   |   485   |   129   |   12.6   |   157   |
|-------|------------------------------|------------------------------|
|   4   |    99   |   17.8   |   485   |    83   |   12.6   |   157   |
|-------|------------------------------|------------------------------|
|   8   |    99   |   17.8   |   485   |    57   |   12.9   |   157   |
|-------|------------------------------|------------------------------|
|  16   |    99   |   17.8   |   485   |    47   |   13.1   |   157   |
|---------------------------------------------------------------------|

Table 2. clang binary, ODR deduplication is OFF:

|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |  DWARF  |exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   224   |   16.2   |   1450  |   224   |   15.0   |   1460  |
|-------|------------------------------|------------------------------|
|   2   |   218   |    18    |   1450  |   131   |   15.7   |   1460  |
|-------|------------------------------|------------------------------|
|   4   |   218   |    18    |   1450  |    83   |   15.9   |   1460  |
|-------|------------------------------|------------------------------|
|   8   |   218   |    18    |   1450  |    57   |   16.3   |   1460  |
|-------|------------------------------|------------------------------|
|  16   |   218   |    18    |   1450  |    47   |   16.6   |   1460  |
|---------------------------------------------------------------------|

Table 3. clang binary, ODR deduplication is ON
(dsymutil+"-Xclang -gsimple-template-names=simple"
vs dsymutil+"-Xclang -debug-forward-template-params -Xclang -gsimple-template-names=simple"):

|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |DWARF(**)|exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   159   |   16.5   |   485   |   216   |   12.0   |   157   |
|       |         |          |   258   |         |          |   210   |
|-------|------------------------------|------------------------------|
|   2   |    99   |   17.8   |   485   |   126   |   12.1   |   157   |
|       |         |          |   258   |         |          |   210   |
|-------|------------------------------|------------------------------|
|   4   |    99   |   17.8   |   485   |    80   |   12.1   |   157   |
|       |         |          |   258   |         |          |   210   |
|-------|------------------------------|------------------------------|
|   8   |    99   |   17.8   |   485   |    55   |   12.3   |   157   |
|       |         |          |   258   |         |          |   210   |
|-------|------------------------------|------------------------------|
|  16   |    99   |   17.8   |   485   |    44   |   12.4   |   157   |
|       |         |          |   258   |         |          |   210   |
|---------------------------------------------------------------------|

(**) DWARF is the size of the .debug_info (first) and .debug_str (second) sections.

avl updated this revision to Diff 396957.Jan 2 2022, 2:56 PM
avl edited the summary of this revision. (Show Details)

rebased.

I am taking a look at the current state of the patch and results this week. I am very interested in this. If anyone has any ideas on validation, please share.

So I am currently just looking at the content of the DWARF that is produced when using "--use-dlnext". Right now the main issue is that types declared in an anonymous namespace in implementation files are all being moved into the "__type_table" DIE:

0x0000000b: DW_TAG_compile_unit
              DW_AT_producer	("DWARFLinker lib")
              DW_AT_language	(DW_LANG_C_plus_plus_14)
              DW_AT_name	("__type_table")
              DW_AT_stmt_list	(0x00000000)
              DW_AT_comp_dir	("")

One quick example is LVBase from APValue.cpp line 165:

namespace {
  struct LVBase {
    APValue::LValueBase Base;
    CharUnits Offset;
    unsigned PathLength;
    bool IsNullPtr : 1;
    bool IsOnePastTheEnd : 1;
  };
}

Which produces a type in the "__type_table" compile unit:

0x000000a6:   DW_TAG_namespace
                DW_AT_decl_file	("/Users/gclayton/Documents/src/lldb/clean/Debug/lib/libclangAST.a(APValue.cpp.o)")

0x000000a9:     DW_TAG_structure_type
                  DW_AT_calling_convention	(DW_CC_pass_by_value)
                  DW_AT_name	("LVBase")
                  DW_AT_byte_size	(0x20)
                  DW_AT_decl_file	("/Users/gclayton/Documents/src/lldb/clean/llvm-project/clang/lib/AST/APValue.cpp")
                  DW_AT_decl_line	(166)

0x000000b3:       DW_TAG_subprogram
                    DW_AT_name	("LVBase")
                    DW_AT_declaration	(true)
                    DW_AT_artificial	(true)
                    DW_AT_external	(true)

I would think we should leave these types in the original compile unit as that will be the only compile unit that will use these types. Right now you are moving them to the "__type_table" compile unit so they can be uniqued, but they don't need to be uniqued. This is pretty C++ specific though.

The namespace also has the object file path as the decl file. If we are going to move the type, wouldn't it be better to use the compile unit's path like:

0x000000a6:   DW_TAG_namespace
                DW_AT_decl_file	("/Users/gclayton/Documents/src/lldb/clean/llvm-project/clang/lib/AST/APValue.cpp")

So an easy detection method to not move types that are defined in an implementation file would be: if a type's DW_AT_decl_file matches the DW_TAG_compile_unit source file, then don't move it. If the type's DW_AT_decl_file doesn't match, then put that type into the "__type_table" DW_TAG_compile_unit.
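
In other words, something like the following predicate (the string arguments stand in for reading DW_AT_decl_file and the CU's source file; this is an illustration, not code from the patch):

// Hypothetical sketch of the decl-file heuristic: keep CU-local types in
// place, move everything else into the uniquing "__type_table" CU.
#include "llvm/ADT/StringRef.h"

bool shouldMoveToTypeTable(llvm::StringRef TypeDeclFile,
                           llvm::StringRef CUSourceFile) {
  // A type declared in the CU's own source file can only be used by that
  // CU, so leave it where it is; other types may be duplicated across CUs.
  return TypeDeclFile != CUSourceFile;
}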

Another way to know if a type should be moved to the "__type_table" would be to detect if the type is defined multiple times, and if it isn't don't move it. If it is, then move it. Not sure how easy that would be to detect with your current codebase though.

Looking at a few normal CUs after this I see that there are a lot of imported declarations we can probably get rid of if no one references them:

0x04517e06:   DW_TAG_namespace
                DW_AT_name	("std")

0x04517e0b:     DW_TAG_namespace
                  DW_AT_name	("__1")
                  DW_AT_export_symbols	(true)

0x04517e10:       DW_TAG_namespace
                    DW_AT_name	("chrono")

0x04517e15:         DW_TAG_imported_module
                      DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.0.sdk/usr/include/c++/v1/chrono")
                      DW_AT_decl_line	(2923)
                      DW_AT_import	(0x0000000001b05f36)

0x04517e1d:         NULL

0x04517e1e:       DW_TAG_imported_declaration
                    DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.0.sdk/usr/include/c++/v1/cstddef")
                    DW_AT_decl_line	(49)
                    DW_AT_import	(0x0000000003fe0299)

0x04517e25:       DW_TAG_imported_declaration
                    DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.0.sdk/usr/include/c++/v1/cstddef")
                    DW_AT_decl_line	(50)
                    DW_AT_import	(0x0000000003fe0375)

0x04517e2c:       DW_TAG_imported_declaration
                    DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.0.sdk/usr/include/c++/v1/cstddef")
                    DW_AT_decl_line	(53)
                    DW_AT_import	(0x0000000003fe019b)

0x04517e33:       DW_TAG_imported_declaration
                    DW_AT_decl_file	("/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.0.sdk/usr/include/c++/v1/cstdlib")
                    DW_AT_decl_line	(99)
                    DW_AT_import	(0x0000000003fe0375)

... <many many more>

These imported declarations amounted to 3% of the overall CU size for a single compile unit that I tested. I know that the stock dsymutil whitelists all of these, which causes them and all the types they reference to be kept. So if we can get rid of any DW_TAG_imported_declaration DIEs that are not referenced, we might also be able to get rid of the types they reference. All types from a C++ header file that has a corresponding C header file, like "cstdlib" -> "stdlib.h", will have this issue and waste space.

One issue I do see with this approach as well is the first "type_table" is about 69MB for a dSYM file for LLDB.framework in LLDB. Any types that are accessed from debug info in LLDB will end up causing this entire DW_TAG_compile_unit to be parsed which can be expensive as LLDB parses DWARF partially and only parses what it needs to when loading DWARF. Is there any way to split up the "type_table" into multiple DW_TAG_compile_unit entries whose DW_AT_name matches the DW_AT_decl_file of the root type that is being uniqued? That would allow smaller compile units to be parsed and keep memory footprint down in debuggers.

These imported declarations amounted to 3% of the overall CU size for a single compile unit that I tested. I know that the stock dsymutil whitelists all of these, which causes them and all the types they reference to be kept. So if we can get rid of any DW_TAG_imported_declaration DIEs that are not referenced, we might also be able to get rid of the types they reference. All types from a C++ header file that has a corresponding C header file, like "cstdlib" -> "stdlib.h", will have this issue and waste space.

This is a more fundamental issue with Clang's debug info (I say this as the person who implemented the support for imported declarations in Clang) - Clang doesn't track the use of a using decl (& for a using directive, that brings in a whole namespace, there's no DWARF that would reference that anyway (eg: a variable is of some type - so you have to refer to the type DIE, there's nowhere to mention/reference an imported namespace from that association)) so Clang's debug info describes the type referenced via a using decl in the source without using the using decl in the DWARF:

$ cat decl.cpp
namespace ns {
struct t1 { };
}
using ns::t1;
t1 v1;
$ clang++-tot decl.cpp -g -c && llvm-dwarfdump-tot decl.o | grep "DW_AT_type\|DW_TAG"
0x0000000b: DW_TAG_compile_unit
0x0000001e:   DW_TAG_variable
                DW_AT_type      (0x00000038 "ns::t1")
0x00000033:   DW_TAG_namespace
0x00000038:     DW_TAG_structure_type
0x00000042:   DW_TAG_imported_declaration

So imported modules are never referenced from other DWARF - if you want any of them to persist you have to treat them all as roots, you can't GC them.
And for imported declarations, at least those generated from Clang you have to do the same until that bug is fixed in Clang to actually reference the using declarations.

In any case I assume this issue is mostly orthogonal to this proposal (ie: I expect that's an existing issue in dsymutil, not a new regression introduced by this patch), yeah?

One issue I do see with this approach as well is the first "type_table" is about 69MB for a dSYM file for LLDB.framework in LLDB. Any types that are accessed from debug info in LLDB will end up causing this entire DW_TAG_compile_unit to be parsed which can be expensive as LLDB parses DWARF partially and only parses what it needs to when loading DWARF. Is there any way to split up the "type_table" into multiple DW_TAG_compile_unit entries whose DW_AT_name matches the DW_AT_decl_file of the root type that is being uniqued? That would allow smaller compile units to be parsed and keep memory footprint down in debuggers.

Splitting them up seems OK to me. Though there's some tradeoff - encodings can be denser if clusters of types are in the same unit (since they can use shorter unit-local references, rather than full sec_offset references), and there are the extra unit headers. Probably groups of types in units would be good - a basic heuristic would be X types per unit, but a more advanced one would depend on the size of the types (& even more advanced would group related types together to benefit from more local references - but I doubt that's worth it).
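
For illustration only - a minimal sketch under the assumption that each deduplicated type comes with a size estimate; TypeEntry, EstimatedDIESize and groupTypesBySize are hypothetical names, not code from the patch - the basic "X bytes per unit" grouping could look roughly like this:

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one deduplicated type: its name and an
// estimate of how many bytes its DIE subtree will occupy in the output.
struct TypeEntry {
  std::string Name;
  std::size_t EstimatedDIESize;
};

// Greedily pack types into artificial type units so that each unit stays
// under a size budget. A real implementation would also have to decide how
// references between types that land in different units are encoded.
std::vector<std::vector<TypeEntry>>
groupTypesBySize(const std::vector<TypeEntry> &Types, std::size_t MaxUnitBytes) {
  std::vector<std::vector<TypeEntry>> Units;
  std::size_t CurrentBytes = 0;
  for (const TypeEntry &T : Types) {
    if (Units.empty() || CurrentBytes + T.EstimatedDIESize > MaxUnitBytes) {
      Units.emplace_back(); // start a new artificial type unit
      CurrentBytes = 0;
    }
    Units.back().push_back(T);
    CurrentBytes += T.EstimatedDIESize;
  }
  return Units;
}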

avl added a comment.EditedJan 5 2022, 5:24 AM

So I am currently just looking at the content of the DWARF that is produced when using "--use-dlnext". Right now the main issue is that it seems types declared in an anonymous namespace in implementation files are all being moved to the "__type_table" DIE:

0x0000000b: DW_TAG_compile_unit
              DW_AT_producer	("DWARFLinker lib")
              DW_AT_language	(DW_LANG_C_plus_plus_14)
              DW_AT_name	("__type_table")
              DW_AT_stmt_list	(0x00000000)
              DW_AT_comp_dir	("")

One quick example is LVBase from APValue.cpp line 165:

namespace {
  struct LVBase {
    APValue::LValueBase Base;
    CharUnits Offset;
    unsigned PathLength;
    bool IsNullPtr : 1;
    bool IsOnePastTheEnd : 1;
  };
}

Which produces a type in the "__type_table" compile unit:

0x000000a6:   DW_TAG_namespace
                DW_AT_decl_file	("/Users/gclayton/Documents/src/lldb/clean/Debug/lib/libclangAST.a(APValue.cpp.o)")

0x000000a9:     DW_TAG_structure_type
                  DW_AT_calling_convention	(DW_CC_pass_by_value)
                  DW_AT_name	("LVBase")
                  DW_AT_byte_size	(0x20)
                  DW_AT_decl_file	("/Users/gclayton/Documents/src/lldb/clean/llvm-project/clang/lib/AST/APValue.cpp")
                  DW_AT_decl_line	(166)

0x000000b3:       DW_TAG_subprogram
                    DW_AT_name	("LVBase")
                    DW_AT_declaration	(true)
                    DW_AT_artificial	(true)
                    DW_AT_external	(true)

I would think we should leave these types in the original compile unit as that will be the only compile unit that will use these types. Right now you are moving them to the "__type_table" compile unit so they can be uniqued, but they don't need to be uniqued. This is pretty C++ specific though.

The reason why the content of anonymous namespaces (as well as types defined inside subprograms) is moved into the type table is the handling of parameterized types.
If we have a type defined inside an anonymous namespace, and this type is not moved into the type table, and this type is used as a parameter of another class, then we could not move that other class into the type table either. That other class might be used in other places and have other dependencies. In the end, the type from the anonymous namespace might prevent moving a lot of classes into the type table (the current design decision is that the type table does not have references into other compile units). Please check the following example:


cat main.cpp

char foo1 (void);
long foo2 (void);

int main ( void ) {
  return foo1() + foo2();
}

cat a.h

struct A {
  template <class T>
  T foo () {
    return T();
  }
};

cat foo1.cpp

#include "a.h"

char foo1 (void) {
  return A().foo<char>(); 
}

cat foo2.cpp

#include "a.h"

namespace {

struct B {
  long m1;
};

B internal (void) {
  return A().foo<B>();
}

}

long foo2 (void ) {
  return internal().m1;
}

In the DWARF we have the following DIE:

DW_TAG_structure_type
  DW_AT_calling_convention        (DW_CC_pass_by_value)
  DW_AT_name      ("A")
  DW_AT_byte_size (0x01)
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")
  DW_AT_decl_line (1)

  DW_TAG_subprogram
    DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")
    DW_AT_name    ("foo<(anonymous namespace)::B>")
    DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")
    DW_AT_decl_line       (3)
    DW_AT_type    (0x0000016d "(anonymous namespace)::B")
    DW_AT_declaration     (true)

    DW_TAG_template_type_parameter
      DW_AT_type  (0x0000016d "(anonymous namespace)::B")
      DW_AT_name  ("T")

    DW_TAG_formal_parameter
      DW_AT_type  (0x000001b3 "A *")
      DW_AT_artificial    (true)

    NULL
  NULL

The method "foo<(anonymous namespace)::B>" prevents moving "A"(from foo2.cpp) into the type table.
The "A" structure might contain a lot of data and many other types might depend on "A".
Finally a big piece of DWARF(from NON-anonymous namespaces) would be left in place
and is not de duplicated.

To resolve this problem, this patch moves all types into the type table, so that "anonymous"
types do not prevent other types from being deduplicated.

A smarter solution could probably be implemented, one which would "split" type "A" so that all parts not depending on anonymous types are moved into the type table while
the part related to the types defined in the anonymous namespace is left in place. That is a more complex solution and is not implemented yet. This patch implements type merging. Type splitting might be done if the current solution is not good enough.

The namespace also has the object file path as the decl file. If we are going to move the type, wouldn't it be better to use the compile unit's path like:

0x000000a6:   DW_TAG_namespace
                DW_AT_decl_file	("/Users/gclayton/Documents/src/lldb/clean/llvm-project/clang/lib/AST/APValue.cpp")

Yes, it is supposed to be the compile unit path. Will check it.

jhenderson added inline comments.Jan 5 2022, 6:18 AM
llvm/tools/llvm-dwarfdump/llvm-dwarfdump.cpp
252 ↗(On Diff #396957)

The reason why the content of anonymous namespaces (as well as types defined inside subprograms) is moved into the type table is the handling of parameterized types.
If we have a type defined inside an anonymous namespace, and this type is not moved into the type table, and this type is used as a parameter of another class, then we could not move that other class into the type table either. That other class might be used in other places and have other dependencies. In the end, the type from the anonymous namespace might prevent moving a lot of classes into the type table (the current design decision is that the type table does not have references into other compile units). Please check the following example:

You can use cross-unit references (DW_FORM_sec_offset) to have one type/DIE in one unit refer to a type/DIE in another unit. So they could still be moved to/handled in a common way, but split up into multiple units to avoid pathological cases in DWARF consumers that might tend to load whole units at a time. Would that be feasible? (not sure if such consumers also then have common pathologies where they load all units that a unit references via DW_FORM_sec_offset... if they do, then the solution becomes more difficult and would involve either putting every type in a separate unit or finding groups of related types to put together, rather than arbitrary grouping - @clayborg do you know if lldb has that sort of behavior? (where it'd try to load all referenced units from a unit it's loading))

avl added a comment.Jan 5 2022, 1:36 PM

You can use cross-unit references (DW_FORM_sec_offset) to have one type/DIE in one unit refer to a type/DIE in another unit. So they could still be moved to/handled in a common way, but split up into multiple units to avoid pathological cases in DWARF consumers that might tend to load whole units at a time. Would that be feasible? (not sure if such consumers also then have common pathologies where they load all units that a unit references via DW_FORM_sec_offset... if they do, then the solution becomes more difficult and would involve either putting every type in a separate unit or finding groups of related types to put together, rather than arbitrary grouping - @clayborg do you know if lldb has that sort of behavior? (where it'd try to load all referenced units from a unit it's loading))

If I understood correctly, you are talking here more about splitting the current global type table into several artificial type tables.
That is an additional problem, but not what I was talking about. I tried to describe the situation when a class ("A") contains information for both the global
namespace and an anonymous namespace:

DW_TAG_structure_type                                               | global&anonymous
  DW_AT_calling_convention        (DW_CC_pass_by_value)             | global&anonymous
  DW_AT_name      ("A")                                             | global&anonymous
  DW_AT_byte_size (0x01)                                            | global&anonymous
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               | global&anonymous
  DW_AT_decl_line (1)                                               | global&anonymous

  DW_TAG_subprogram                                                 | anonymous
    DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     | anonymous
    DW_AT_name    ("foo<(anonymous namespace)::B>")                 | anonymous
    DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       | anonymous
    DW_AT_decl_line       (3)                                       | anonymous
    DW_AT_type    (0x0000016d "(anonymous namespace)::B")           | anonymous
    DW_AT_declaration     (true)                                    | anonymous

    DW_TAG_template_type_parameter                                  | anonymous
      DW_AT_type  (0x0000016d "(anonymous namespace)::B")           | anonymous
      DW_AT_name  ("T")                                             | anonymous

    DW_TAG_formal_parameter                                         | anonymous
      DW_AT_type  (0x000001b3 "A *")                                | anonymous
      DW_AT_artificial    (true)                                    | anonymous

    NULL                                                            | anonymous
    
  DW_TAG_subprogram                                                 | global
    DW_AT_linkage_name    ("another member of A")                   | global
    DW_AT_name    ("another member of A")                           | global
    
  NULL                                                              | global&anonymous
Thus it should be split into two parts. One should go to the global type table:
DW_TAG_structure_type                                               
  DW_AT_calling_convention        (DW_CC_pass_by_value)             
  DW_AT_name      ("A")                                             
  DW_AT_byte_size (0x01)                                           
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")              
  DW_AT_decl_line (1)                                              

  DW_TAG_subprogram                                                
    DW_AT_linkage_name    ("another member of A")                  
    DW_AT_name    ("another member of A")                          
    
  NULL
The other part should be left in the owning compilation unit.
DW_TAG_structure_type                                             
DW_AT_calling_convention        (DW_CC_pass_by_value)             
DW_AT_name      ("A")                                             
DW_AT_byte_size (0x01)                                            
DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               
DW_AT_decl_line (1)                                               

DW_TAG_subprogram                                                 
  DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     
  DW_AT_name    ("foo<(anonymous namespace)::B>")                 
  DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       
  DW_AT_decl_line       (3)                                       
  DW_AT_type    (0x0000016d "(anonymous namespace)::B")           
  DW_AT_declaration     (true)                                    

  DW_TAG_template_type_parameter                                  
    DW_AT_type  (0x0000016d "(anonymous namespace)::B")           
    DW_AT_name  ("T")                                             

  DW_TAG_formal_parameter                                         
    DW_AT_type  (0x000001b3 "A *")                                
    DW_AT_artificial    (true)                                    

  NULL                                                            
  
NULL

That kind of splitting is not implemented by the current patch.
Also, that solution would require more space since it duplicates the structure description.

Or are you suggesting another solution for that case?

avl added a comment.Jan 5 2022, 3:03 PM

One issue I do see with this approach as well is that the "type_table" is about 69MB for a dSYM file for LLDB.framework in LLDB. Any types that are accessed from debug info in LLDB will end up causing this entire DW_TAG_compile_unit to be parsed, which can be expensive, as LLDB parses DWARF partially and only parses what it needs when loading DWARF. Is there any way to split up the "type_table" into multiple DW_TAG_compile_unit entries whose DW_AT_name matches the DW_AT_decl_file of the root type that is being uniqued? That would allow smaller compile units to be parsed and keep the memory footprint down in debuggers.

I think that it could be done in general, though I am not sure about the concrete scheme. The following are the reasons why I implemented it as a single type unit:

  1. Fragmentation. The clang binary has ~1600 compilation units. If every unit had a separate type unit, then the unit header, unit DIE, namespace DIEs, base types, and line table headers would be duplicated. Types distributed between units would also have to be duplicated (if they have a template parameter from another unit, similar to the above example for an anonymous type).
  2. Cross-CU references. The current type table does not have outside references, which allows processing the table in one pass. If we allow outside references, then we need to accumulate such inter-connected units in memory first and then process them. I do not know how that is currently implemented in lldb - probably there is some efficient scheme to handle inter-connected units.

We could probably create some scheme with several type units, e.g. similar to what David described: "a basic heuristic would be X types per unit, but a more advanced one would depend on the size of the types (& even more advanced would group related types together to benefit from more local references - but I doubt that's worth it)."

In D96035#3223521, @avl wrote:

You can use cross-unit references (DW_FORM_sec_offset) to have one type/DIE in one unit refer to a type/DIE in another unit. So they could still be moved to/handled in a common way, but split up into multiple units to avoid pathological cases in DWARF consumers that might tend to load whole units at a time. Would that be feasible? (not sure if such consumers also then have common pathologies where they load all units that a unit references via DW_FORM_sec_offset... if they do, then the solution becomes more difficult and would involve either putting every type in a separate unit or finding groups of related types to put together, rather than arbitrary grouping - @clayborg do you know if lldb has that sort of behavior? (where it'd try to load all referenced units from a unit it's loading))

If I understood correctly - you are talking here more about splitting the current global type table between several artificial type tables.
Which is an additional problem, but not what I was talking about. I tried to describe situation when a class ("A") contains information for global
namespace and for anonymous namespace:

DW_TAG_structure_type                                               | global&anonymous
  DW_AT_calling_convention        (DW_CC_pass_by_value)             | global&anonymous
  DW_AT_name      ("A")                                             | global&anonymous
  DW_AT_byte_size (0x01)                                            | global&anonymous
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               | global&anonymous
  DW_AT_decl_line (1)                                               | global&anonymous

  DW_TAG_subprogram                                                 | anonymous
    DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     | anonymous
    DW_AT_name    ("foo<(anonymous namespace)::B>")                 | anonymous
    DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       | anonymous
    DW_AT_decl_line       (3)                                       | anonymous
    DW_AT_type    (0x0000016d "(anonymous namespace)::B")           | anonymous
    DW_AT_declaration     (true)                                    | anonymous

    DW_TAG_template_type_parameter                                  | anonymous
      DW_AT_type  (0x0000016d "(anonymous namespace)::B")           | anonymous
      DW_AT_name  ("T")                                             | anonymous

    DW_TAG_formal_parameter                                         | anonymous
      DW_AT_type  (0x000001b3 "A *")                                | anonymous
      DW_AT_artificial    (true)                                    | anonymous

    NULL                                                            | anonymous
    
  DW_TAG_subprogram                                                 | global
    DW_AT_linkage_name    ("another member of A")                   | global
    DW_AT_name    ("another member of A")                           | global
    
  NULL                                                              | global&anonymous
Thus it should be split into two parts. One should go to the global type table:
DW_TAG_structure_type                                               
  DW_AT_calling_convention        (DW_CC_pass_by_value)             
  DW_AT_name      ("A")                                             
  DW_AT_byte_size (0x01)                                           
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")              
  DW_AT_decl_line (1)                                              

  DW_TAG_subprogram                                                
    DW_AT_linkage_name    ("another member of A")                  
    DW_AT_name    ("another member of A")                          
    
  NULL
The other part should be left in the owning compilation unit.
DW_TAG_structure_type                                             
DW_AT_calling_convention        (DW_CC_pass_by_value)             
DW_AT_name      ("A")                                             
DW_AT_byte_size (0x01)                                            
DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               
DW_AT_decl_line (1)                                               

DW_TAG_subprogram                                                 
  DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     
  DW_AT_name    ("foo<(anonymous namespace)::B>")                 
  DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       
  DW_AT_decl_line       (3)                                       
  DW_AT_type    (0x0000016d "(anonymous namespace)::B")           
  DW_AT_declaration     (true)                                    

  DW_TAG_template_type_parameter                                  
    DW_AT_type  (0x0000016d "(anonymous namespace)::B")           
    DW_AT_name  ("T")                                             

  DW_TAG_formal_parameter                                         
    DW_AT_type  (0x000001b3 "A *")                                
    DW_AT_artificial    (true)                                    

  NULL                                                            
  
NULL

That kind of splitting is not implemented by the current patch.
Also, that solution would require more space since it duplicates the structure description.

Or are you suggesting another solution for that case?

Oh, sorry, I wasn't discussing that case/didn't understand that's what was being discussed. I think for that case it's fine to move the translation-unit-local/private linkage/anonymous-ish part of the type into the global type table along with the rest of the type.

I think @clayborg's point was mostly for a purely translation-unit-local entity it could stay in the translation unit that defined it - reducing the need for cross-CU referencing to refer to the type and for work needed to move the type around, etc.

In D96035#3223734, @avl wrote:

One issue I do see with this approach as well is that the "type_table" is about 69MB for a dSYM file for LLDB.framework in LLDB. Any types that are accessed from debug info in LLDB will end up causing this entire DW_TAG_compile_unit to be parsed, which can be expensive, as LLDB parses DWARF partially and only parses what it needs when loading DWARF. Is there any way to split up the "type_table" into multiple DW_TAG_compile_unit entries whose DW_AT_name matches the DW_AT_decl_file of the root type that is being uniqued? That would allow smaller compile units to be parsed and keep the memory footprint down in debuggers.

I think that it could be done in general, though I am not sure about the concrete scheme. The following are the reasons why I implemented it as a single type unit:

  1. Fragmentation. The clang binary has ~1600 compilation units. If every unit had a separate type unit, then the unit header, unit DIE, namespace DIEs, base types, and line table headers would be duplicated. Types distributed between units would also have to be duplicated (if they have a template parameter from another unit, similar to the above example for an anonymous type).

Yeah, I think one type per unit would be more expensive than would be desirable - like type units which have a lot of overhead for this sort of reason.

I don't understand the second part "Types distributed between units should also be duplicated" - types can refer to types in other units (using DW_FORM_sec_offset the same as the way that types will be referenced from the non-type contexts where the type needs to be referred to (eg: a compilation unit with a subprogram that needs to refer to the return type in the global type table/type unit/whatever)), I wouldn't advocate for any duplication.

  2. Cross-CU references. The current type table does not have outside references, which allows processing the table in one pass. If we allow outside references, then we need to accumulate such inter-connected units in memory first and then process them. I do not know how that is currently implemented in lldb - probably there is some efficient scheme to handle inter-connected units.

If the types can all be generated in one pass today - they could be generated in one pass over multiple units as well. I don't follow why that'd require multiple passes. (laying out one unit with multiple types requires laying out all the DIEs to know the values of unit-relative offsets - putting those DIEs in several units doesn't make that situation worse in terms of the amount of data that needs to be kept around - it'd mean laying out multiple units to determine where the offset values are across those units)

We could probably create some scheme with several type units, e.g. similar to what David described: "a basic heuristic would be X types per unit, but a more advanced one would depend on the size of the types (& even more advanced would group related types together to benefit from more local references - but I doubt that's worth it)."

Yep - X types would be a good first approximation. Doing "X bytes or DIEs per grouping" probably wouldn't be too expensive either.

avl added a comment.Jan 6 2022, 4:51 AM

If the types can all be generated in one pass today - they could be generated in one pass over multiple units as well. I don't follow why that'd require multiple passes. (laying out one unit with multiple types requires laying out all the DIEs to know the values of unit-relative offsets - putting those DIEs in several units doesn't make that situation worse in terms of the amount of data that needs to be kept around - it'd mean laying out multiple units to determine where the offset values are across those units)

Oh sorry, I did not write that clearly. My statement is not about generation; it is about reading the generated DWARF. If the resulting DWARF were closely coupled by inter-CU references, it would be hard to read and process. It would also be hard to implement parallel processing of separate compile units (because the units depend on each other). To process the DWARF efficiently, it would be necessary to load nearly all compile and type units into memory. An extreme scenario of such usage is applying DWARFLinker to already-linked DWARF.

With the current implementation, only the "type_table" unit needs to stay loaded; other units can be loaded/unloaded independently (because the "type table" cannot reference other compile and type units).
If we generated DWARF that has type units referencing other type units and ordinary compile units, then, in the worst case, we would need to load nearly all of them to be able to process them (and since the dependencies are not known upfront, we would need one pass to discover the dependencies and then another pass which loads and processes the units).

Probably lldb does not have processing which requires loading all dependent units at once, and thus it is not so important for lldb.

But from the point of view of a general DWARF reader, it is better to have less coupled DWARF. That is why the current design and implementation of the patch assume that the "type table" does not reference other units.

At first glance, we might divide types into independent groups, so that it would still be possible to have no references between type units
and to load smaller groups at once.

avl added a comment.Jan 6 2022, 2:10 PM

I don't understand the second part "Types distributed between units should also be duplicated" - types can refer to types in other units (using DW_FORM_sec_offset the same as the way that types will be referenced from the non-type contexts where the type needs to be referred to (eg: a compilation unit with a subprogram that needs to refer to the return type in the global type table/type unit/whatever)), I wouldn't advocate for any duplication.

The case with a parameterized function member does not have a good solution if it is separated between compilation units.
It either requires duplicating part of the description (that is the solution I was talking about when trying to avoid inter-CU references),
or introduces an inter-CU reference (that is the solution you are talking about (DW_FORM_sec_offset/DW_FORM_ref_addr)).
I think it would be good if we used a solution that does not require duplicating data AND does not
require using inter-CU references (other than non-typeCU->typeCU):

cat main.cpp

char foo1 (void);
long foo2 (void);

int main ( void ) {
  return foo1() + foo2();
}

cat a.h

struct A {
  template <class T>
  T foo () {
    return T();
  }

  int createi () {
    return 0;
   }

  float createf ( ) {
    return 0.0;
  }
};

cat foo1.cpp

#include "a.h"

char foo1 (void) {
  return A().foo<char>(); 
}

cat foo2.cpp

#include "a.h"

namespace name1 {

struct B {
  long m1;
};

B internal (void) {
  return A().foo<B>();
}

}

long foo2 (void ) {
  return name1::internal().m1;
}

int foo3 () {

 A var;

 return var.createi() + var.createf();
}

If struct "A" and struct B are placed into the different compilation units, then we have two alternatives on how to represent struct "A".
First alternative assumes duplication of struct "A" definition while second alternative assumes inter-CU reference:

  1. The first compilation unit("foo1.cpp") does not have a definition for "foo<name1::B>". The definition for "foo<name1::B>" is located into the second compilation unit("foo2.cpp"). In this case we have a duplicated definition of struct "A".
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type   <<<<<
    DW_AT_name "A"        <<<<<

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_structure_type    <<<<<<
    DW_AT_name "A"         <<<<<<
    
    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"

  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type
    DW_AT_name "B"

  NULL
  2. All definitions are inside the first unit ("foo1.cpp"), but we have a cross-CU reference from the first unit to the second.
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type
    DW_AT_name "A"

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"   <<<<<<

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type   <<<<<<
    DW_AT_name "B"

  NULL

The solution from this patch solves both problems by putting all the data in a single compilation unit.
Thus we do not need to duplicate data, and we create only non-typeCU->typeCU inter-CU references.
If we want to split the "type table" into a set of smaller units to minimize peak memory usage, then we could probably
group types so that all dependent types are located in a single unit.
That way we could probably avoid duplication and inter-CU references.
Avoiding inter-CU references might help to create a more efficient implementation of a DWARF reader.

avl added a comment.Jan 10 2022, 8:41 AM

@dblaikie: I think @clayborg's point was mostly for a purely translation-unit-local entity it could stay in the translation unit that defined it - reducing the need for cross-CU referencing to refer to the type and for work needed to move the type around, etc.

Right, a purely translation-unit-local entity might be left in the original compile unit (not moved into the type table).
The current implementation (for simplicity) cannot do this. This might be done as an improvement.

@clayborg: These imported declarations amounted to 3% of overall CU size for a single compile unit that I tested. I know that the stock dsymutil whitelists all of these, which causes these and all the types they reference to be kept. So if we can get rid of any DW_TAG_imported_declaration DIEs that are not referenced, we might also be able to get rid of the types that they reference, as well as all types from C++ header files that have corresponding C header files (like "cstdlib" -> "stdlib.h"); those will have this issue and waste space.

@dblaikie: So imported modules are never referenced from other DWARF - if you want any of them to persist you have to treat them all as roots, you can't GC them.
@dblaikie: And for imported declarations, at least those generated from Clang you have to do the same until that bug is fixed in Clang to actually reference the using declarations.

@dblaikie: In any case I assume this issue is mostly orthogonal to this proposal (ie: I expect that's an existing issue in dsymutil, not a new regression introduced by this patch), yeah?

This patch should leave imported declarations in place, i.e. they are placed similarly to how the current dsymutil places them.

In D96035#3226125, @avl wrote:

I don't understand the second part "Types distributed between units should also be duplicated" - types can refer to types in other units (using DW_FORM_sec_offset the same as the way that types will be referenced from the non-type contexts where the type needs to be referred to (eg: a compilation unit with a subprogram that needs to refer to the return type in the global type table/type unit/whatever)), I wouldn't advocate for any duplication.

The case with a parameterized function member does not have a good solution if it is separated between compilation units.
It either requires duplicating part of the description (that is the solution I was talking about when trying to avoid inter-CU references),
or introduces an inter-CU reference (that is the solution you are talking about (DW_FORM_sec_offset/DW_FORM_ref_addr)).
I think it would be good if we used a solution that does not require duplicating data AND does not
require using inter-CU references (other than non-typeCU->typeCU):

cat main.cpp

char foo1 (void);
long foo2 (void);

int main ( void ) {
  return foo1() + foo2();
}

cat a.h

struct A {
  template <class T>
  T foo () {
    return T();
  }

  int createi () {
    return 0;
   }

  float createf ( ) {
    return 0.0;
  }
};

cat foo1.cpp

#include "a.h"

char foo1 (void) {
  return A().foo<char>(); 
}

cat foo2.cpp

#include "a.h"

namespace name1 {

struct B {
  long m1;
};

B internal (void) {
  return A().foo<B>();
}

}

long foo2 (void ) {
  return name1::internal().m1;
}

int foo3 () {

 A var;

 return var.createi() + var.createf();
}

If struct "A" and struct B are placed into the different compilation units, then we have two alternatives on how to represent struct "A".
First alternative assumes duplication of struct "A" definition while second alternative assumes inter-CU reference:

  1. The first compilation unit("foo1.cpp") does not have a definition for "foo<name1::B>". The definition for "foo<name1::B>" is located into the second compilation unit("foo2.cpp"). In this case we have a duplicated definition of struct "A".
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type   <<<<<
    DW_AT_name "A"        <<<<<

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_structure_type    <<<<<<
    DW_AT_name "A"         <<<<<<
    
    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"

  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type
    DW_AT_name "B"

  NULL
  2. All definitions are inside the first unit ("foo1.cpp"), but we have a cross-CU reference from the first unit to the second.
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type
    DW_AT_name "A"

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"   <<<<<<

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type   <<<<<<
    DW_AT_name "B"

  NULL

The solution from this patch solves both problems by putting all the data in a single compilation unit.
Thus we do not need to duplicate data, and we create only non-typeCU->typeCU inter-CU references.
If we want to split the "type table" into a set of smaller units to minimize peak memory usage, then we could probably
group types so that all dependent types are located in a single unit.
That way we could probably avoid duplication and inter-CU references.
Avoiding inter-CU references might help to create a more efficient implementation of a DWARF reader.

This doesn't address the case where struct A is a local type (in an anonymous namespace). Moving it out of the translation/compile unit it's defined in will break the representation (because the name is only valid within that compile unit - so you might end up pulling two types with the same name into the type table unit, making it hard for a DWARF consumer to do correct name lookup/know which version of the type the user is talking about in a given context).

What we could do is a third option - not duplicating the whole definition of A, but only part of the definition.

Take a look at the DWARF Clang produces for something like this, for instance:

struct t1 {
  virtual void f1(); // home this type into another translation unit
  template<typename T>
  void f1() { }
};
namespace {
struct t2 { };
}
int main() {
  t1().f1<t2>();
}

The DWARF for this will include a declaration of t1 (so, some duplication - but not all the members, etc.) with a declaration of the f1<t2> member in that t1 declaration (& then an out-of-line definition that refers to that declaration) (in this case we could potentially also include a DW_AT_specification in the declaration that points to the definition in the type table, if that helps consumers significantly - so they don't have to do name lookup to figure out that they're the same thing).
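
Roughly, that output has the following shape (a hand-written sketch with offsets and most attributes elided, not actual dwarfdump output):

DW_TAG_compile_unit
  DW_TAG_structure_type
    DW_AT_name ("t1")
    DW_AT_declaration (true)                 // declaration only; t1 is homed in another translation unit

    DW_TAG_subprogram                        // declaration of the member f1<t2>
      DW_AT_name ("f1<(anonymous namespace)::t2>")
      DW_AT_declaration (true)

      DW_TAG_template_type_parameter
        DW_AT_type ("(anonymous namespace)::t2")

  DW_TAG_namespace                           // anonymous namespace
    DW_TAG_structure_type
      DW_AT_name ("t2")

  DW_TAG_subprogram                          // out-of-line definition
    DW_AT_specification (-> the f1<t2> declaration above)
    DW_AT_low_pc/DW_AT_high_pc (...)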

Speaking of all that - how are member function definitions working in this current proposal? Do they use sec_offsets to refer to the declaration in the type table, or do they already do something like ^ with a type declaration in the same unit to house the member function declaration?

avl added a comment.Jan 11 2022, 5:04 AM

In D96035#3232592, @dblaikie wrote:

This doesn't address the case where struct A is a local type (in an anonymous namespace). Moving it out of the translation/compile unit it's defined in will break the representation (because the name is only valid within that compile unit - so you might end up pulling two types with the same name into the type table unit, making it hard for a DWARF consumer to do correct name lookup/know which version of the type the user is talking about in a given context).

This patch puts such types into the type table separately. It creates two different namespaces with different DW_AT_decl_file attributes:

DW_TAG_compile_unit
  DW_AT_name "__type_table"
  
  DW_TAG_namespace
    DW_AT_decl_file "compile_unit1.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

  DW_TAG_namespace
    DW_AT_decl_file "compile_unit2.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

NULL

What we could do is a third option - not duplicating the whole definition of A, but only part of the definition.

Take a look at the DWARF Clang produces for something like this, for instance:

struct t1 {
  virtual void f1(); // home this type into another translation unit
  template<typename T>
  void f1() { }
};
namespace {
struct t2 { };
}
int main() {
  t1().f1<t2>();
}

The DWARF for this will include a declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration (& then an out of line definition that refers to that declaration) (in this case we could potentially also include a DW_AT_specification in the declaration that points to the definition in the type table, if that helps consumers significantly - so they don't have to do name lookup to figure out that they're the same thing).

Yes, I assumed a similar approach here - https://reviews.llvm.org/D96035#3222090 (the part about type splitting).

If type splitting were supported, this approach would allow leaving anonymous types inside the compilation unit at the cost of a little duplication and without unnecessary inter-CU references. It would also allow decreasing the overall size of the "type table". The current patch does not support type splitting. I suggest implementing it as an additional improvement.

Type splitting is necessary because the originally generated "declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration" might contain other members. So we need to move these other members into the type table and leave the "declaration of the f1<t2> member" in place. Otherwise (because the other members would stay non-deduplicated), the size of the resulting DWARF would be bigger.

Speaking of all that - how are member function definitions working in this current proposal? Do they use sec_offsets to refer to the declaration in the type table, or do they already do something like ^ with a type declaration in the same unit to house the member function declaration?

All member function definitions are merged into the type which is in the type table. Later they are referenced (using DW_FORM_ref_addr) from the compile units:

DW_TAG_compile_unit
  DW_AT_name "__type_table"

  DW_TAG_structure_type
    DW_AT_name "t1"
      
    DW_TAG_subprogram
      DW_AT_name "t1"
    NULL

    DW_TAG_subprogram
      DW_AT_name "f1"
      
      DW_TAG_template_type_parameter
        DW_AT_type "__type_table::main.cpp::t2"  // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
    NULL
  NULL
  
  DW_TAG_namespace
    DW_AT_decl_file "main.cpp"
    
    DW_TAG_struct_type
      DW_AT_name "t2"
    NULL
  NULL

NULL

DW_TAG_compile_unit
  DW_AT_name "main.cpp"
  
  DW_TAG_subprogram
    DW_AT_name "main"
  NULL

  DW_TAG_subprogram
    DW_AT_name "t1"
    DW_AT_specification "__type_table::t1::t1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

  DW_TAG_subprogram
    DW_AT_name "f1"
    DW_AT_specification "__type_table::t1::f1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

NULL
In D96035#3234040, @avl wrote:

In D96035#3232592, @dblaikie wrote:

This doesn't address the case where struct A is a local type (in an anonymous namespace). Moving it out of the translation/compile unit it's defined in will break the representation (because the name is only valid within that compile unit - so you might end up pulling two types with the same name into the type table unit, making it hard for a DWARF consumer to do correct name lookup/know which version of the type the user is talking about in a given context).

This patch puts such types into the type table separately. It creates two different namespaces with different DW_AT_decl_file attributes:

DW_TAG_compile_unit
  DW_AT_name "__type_table"
  
  DW_TAG_namespace
    DW_AT_decl_file "compile_unit1.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

  DW_TAG_namespace
    DW_AT_decl_file "compile_unit2.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

NULL

Yeah, I don't think that's enough to address the scoping issue - which file they're written in doesn't define their scope of name lookup, unfortunately. (eg: you could #include a file into another file - the file the code is written in doesn't tell you whether another entity defined in a different file can see that name (because the files could be included into the same translation unit)) - local types, I think, have to stay in the compile unit that defined them - I don't think there's any other solution to the language scoping.

What we could do is a third option - not duplicating the whole definition of A, but only part of the definition.

Take a look at the DWARF Clang produces for something like this, for instance:

struct t1 {
  virtual void f1(); // home this type into another translation unit
  template<typename T>
  void f1() { }
};
namespace {
struct t2 { };
}
int main() {
  t1().f1<t2>();
}

The DWARF for this will include a declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration (& then an out of line definition that refers to that declaration) (in this case we could potentially also include a DW_AT_specification in the declaration that points to the definition in the type table, if that helps consumers significantly - so they don't have to do name lookup to figure out that they're the same thing).

Yes, I assumed a similar approach here - https://reviews.llvm.org/D96035#3222090 (the part about type splitting).

If type splitting were supported, this approach would allow leaving anonymous types inside the compilation unit at the cost of a little duplication and without unnecessary inter-CU references. It would also allow decreasing the overall size of the "type table". The current patch does not support type splitting. I suggest implementing it as an additional improvement.

Type splitting is necessary because the originally generated "declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration" might contain other members. So we need to move these other members into the type table and leave the "declaration of the f1<t2> member" in place. Otherwise (because the other members would stay non-deduplicated), the size of the resulting DWARF would be bigger.

Yep, that sounds right to me. To the best of my knowledge, the sort of things that may need to get "left behind" in the compilation unit/not moved to the type table, would be member templates like the one we've been discussing, and nested types (eg: you could have a pimpl idiom - struct t1 { struct t2; }; ... namespace { struct t3 { }; } struct t1::t2 { t3 m; }; so the whole definition of t2 would have to stay in the compilation unit with t3, I think - because it needs to reference t3. The type table could include a declaration of t2, since t1 might include pointers to t2 or other references that only require a declaration)
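
Formatted, that pimpl-style example reads as follows (the comments annotate which pieces could move to the type table and which must stay, based on the reasoning above):

struct t1 {
  struct t2; // only a declaration is needed here, so a declaration of t2
             // can accompany t1 in the type table
};

namespace {
struct t3 {}; // translation-unit-local type: must stay in its original CU
}

struct t1::t2 {
  t3 m; // the definition of t2 references t3, so the whole definition of t2
        // has to stay in the compilation unit that owns t3
};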

The nice thing about this particular way of splitting - is that there's loads of existing precedent for it. This sort of DWARF (declarations with members) is generated by clang all the time for homed debug info. (though, that said, lldb isn't /super/ fond of this sort of debug info - so it wouldn't surprise me if it'd need some fixes to handle this sort of thing in a dsym, but we might be lucky there & it already works because at least all the types are still defined within the single dsym)

Speaking of all that - how are member function definitions working in this current proposal? Do they use sec_offsets to refer to the declaration in the type table, or do they already do something like ^ with a type declaration in the same unit to house the member function declaration?

All member function definitions are merged into the type which is in the type table. Later they are referenced (using DW_FORM_ref_addr) from the compile units:

DW_TAG_compile_unit
  DW_AT_name "__type_table"

  DW_TAG_structure_type
    DW_AT_name "t1"
      
    DW_TAG_subprogram
      DW_AT_name "t1"
    NULL

    DW_TAG_subprogram
      DW_AT_name "f1"
      
      DW_TAG_template_type_parameter
        DW_AT_type "__type_table::main.cpp::t2"  // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
    NULL
  NULL
  
  DW_TAG_namespace
    DW_AT_decl_file "main.cpp"
    
    DW_TAG_struct_type
      DW_AT_name "t2"
    NULL
  NULL

NULL

DW_TAG_compile_unit
  DW_AT_name "main.cpp"
  
  DW_TAG_subprogram
    DW_AT_name "main"
  NULL

  DW_TAG_subprogram
    DW_AT_name "t1"
    DW_AT_specification "__type_table::t1::t1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

  DW_TAG_subprogram
    DW_AT_name "f1"
    DW_AT_specification "__type_table::t1::f1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

NULL

Ah, OK - thanks for the details!

(I should say, whether or not it's necessary/prerequisite to handle the anonymous types more accurately, through type splitting - I don't have an opinion on - dsymutil isn't something I use/have any investment in - I'm just discussing the DWARF representational tools and techniques that might be used to address some of these limitations. Folks who own/maintain/ship dsymutil are the ones to make any final choices about what is/isn't acceptable here)

avl updated this revision to Diff 400601.Jan 17 2022, 10:16 AM

rebased. removed usages of NonRelocatableStringpool.

ormris removed a subscriber: ormris.Jan 18 2022, 9:38 AM
avl updated this revision to Diff 404549.Jan 31 2022, 8:40 AM

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single-thread mode, 2.4x faster in multi-thread mode).
implemented types splitting.

avl edited the summary of this revision.Jan 31 2022, 8:42 AM
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single-thread mode, 2.4x faster in multi-thread mode).

1.3x slower means what took 1 second before now takes 2.3 seconds?

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs? Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single-thread mode, 2.4x faster in multi-thread mode).

1.3x slower means what took 1 second before now takes 2.3 seconds?

(https://randomascii.wordpress.com/2018/02/04/what-we-talk-about-when-we-talk-about-performance/ - the ambiguity made me start to use "1.3x as fast")

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs? Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

avl added a comment.EditedJan 31 2022, 11:13 AM
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single-thread mode, 2.4x faster in multi-thread mode).

1.3x slower means what took 1 second before now takes 2.3 seconds?

No, sorry, I meant what took 1 second before now takes 1*1.3 = 1.3 sec.

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs?

Yes, anonymous/non-linkage types are correctly kept in their original CUs. These are the tests for that functionality:

odr-nested-types1.test
odr-types-in-subprogram1.test

Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

Splitting up types into chunks is not implemented yet. I think it would take some time. I would prefer to implement it as a separate improvement to this patch.

In D96035#3284921, @avl wrote:
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single-thread mode, 2.4x faster in multi-thread mode).

1.3x slower means what took 1 second before now takes 2.3 seconds?

No, sorry, I meant what took 1 second before now takes 1*1.3 = 1.3 sec.

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs?

Yes, anonymous/non-linkage types are correctly kept in their original CUs. These are the tests for that functionality:

odr-nested-types1.test
odr-types-in-subprogram1.test

Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

Splitting up types into chunks is not implemented yet. I think it would take some time. I would prefer to implement it as a separate improvement to this patch.

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

In D96035#3284921, @avl wrote:
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single-thread mode, 2.4x faster in multi-thread mode).

1.3x slower means what took 1 second before now takes 2.3 seconds?

No, sorry, I meant what took 1 second before now takes 1*1.3 = 1.3 sec.

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs?

Yes, anonymous/non-linkage types are correctly kept in their original CUs. These are the tests for that functionality:

odr-nested-types1.test
odr-types-in-subprogram1.test

Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

Splitting up types into chunks is not implemented yet. I think it would take some time. I would prefer to implement it as a separate improvement to this patch.

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be.

I think the major reason it'd be resisted would be the cost - increasing dsymutil time significantly would be a hard sell. I think it's pretty understandable that the sort of parallel processing @avl's experimenting with here wouldn't be able to do that efficiently.

Arbitrarily grouping types into smallish buckets to reduce the overhead for DWARF consumers seems more feasible, though, I think.

avl added a comment.Feb 1 2022, 6:58 AM

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

There are some disadvantages to creating additional compilation units for each source compile unit:

  1. Fragmentation and duplication. It would be necessary to duplicate the unit header, unit DIE, namespace DIEs, base types, line table headers, line table files, and abbreviation table. clang has approximately 1600 compilation units, so we would need to duplicate all of the above information for each of them. In the end, we might lose some of the DWARF size gains.
  2. Closely coupled references. If all types were placed in separate compilation units matched with their original unit of declaration, the types would reference each other. As a result, it would be hard to process such units in parallel (independently). This limits the speedup that can be achieved by parallelization. This patch tries to avoid cross-CU references; only one kind is allowed: non-type-CU -> type-CU.

What about the following solution: the current type table unit (let's say 60MB) would be divided into several buckets (let's say 16) of independent types. Each bucket would be placed in a separate artificial compilation unit, so that there would be no references between units and not a lot of duplicated information. The size of each separate type unit would be around 4MB (which would help lldb to not parse too much at once). Could this be a good solution? It looks like it keeps the benefits (small final size of the overall DWARF file, simple references, small size of each compile unit). It would probably also help to speed up multi-threaded execution of DWARFLinker (if all type units were generated in parallel), but I am afraid it would slow down single-threaded execution.

In D96035#3287393, @avl wrote:

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

There are some disadvantages with creating additional compilation units for each source compile unit:

  1. Fragmentation and duplication. It would be necessary to duplicate: unit header, unit die, namespace dies, base types, line table headers, line table files, abbreviation table. clang has approx 1600 compilation units. So we need to duplicate all the above information for each of them. At the end of all, we might lose some DWARF size achievements.

Oh, yeah, too many units sounds unfortunate for sure. Currently we'd have one CU per .cpp file, and with "putting types in a synthetic unit that matches their decl_file" we'd have an additional one CU per .h file - which, admittedly, is probably a bit more than doubling the number of units. (template instantiations, for instance, might all be grouped together in the same unit, since they're defined in the same header)

I don't know that this makes the DWARF 'the best it can be' - it's not clear to me what expressive power is provided by the types being in units that match source file names (indeed the information is provided in the DWARF either way - a consumer can treat types defined in a certain file in a certain way even if they're all in one big CU).

  2. Closely coupled references. If all types were placed in separate compilation units matching the original unit of declaration, the types would reference each other. As a result, it would be hard to process such units in a parallel manner (independently), which limits the acceleration that can be achieved by parallelization. This patch tries to avoid cross-CU references; only one kind is allowed: non-type-CU -> type-CU.

Yeah.

What about the following solution: Current type table unit(let`s say 60MB) would be divided into several buckets(let`s say 16) of independent types. Each bucket is placed in separate artificial compilation unit. So that there would not be references between units, there would not be a lot of duplicated information. The size of each separate type unit would be around 4MB(it would help to lldb to not parse much). Can this be a good solution? It looks like it allows to keep benefits(small final size of overall DWARF file, simple references, small size of each compile unit). It also would probably help to speed up multi-thread execution of DWARFLinker(if all type units would be generated in parallel) but I am afraid it would slow down single-thread execution.

Finding independent buckets of types sounds difficult/algorithmically complicated. But maybe that's feasible? I'm not sure. I was thinking more "emit all the types the same way you do currently, except into multiple unit "Chunks"" - ie: the code already handles type-to-type references within the single type-CU, so I don't understand (maybe I'm missing something) why it would be difficult to treat that "type-CU" as actually being "multiple type CUs" with arbitrary cross-referencing within that collection of type CUs. Then the chunks/buckets are chosen arbitrarily - admittedly that means longer encoding (sec_offset references are longer than the unit-relative references) than if you can group the types together into isolated/only-self-referencing groups - so maybe the extra space savings is worth the work to create those isolated groups? Naively, I would not have expected it to be worth that much.

avl added a comment.Feb 2 2022, 5:50 AM

Finding independent buckets of types sounds difficult/algorithmically complicated. But maybe that's feasible? I'm not sure.

I do not have a ready implementation, so I do not know whether such an implementation would be acceptable, or even whether it is possible to divide the whole type table into 16 independent parts. That needs some investigation.

I was thinking more "emit all the types the same way you do currently, except into multiple unit "Chunks"" - ie: the code already handles type-to-type references within the single type-CU, so I don't understand (maybe I'm missing something) why it would be difficult to treat that "type-CU" as actually being "multiple type CUs" with arbitrary cross-referencing within that collection of type CUs. Then the chunks/buckets are chosen arbitrarily - admittedly that means longer encoding (sec_offset references are longer than the unit-relative references) than if you can group the types together into isolated/only-self-referencing groups - so maybe the extra space savings is worth the work to create those isolated groups? Naively, I would not have expected it to be worth that much.

The current implementation deals with one big type unit and loads all types into memory. A similar implementation could be done for "multiple type CUs". But if we had several independent type CUs, we might be able to avoid loading all of them into memory at once, and instead process them separately and discard them. That would reduce memory requirements.

So yes, we could change the current implementation to generate several type CUs referencing each other. It would have the same properties as the current implementation, except that lldb would be able to load it in smaller chunks. If we were able to divide the whole type unit into several independent parts, we could use a more efficient implementation (one that uses less memory, for example).

avl added a comment.Feb 2 2022, 5:58 AM

I received the following message in my mailbox from @clayborg. I believe it was sent here but did not appear because of some error, so I am duplicating it here.

On Feb 1, 2022, at 9:26 AM, David Blaikie via Phabricator <reviews@reviews.llvm.org> wrote:

dblaikie added a comment.

In D96035#3287393 https://reviews.llvm.org/D96035#3287393, @avl wrote:

In D96035#3285620 https://reviews.llvm.org/D96035#3285620, @clayborg wrote:

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

There are some disadvantages with creating additional compilation units for each source compile unit:

  1. Fragmentation and duplication. It would be necessary to duplicate: unit header, unit die, namespace dies, base types, line table headers, line table files, abbreviation table. clang has approx 1600 compilation units. So we need to duplicate all the above information for each of them. At the end of all, we might lose some DWARF size achievements.

base types wouldn't need to be duplicated right? We can put any types with no decl file into the current type unit you created and have everyone use absolute links to them? I was not a fan of the abbreviation tables being split up, I believe the llvm dsymutil has done this for a while now, as it leads to duplication as you already mention. I would prefer one big .debug_abbrev section if possible so all compile units can re-use it, but I understand that this would slow things down.

Oh, yeah, too many units sounds unfortunate for sure. Currently we'd have one CU per .cpp file, and with "putting types in a synthetic unit that matches their decl_file" we'd have an additional one CU per .h file - which, admittedly, is probably a bit more than doubling the number of units. (template instantiations, for instance, might all be grouped together in the same unit, since they're defined in the same header)

So 11 bytes for a compile unit plus a few for the main DIE doesn't sound too bad to me when everything would be perfectly organized in the DWARF. Since all types would have matching DW_AT_decl_file attributes, the line table should be very small, with hopefully just a few files. I would love to see a file done like this just so we can compare the sizes of an actual file to see if we are using up that much more space.

I don't know that this makes the DWARF 'the best it can be' - it's not clear to me what expressive power is provided by the types being in units that match source file names (indeed the information is provided in the DWARF either way - a consumer can treat types defined in a certain file in a certain way even if they're all in one big CU).

DWARF has always been a "shove a bunch of stuff into one big blob per compile unit" format with no regard to how it will be used in the linking phase since DWARF was created. Linkers are expected to concatenate and relocate and call it a day. This leaves a ton of debug info in the linked executable that isn't useful, like functions that should have been dead stripped, but weren't because of the limitations of the DWARF format and what can be done quickly by linkers that don't understand the format. It is not easy to do anything with DWARF and it isn't a great format for just accessing a thing or two here and there since it is a serialized stream of consciousness of what is in a compile unit with no real organization other than "I just put it all in one bucket".

I love the idea of having the DWARF being really well organized from a consumer stand point. Not many producers are thinking about how this information is consumed, just how quick and fast we can make it or link it. There is definitely value to doing things quickly, don't get me wrong, so for the compile/edit/debug cycle we need to keep things fast. But for when you want to create a build and archive it, I would rather spend a few more cycles getting the DWARF in the best shape it can be in so that it can be consumed efficiently by symbolication servers that want to use it, or debuggers to debug a released build than saving a few seconds or bytes in the file.

  2. Closely coupled references. If all types were placed in separate compilation units matching the original unit of declaration, the types would reference each other. As a result, it would be hard to process such units in a parallel manner (independently), which limits the acceleration that can be achieved by parallelization. This patch tries to avoid cross-CU references; only one kind is allowed: non-type-CU -> type-CU.

Currently you have one compile unit that all threads must access and register their types with right? I don't see how splitting this up would cause more delays than all threads vying for access to the one and only type unit? Seems like it would actually reduce lock contention, but I haven't fully looked at your current solution, so forgive my saying so if this is way off.

Why are we trying to avoid cross CU references with DW_FORM_ref_addr? This is placing the emphasis on linking speed once again over how this information will be accessed by consumers. And one might say that bad DWARF organization can cause more delays when these DWARF files are used for debugging or symbolication by having to parse all type DIEs for all types just because it was faster to create the DWARF file in the first place.

Yeah.

What about the following solution: Current type table unit(let`s say 60MB) would be divided into several buckets(let`s say 16) of independent types. Each bucket is placed in separate artificial compilation unit. So that there would not be references between units, there would not be a lot of duplicated information. The size of each separate type unit would be around 4MB(it would help to lldb to not parse much). Can this be a good solution? It looks like it allows to keep benefits(small final size of overall DWARF file, simple references, small size of each compile unit). It also would probably help to speed up multi-thread execution of DWARFLinker(if all type units would be generated in parallel) but I am afraid it would slow down single-thread execution.

Does anyone care about single-threaded execution time? I personally don't.

Finding independent buckets of types sounds difficult/algorithmically complicated. But maybe that's feasible? I'm not sure. I was thinking more "emit all the types the same way you do currently, except into multiple unit "Chunks"" - ie: the code already handles type-to-type references within the single type-CU, so I don't understand (maybe I'm missing something) why it would be difficult to treat that "type-CU" as actually being "multiple type CUs" with arbitrary cross-referencing within that collection of type CUs. Then the chunks/buckets are chosen arbitrarily - admittedly that means longer encoding (sec_offset references are longer than the unit-relative references) than if you can group the types together into isolated/only-self-referencing groups - so maybe the extra space savings is worth the work to create those isolated groups? Naively, I would not have expected it to be worth that much.

DW_FORM_ref_addr is 32 bits for DWARF32, and not many use DWARF64 as it has other size increases, so using these to do cross CU references is just as cheap as a CU relative reference. And I haven't seen any compilers take advantage of using DW_FORM_ref1 or DW_FORM_ref2, they just use DW_FORM_ref4 by default which is the same size as DW_FORM_ref_addr.

I would love to see the "each type gets put into their own compile unit based on the DW_AT_decl_file of each type", just to see what this does to the size of the DWARF compared to the current solution. If the size increase is too much, I would drop my suggestion. Is trying this out hard to do if we don't worry at all about performance?

One other solution would be to group types into common subdirectories based off of the DW_AT_decl_file. The idea being that all STL types could be in one common compile unit for anything with a DW_AT_decl_file that starts with a common directory prefix? This would allow the "vector", "list" and "map" types of the STL to all be in a specific type compile unit? So the idea would be to strip the file base name from the DW_AT_decl_file and group all types from all header files in that directory into the same type unit?

avl added a comment.Feb 2 2022, 7:37 AM

base types wouldn't need to be duplicated right? We can put any types with no decl file into the current type unit you created and have everyone use absolute links to them?

I am currently implementing an optimization where basic types are always placed at the start of the current CU so that a DW_FORM_ref1 reference can be used to access them. If that optimization turns out to be useful, it would be necessary to duplicate them. Without such an optimization, there is no need to duplicate basic types.

I was not a fan of the abbreviation tables being split up, I believe the llvm dsymutil has done this for a while now, as it leads to duplication as you already mention. I would prefer one big .debug_abbrev section if possible so all compile units can re-use it, but I understand that this would slow things down.

The problem with the abbreviation table is not that it affects performance;
I think it is possible to implement a fast, lock-free global abbreviation table.
The problem is deterministic output. If several threads wrote into a
single global abbreviation table, the order of elements inside the table
would be non-deterministic. Thus, I currently do not see how parallel processing
of compilation units could be done using a global abbreviation table.

Without requirement of deterministic output - global abbreviation table might be effectively implemented.
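
To make the trade-off concrete, here is a minimal, hypothetical C++ sketch (not the patch's actual data structures) of the per-CU scheme described above: each worker thread fills only its own abbreviation table, and the tables are emitted in a fixed CU order, so the output does not depend on thread scheduling.

  // Hypothetical sketch: per-CU abbreviation tables written by independent
  // worker threads, then emitted in fixed CU order for deterministic output.
  #include <cstdio>
  #include <string>
  #include <thread>
  #include <vector>

  struct AbbrevTable {
    std::vector<std::string> Entries; // abbreviation declarations for one CU
  };

  int main() {
    const size_t NumCUs = 4;
    std::vector<AbbrevTable> Tables(NumCUs); // one table per compile unit

    // Each worker touches only its own table: no shared state, no locks.
    std::vector<std::thread> Workers;
    for (size_t CU = 0; CU < NumCUs; ++CU)
      Workers.emplace_back([CU, &Tables] {
        Tables[CU].Entries.push_back("abbrev set for CU" + std::to_string(CU));
      });
    for (auto &W : Workers)
      W.join();

    // Emit in CU order: the byte stream is identical on every run, unlike a
    // single global table appended to by racing threads.
    for (size_t CU = 0; CU < NumCUs; ++CU)
      for (const auto &E : Tables[CU].Entries)
        std::printf("CU%zu: %s\n", CU, E.c_str());
    return 0;
  }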

So 11 bytes for a compile unit plus a few for the main DIE doesn't sound too bad to me when everything would be perfectly organized in the DWARF. Since all types would have matching DW_AT_decl_file attributes, the line table should be very small, with hopefully just a few files. I would love to see a file done like this just so we can compare the sizes of an actual file to see if we are using up that much more space.

I would love to see the "each type gets put into their own compile unit based on the DW_AT_decl_file of each type", just to see what this does to the size of the DWARF compared to the currently solution. If the size increase is too much, I would drop my suggestion. Is trying this out hard to do and not worrying at all about performance?

There are also several namespace DIEs, the line table header, the abbreviation table, and duplicated info inside the line table - all of them together might add up to a significant amount. I will try to model that case and see the real numbers.

Currently you have one compile unit that all threads must access and register their types with right? I don't see how splitting this up would cause more delays than all threads vying for access to the one and only type unit? Seems like it would actually reduce lock contention, but I haven't fully looked at your current solution, so forgive my saying so if this is way off.

It's not about delays. I am trying to make compile units independent.

Why are we trying to avoid cross CU references with DW_FORM_ref_addr? This is placing the emphasis on linking speed once again over how this information will be accessed by consumers. And one might say that bad DWARF organization can cause more delays when these DWARF files are used for debugging or symbolication by having to parse all type DIEs for all types just because it was faster to create the DWARF file in the first place.

Generally speaking, cross-CU references make CUs depend on each other, which complicates parallel processing. If a CU does not depend on others, it can be processed and then thrown out. Otherwise, if it depends on other CUs, we need to load all the dependent CUs into memory and process them together.

If we can separate the monolithic type CU into several independent units, it would help debugging or symbolication avoid parsing all type DIEs, and it would help other tools (like dsymutil) process incoming DWARF quickly with minimal RAM usage.
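
As a rough illustration of the point above (hypothetical types, not the patch's API): CUs without inter-CU references can be processed and freed immediately, while interdependent CUs stay loaded and are handled together afterwards.

  // Hypothetical sketch: free independent CUs early, keep interdependent
  // CUs in memory until they can be resolved together.
  #include <cstdio>
  #include <vector>

  struct CompileUnit {
    int Id;
    bool HasInterCURefs; // discovered while analyzing the CU
    bool Loaded = true;
  };

  int main() {
    std::vector<CompileUnit> CUs = {{0, false}, {1, true}, {2, false}, {3, true}};

    // First pass: independent CUs are fully processed and unloaded at once.
    std::vector<CompileUnit *> Deferred;
    for (auto &CU : CUs) {
      if (!CU.HasInterCURefs) {
        std::printf("CU%d: processed and unloaded immediately\n", CU.Id);
        CU.Loaded = false;
      } else {
        Deferred.push_back(&CU); // must stay in memory for now
      }
    }

    // Second pass: interdependent CUs are resolved against each other, then freed.
    for (auto *CU : Deferred) {
      std::printf("CU%d: resolved together with other loaded CUs\n", CU->Id);
      CU->Loaded = false;
    }
    return 0;
  }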

Does anyone care about single-threaded execution time? I personally don't.

If we do not then things are easier :-)

One other solution would be to group types into common subdirectories based off of the DW_AT_decl_file. The idea being that all STL types could be in one common compile unit for anything with a DW_AT_decl_file that starts with a common directory prefix? This would allow the "vector", "list" and "map" types of the STL to all be in a specific type compile unit? So the idea would be to strip the file base name from the DW_AT_decl_file and group all types from all header files in that directory into the same type unit?

My current attempt is to try to avoid cross-CU references. Let me do some investigation of possible strategies.
Probably, that suggestion (to group types by common directory prefix) would be close to what I am trying to achieve.
What size of type unit is acceptable? If all STL types require, e.g., 4MB - is that OK?

Does anyone care about single-threaded execution time? I personally don't.

If we do not then things are easier :-)

I think @JDevlieghere and @aprantl probably care about it just from a regression perspective - but I could be wrong. (I thought they'd expressed some concerns about that in this/other reviews previously).

Apple folks ^ have you got a sense of if/how much of a single threaded regression you'd want to accept in this work, for some multithreaded gains?

avl added a comment.Feb 11 2022, 7:37 AM

@clayborg I've done some research and have a couple of questions. Would you mind looking at them, please?

  1. First, I tried to model the case with separate compilation units for each declaration file, and it seems that the overall impact is about 0.78% of the debug_info+debug_abbrev size of the clang binary. The debug_info+debug_abbrev size is 158.7MB. The additional size required for the multiple compilation units is 1.23MB (2051*(100+500)): 2051 is the number of files used in decl_file attributes, 100 bytes covers the compilation unit header, compilation unit DIE, namespace DIEs, and line table header, and 500 bytes is the size of a separate abbreviation table.
  2. Second, I divided all types into independent buckets. The current type table for the clang binary takes approx 40MB. The maximal size of a bucket containing dependent types is 1.6MB. So, if the original type table were divided into multiple compilation units based on type dependencies, each type unit would be 1.6MB or less.

    Dividing types into independent buckets has several advantages: since the buckets do not depend on each other, they can be handled in a parallel/independent way. It also potentially minimizes the number of loads that must be done, because all dependent types are loaded at once. If types were not divided into independent buckets, we would need to load dependencies with several loads (i.e. whenever a cross-CU reference is encountered). A rough sketch of how such buckets might be computed is shown right after this list.
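
The following is a rough, hypothetical sketch (invented names, not the patch's code) of how such independent buckets could be computed: union-find over "type A references type B" edges, where each resulting set could become its own type unit with no cross-CU type references.

  // Hypothetical sketch: group types into independent buckets with union-find.
  #include <cstdio>
  #include <numeric>
  #include <utility>
  #include <vector>

  struct UnionFind {
    std::vector<int> Parent;
    explicit UnionFind(int N) : Parent(N) {
      std::iota(Parent.begin(), Parent.end(), 0);
    }
    int find(int X) { return Parent[X] == X ? X : Parent[X] = find(Parent[X]); }
    void unite(int A, int B) { Parent[find(A)] = find(B); }
  };

  int main() {
    // Types 0..4; an edge means "this type references that type".
    const int NumTypes = 5;
    std::vector<std::pair<int, int>> Refs = {{0, 1}, {1, 2}, {3, 4}};

    UnionFind UF(NumTypes);
    for (auto [A, B] : Refs)
      UF.unite(A, B);

    // Types sharing a root form one bucket, i.e. one independent type unit.
    for (int T = 0; T < NumTypes; ++T)
      std::printf("type %d -> bucket %d\n", T, UF.find(T));
    return 0;
  }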

Assuming we decided to split the current global type table on a "decl_file" basis as you suggested, what do you think of the following questions:

  1. How should the following situation be handled?

Source DWARF:

      DW_TAG_compile_unit
        DW_AT_name "cu1"
     
0x100:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
       
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
         
          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
        NULL
     
        DW_TAG_import 0x100   <<<<<<<<<<<<<<<<<<<
      NULL

Result DWARF:

      DW_TAG_compile_unit
        DW_AT_name "type_table_file1"

0x200:  DW_TAG_namespace  <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
          
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
          
        NULL

      DW_TAG_compile_unit
        DW_AT_name "type_table_file2"

0x300:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"

          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
          
        NULL

     DW_TAG_compile_unit
        DW_AT_name "cu1"

        DW_TAG_import 0x200 or 0x300 ?    <<<<<<<<<<<<<<<<<<<

Which offset corresponding to the "namespace1" should be used? Any of them?

  2. Would it be OK to split DW_TAG_module?

Source DWARF:

DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_module             <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"    <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
   
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL

  NULL
  
NULL

Result DWARF:

DW_TAG_compile_unit
  DW_AT_name "type_table_file1"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<<
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
    
  NULL
  
NULL
   
DW_TAG_compile_unit
  DW_AT_name "type_table_file2"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL
    
  NULL
  
NULL

Is it OK that DW_TAG_module would be split?

  3. Only root types should be moved into the compile unit for the corresponding "decl_file", right?
DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_class_type
    DW_AT_name "class1" 
    DW_AT_decl_file "file1"
    
    DW_TAG_subroutine
      DW_AT_name "method1"
      DW_AT_decl_file "file1"    <<<<<<<<<<<<<<<<<<<<<
    NULL
    
    DW_TAG_subroutine
      DW_AT_name "method2"
      DW_AT_decl_file "file2"     <<<<<<<<<<<<<<<<<<<<<
    NULL          
  NULL
NULL

i.e. "method1" and "method2" both should be placed into the compile unit for "file1", right?

  4. What do you think: would it be good to split the current monolithic type table not on a "decl_file" basis but on a "buckets of dependent types" basis?

In D96035#3314256, @avl wrote:

@clayborg I've done some research and have a couple of questions. Would you mind looking at them, please?

  1. First thing is that I tried to model the case with separate compilation units for each declaration file and it seems that the overall impact is about 0,78% of debug_info+debug_abbrev size of the clang binary. Size of debug_info+debug_abbrev size is 158.7MB. Additional size, required for multiple compilation units, is 1.23MB(2051*(100+500)). 2051 is the number of files used in decl_file attributes, 100 bytes is the size of the compilation unit header, compilation unit die, size of namespace dies, line table header, 500 is the size of separate abbreviation table.

Thanks for trying this!

Under 1% (0.78%), I would be ok with this overhead for really clean DWARF. To clarify: was this all types for a specific DW_AT_decl_file in its own compile unit? Or was this for all files in the same directory?

  1. Second thing is that I divided all types into independent buckets. The current type table for the clang binary takes approx 40MB. The maximal size of a bucket containing dependent types is 1.6Mb. So, if the original type table would be divided into multiple compilation units based on types dependency then each type unit might be 1,6Mb or less.

    Dividing types into independent buckets has several advantages: since they are not dependent on each other it is possible to handle them in a parallel/independent way. Another thing is that it potentially minimizes the number of loads which should be done. It loads all dependent types at once. In case types are not divided into independent buckets we would need to load all dependencies by several loads(i.e. whenever cross-CU reference is encountered).

Assuming we decided to split the current global type table on the "decl_file" basis as you suggested, What do you think of the following questions:

  1. How the following situation should be handled:

Source DWARF:

      DW_TAG_compile_unit
        DW_AT_name "cu1"
     
0x100:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
       
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
         
          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
        NULL
     
        DW_TAG_import 0x100   <<<<<<<<<<<<<<<<<<<
      NULL

Result DWARF:

      DW_TAG_compile_unit
        DW_AT_name "type_table_file1"

0x200:  DW_TAG_namespace  <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
          
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
          
        NULL

      DW_TAG_compile_unit
        DW_AT_name "type_table_file2"

0x300:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"

          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
          
        NULL

     DW_TAG_compile_unit
        DW_AT_name "cu1"

        DW_TAG_import 0x200 or 0x300 ?    <<<<<<<<<<<<<<<<<<<

We might have to include them both right? Do we actually see DWARF like this? But my initial guess is we would need them both if clients needed access to "S1" or "S2". Or if we can track what that DWARF accesses within the DW_TAG_import, maybe only the ones that are needed.

Which offset corresponding to the "namespace1" should be used? Any of them?

2. Would it be OK to split DW_TAG_module ?


Source DWARF:
DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_module             <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"    <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
   
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL

  NULL
  
NULL
Result DWARF:
DW_TAG_compile_unit
  DW_AT_name "type_table_file1"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<<
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
    
  NULL
  
NULL
   
DW_TAG_compile_unit
  DW_AT_name "type_table_file2"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL
    
  NULL
  
NULL
Is it OK that DW_TAG_module would be split?

I believe it would be ok to split the module. I don't believe anyone refers to the module directly, do they? Only things within it, and as long as we fix up those direct references, that should be ok. I have to admit I am not familiar with DW_TAG_module as we don't use it much in LLDB AFAIK.

  3. Only root types should be moved into the compile unit for the corresponding "decl_file", right?
DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_class_type
    DW_AT_name "class1" 
    DW_AT_decl_file "file1"
    
    DW_TAG_subroutine
      DW_AT_name "method1"
      DW_AT_decl_file "file1"    <<<<<<<<<<<<<<<<<<<<<
    NULL
    
    DW_TAG_subroutine
      DW_AT_name "method2"
      DW_AT_decl_file "file2"     <<<<<<<<<<<<<<<<<<<<<
    NULL          
  NULL
NULL

i.e. "method1" and "method2" both should be placed into the compile unit for "file1", right?

Yes, the entire contents of a class should go into the compile unit for the DW_AT_decl_file of the class itself. DW_TAG_subprogram entries that belong to the class should live with the class.

  4. What do you think: would it be good to split the current monolithic type table not on a "decl_file" basis but on a "buckets of dependent types" basis?

If every type can go into its own compile unit for only an extra 0.78% cost, I find this totally acceptable and I would vote for this!

I am very excited about this feature and I will push for adoption of this within Facebook ASAP if this is viable, faster and also much better organized! The one thing to think about is seeing if we can make this work with the "--update" option to dsymutil (which reparses the DWARF and re-emits it and the accelerator tables). This --update option could then be used to post-process debug info before it gets permanently archived to disk, so the smaller and better organized the better! So if we can get this working we will have a huge set of code to run this through and test it with!

avl added a comment.Feb 14 2022, 8:17 AM

Thanks for trying this!

Under 1% (0.78%), I would be ok with this overhead for really clean DWARF. To clarify: was this all types for a specific DW_AT_decl_file in its own compile unit? Or was this for all files in the same directory?

That was all types for a specific DW_AT_decl_file in its own compile unit. I did not take into account the duplication of file names in the line table, but I think the overall size would still be within 1% (if there are no other duplicated things).

Things might be much worse if we needed to duplicate these kinds of DIEs for each compilation unit:

DW_TAG_const_type 
                DW_AT_type (0x000000000000004a "llvm::Twine")

DW_TAG_pointer_type
                DW_AT_type (0x000000000003c122 "const char")

We might have to include them both right? Do we actually see DWARF like this? But my initial guess is we would need them both if clients needed access to "S1" or "S2". Or if we can track what that DWARF accesses within the DW_TAG_import, maybe only the ones that are needed.

If we are going to include them both, then (in real examples) it means inserting dozens of DW_TAG_imported_module DIEs (for each used type).

We see DWARF like this very often:

using namespace llvm;

SmallSet<int, 3> Set;
SmallVector<const int *, 8> Ptrs;

Originally it looks like this:

      DW_TAG_compile_unit
        DW_AT_name "driver_cc1.cpp"
        
0x100:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "llvm"
       
          DW_TAG_class_type
            DW_AT_name "SmallSet<int, 3>"
            DW_AT_decl_file "SmallSet.h"
          NULL
         
          DW_TAG_class_type
            DW_AT_name "SmallVector<const int *, 8>"
            DW_AT_decl_file "SmallVector.h"
          NULL
        NULL
     
        DW_TAG_imported_module
          DW_AT_import 0x100   <<<<<<<<<<<<<<<<<<<
      NULL

With the new scheme it would be:

      DW_TAG_compile_unit
        DW_AT_name "type_table_SmallSet"

0x200:  DW_TAG_namespace  <<<<<<<<<<<<<<<<<<<
          DW_AT_name "llvm"
          
          DW_TAG_structure
            DW_AT_name "SmallSet<int, 3>"
            DW_AT_decl_file "SmallSet.h"
          NULL
          
        NULL

      DW_TAG_compile_unit
        DW_AT_name "type_table_SmallVector"

0x300:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "llvm"

          DW_TAG_structure
            DW_AT_name "SmallVector<const int *, 8>"
            DW_AT_decl_file "SmallVector.h"
          NULL
          
        NULL

     DW_TAG_compile_unit
        DW_AT_name "driver_cc1.cpp"

        DW_TAG_imported_module   <<<<<<<<<<<<<<<<<<<
          DW_AT_import 0x200           <<<<<<<<<<<<<<<<<<<
        DW_TAG_imported_module    <<<<<<<<<<<<<<<<<<<
          DW_AT_import 0x300            <<<<<<<<<<<<<<<<<<<

It looks really heavyweight. These DW_TAG_imported_module DIEs will noticeably increase the added size.

I am very excited about this feature and I will push for adoption of this within Facebook ASAP if this is viable, faster and also much better organized!

Thanks!

One thing I do not understand yet: why is "splitting on a decl_file basis" so good? I understand the problem with a big "type_table": we do not want to parse a lot of DWARF if we lazily use only part of it. But if a less fragmented solution exists, would it be better to use it instead of "splitting on a decl_file basis"?
Or is "splitting on a decl_file basis" still better for some reason?

The one thing to think about is seeing if we can make this work with the "--update" option to dsymutil (which reparses the DWARF and re-emits it and the accelerator tables). This --update option could then be used to post-process debug info before it gets permanently archived to disk, so the smaller and better organized the better! So if we can get this working we will have a huge set of code to run this through and test it with!

Yes, I think we could make it possible for "--update" (or another replacement option) to reorganize DWARF. Though it would be good if we did that as an additional change after this patch.

Does the patch in this diff already contain the changes where each type is in a separate compile unit that matches the DW_AT_decl_file? I would love to start running some tests; we have some dSYM files that are too large and are exceeding the 4GB boundary that I would like to test this patch on.

avl added a comment.Mar 9 2022, 2:00 PM

Does the patch in this diff already contain the changes where each type is in a separate compile unit that matches the DW_AT_decl_file? I would love to start running some tests; we have some dSYM files that are too large and are exceeding the 4GB boundary that I would like to test this patch on.

No, it does not yet. Implementing this needs quite serious refactoring and would require some time(a month or two).

Another thing is that I am not sure this is the right way to go. This change seems like it would introduce significant fragmentation. My initial estimate (1%) was based on the assumption that a small number of DIEs would be duplicated. The more I think about it, the more sources of duplication become clear; e.g. it would be necessary to insert a lot of DW_TAG_imported_module DIEs (see https://reviews.llvm.org/D96035#3314256 for an example). Also, the implementation would have to be significantly complicated to add this kind of fragmentation. As a result, the size of the final DWARF would increase and the performance results for the patch would be worse (especially in single-threaded mode, which still seems to be important).

I currently do not understand why "splitting on a decl_file basis" is so good (so good that it is worth the degradation of the performance and size results).

As an alternative, I am thinking of splitting the type_table on a namespace basis. It would avoid the necessity of inserting multiple DW_TAG_imported_module DIEs. In clang, it would result in three compile units of 17MB, 16MB, and 7MB instead of one of 40MB. What do you think of that idea - splitting the current type table into smaller units on a namespace basis?
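
For illustration only, here is a small hypothetical sketch (invented helper, not the patch's code) of grouping types by top-level namespace: the qualified type name is bucketed by the component before the first "::", with unqualified names going into a shared bucket.

  // Hypothetical sketch: bucket types by their top-level namespace.
  #include <cstdio>
  #include <map>
  #include <string>
  #include <vector>

  static std::string topLevelNamespace(const std::string &QualifiedName) {
    size_t Sep = QualifiedName.find("::");
    size_t Tmpl = QualifiedName.find('<');
    // Not namespace-qualified (or "::" only appears inside template args).
    if (Sep == std::string::npos || (Tmpl != std::string::npos && Tmpl < Sep))
      return "(global)";
    return QualifiedName.substr(0, Sep);
  }

  int main() {
    std::vector<std::string> Types = {"llvm::SmallVector<const int *, 8>",
                                      "llvm::Twine", "std::vector<int>", "Foo"};
    std::map<std::string, std::vector<std::string>> Buckets;
    for (const auto &T : Types)
      Buckets[topLevelNamespace(T)].push_back(T);

    // Each bucket would become one artificial type unit.
    for (const auto &[NS, Members] : Buckets)
      std::printf("type unit for namespace '%s': %zu types\n", NS.c_str(),
                  Members.size());
    return 0;
  }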

avl updated this revision to Diff 420072.Apr 3 2022, 12:19 PM

rebased. supported .debug_names.

avl edited the summary of this revision. (Show Details)Apr 3 2022, 12:21 PM

Does anyone care about single-threaded execution time? I personally don't.

If we do not then things are easier :-)

I think @JDevlieghere and @aprantl probably care about it just from a regression perspective - but I could be wrong. (I thought they'd expressed some concerns about that in this/other reviews previously).

Apple folks ^ have you got a sense of if/how much of a single threaded regression you'd want to accept in this work, for some multithreaded gains?

If it's purely time, I don't care too much about single-threaded performance. When the lookForDiesToKeep algorithm was recursive we would occasionally run out of stack space because std::threads (underlying LLVM's thread pool) have a ridiculously small stack and no way to increase it. So we had this workaround in place for a while that if you specified -j1 it would avoid the thread pool altogether. But since I converted that algorithm to a work list/state machine that's not really a concern anymore. The thing I did express concerns about was an earlier revision of this change that caused non-determinism in multithreaded mode, which is not really negotiable, but IIUC that's no longer the case with the current patch.

I'm (still) very excited about this. I'm really happy that this lives behind a flag which addresses most of my original concerns. Making this the default in dsymutil would require a bunch of qualification but this approach allows me to do that gradually. One thing I suggested earlier I think was to run the lldb test suite with the new flag and see if that breaks anything. Did you ever get around to that? That should give us an initial level of confidence. Jim expressed some concerns about trying to squeeze everything in the same compilation unit. Greg, do you share that concern? Do we expect this will negatively impact lldb's performance? It would probably be worthwhile to do a few performance experiments here as well to make sure this doesn't regress debugging performance.

I'm also very excited that Sony is considering upstreaming llvm-dva (based on Paul's comment in D124082). In the past we've often stuck to binary compatibility because there's no great way to qualify big changes to the algorithm. But as long as all of this lives behind a flag that's not a dealbreaker.

If we're happy with the approach I think we should land this and iterate on smaller changes that will be much easier to review. One reason I've been so quiet in this review is that it takes a long time to page in everything.

avl added a comment.May 6 2022, 11:30 AM

I'm (still) very excited about this. I'm really happy that this lives behind a flag which addresses most of my original concerns. Making this the default in dsymutil would require a bunch of qualification but this approach allows me to do that gradually. One thing I suggested earlier I think was to run the lldb test suite with the new flag and see if that breaks anything. Did you ever get around to that?

yes, I did. This is the set of tests passed by this patch:

It passed llvm/test/tools/dsymutil lit tests.
It passed check-llvm, check-clang, check-lldb.
It passed llvm-dwarfdump --verify for all llvm&clang binaries.
It passed llvm-dwarfdump --types-compat-dump for all llvm&clang binaries.

Jim expressed some concerns about trying to squeeze everything in the same compilation unit. Greg, do you share that concern? Do we expect this will negatively impact lldb's performance? It would probably be worthwhile to do a few performance experiments here as well to make sure this doesn't regress debugging performance.

I will check the performance results and post them here. Though I think the performance degradation would not be significant.

Anyway, I think it might be a good idea to split the single type unit into several units. The question is: by which criteria? @clayborg suggested dividing it on a declaration-file basis, so that a separate unit would be created for every declaration file. I think that solution creates significant fragmentation, which would lead to an increased resulting DWARF size and complicate the implementation.

I would prefer to integrate the current version of the patch and add the functionality for separate type units later. Doing things that way has the advantage of iterative development. If the patch is integrated earlier, it will be tested/probed by more people, and it might be improved faster by several people. Another thing is that we will have more time to create a good concept for separate type units.

If we're happy with the approach I think we should land this and iterate on smaller changes that will be much easier to review. One reason I've been so quiet in this review is that it takes a long time to page in everything.

If we're happy with the current approach, I am ready to start creating the separate patches.
I suggest integrating this version of the patch and not waiting for the "several type units" implementation.
The "several type units" implementation can be integrated as an addition to this patch.

avl added a comment.May 9 2022, 10:12 AM

It would probably be worthwhile to do a few performance experiments here as well to make sure this doesn't regress debugging performance.

Please consider the run-time/memory results for lldb used with clang built with various dsymutil options:

                     | time, sec | Memory, Gbytes | .debug_info, bytes |
---------------------|-----------|----------------|--------------------|
clang-original-noodr |   8.58    |      1.82      |     1461723785     |
clang-original-odr   |   8.41    |      1.76      |      484472706     |
clang-dlnext-odr     |   8.75    |      1.98      |      155118011     |  <<<< single type unit size is 40495379 bytes

tested lldb version:

lldb --version
lldb version 15.0.0git (https://github.com/llvm/llvm-project.git revision d9cea8d3a8fff86672174780312674871729578c)
  clang revision d9cea8d3a8fff86672174780312674871729578c
  llvm revision d9cea8d3a8fff86672174780312674871729578c

lldb run command:

/usr/bin/time -l lldb clang --source lldb-commands

set of commands passed to lldb through --source option:

b main
r
n
n
n
fr v
exit

If the above set of commands is not very relevant for testing lldb performance, I will retest with another set.

My main concern with creating one large unit for all types is the performance of the debugger when consuming this, but we don't need to fix this in the first iteration and can work on improving this later after we prove it is a problem. As long as the DWARF is in good shape and we are properly uniquing types in a much better way than before I am good.

Can we name the option something other than "--use-dlnext"? This option name won't mean much to users. Not sure what a good option name would be though but something that doesn't talk about internal classes that will be used would be better IMHO.

JDevlieghere accepted this revision.May 11 2022, 3:41 AM

Can we name the option something other than "--use-dlnext"? This option name won't mean much to users. Not sure what a good option name would be though but something that doesn't talk about internal classes that will be used would be better IMHO.

Agreed. How about a flag to specify the DWARF linker: --dwarf-linker={dsymutil,dwarfutil}. The respective tools would have the respective linkers as their default and we could add more options later (e.g. dwarfutil with separate type units).

With the flag renames this LGTM.

This revision is now accepted and ready to land.May 11 2022, 3:41 AM
avl added a comment.May 11 2022, 4:42 AM

Can we name the option something other than "--use-dlnext"? This option name won't mean much to users. Not sure what a good option name would be though but something that doesn't talk about internal classes that will be used would be better IMHO.

Agreed. How about a flag to specify the DWARF linker: --dwarf-linker={dsymutil,dwarfutil}. The respective tools would have the respective linkers as their default and we could add more options later (e.g. dwarfutil with separate type units).

With the flag renames this LGTM.

Probably, we would like to use both DWARF linkers with dsymutil and dwarfutil. Thus, we do not want to name the tools in the option. How about these variants:

--dwarf-linker={classic,mt}  (mt is multi-threaded)

--dwarf-linker={classic,with-joined-types}

--dwarf-linker={classic,experimental}

?

Also, to check my understanding: we are not going to integrate this patch as a single huge change. Instead, I will separate it into a series of reviews and we are going to integrate them one by one, correct?

probably, we would like to use both dwarflinkers with dsymutil and dwarfutil. Thus, we do not want to name tools in the option. How about these variants:

--dwarf-linker={classic,mt}  (mt is multi-threaded)

--dwarf-linker={classic,with-joined-types}

--dwarf-linker={classic,experimental}

How about:

--linker=apple
--linker=llvm

The "apple" would be the classic Apple linker with all of its compatability
The new "llvm" would be the new linker with the better type uniquing + other features that will get added. We would need these described in the option help text.

Any multi-threaded options should be separate options if they are not already covered by existing options.

Do we have the ability to change how types are emitted? Like, can we use the "classic" linker with the new type uniquing options ("with-joined-types" or "experimental")? I am trying to figure out the three options you specified above.

avl added a comment.May 11 2022, 10:32 PM

How about:

--linker=apple
--linker=llvm

The "apple" would be the classic Apple linker with all of its compatability
The new "llvm" would be the new linker with the better type uniquing + other features that will get added. We would need these described in the option help text.

These are fine to me.

Any multi-threaded options should be separate options if they are not already covered by existing options.

Do we have the ability to change how types are emitted? Like, can we use the "classic" linker with new type uniquing options ("with-joined-types" or "experimental")? I am trying to figure out the three options you specified above

No, we do not have the ability to change how types are emitted. The "classic"/"apple" linker generates them one way; the "experimental"/"llvm" linker generates them another way.

These variants: "mt", "with-joined-types", "experimental" are different suggestions for the name of new linker. There should be used only one of them.

"mt" - supposed to note that new linker utilizes multi-thread execution heavier.
"with-joined-types" - supposed to note that new linker uses other approach to type deduplication.
"experimental" - supposed to note that new linker is an experiment yet.

dreampiggy added a comment.EditedJul 27 2022, 4:24 AM

Hi.

I tried to pick up the code to test the performance. However, I found that the new dwarflinker-next will crash on some specific code paths.

See crash stacktrace here:

I tried to debug it and found the reason:

During the analysis stage, NamesGenerator.assignName(CU, DieIdx) cannot assign a valid name for some unhandled DIE entry tags, like

+ DW_TAG_thrown_type

This causes the final CU.TypeInfo.TypeName to be null, and it crashes in TypePool.insert during the ODR stage.

Is there any solution for this? Should we just mark it as a bad type with the raw hex in the default case (currently not done)? Or add some more handled cases?

avl added a comment.EditedJul 27 2022, 6:05 AM

Hi.

I try to pick the code for testing the performance. However, I found the new dwarflinker-next, will crash at some specify code path

Hi @dreampiggy,

The patch is a bit outdated. The upstream code has been changed since I built it last time. I will rebase it.

See crash stacktrace here:

I try to debug and found the reason:

during analysis stage, NamesGenerator.assignName(CU, DieIdx) can not assign a valid name for some unhandled DIEEntry tag, like

+ DW_TAG_thrown_type

This will cause the final CU.TypeInfo.TypeName be null, and crash at the TypePool.insert during ODR stage

Any solution for this ? Should we just mark it as bad type with the raw hex with the default case (current not)? Or add some more handle cases ?

We need to handle more cases. DW_TAG_thrown_type is currently not handled by the new DWARFLinker (nor by the old DWARFLinker). It should be handled at least inside SyntheticTypeNameBuilder::analyzeDieForTypeName().
I will rebase and add handling of DW_TAG_thrown_type in a week.
Thanks for the analysis.

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But it seems the old DWARFLinker DIECloner does not hit any assert and crash. It seems the new "artificial compile unit" based on type names will fail on this case?

avl added a comment.Jul 28 2022, 10:42 AM

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

Yes, it will fail. Thus, it is necessary to add (besides the specific handling for DW_TAG_thrown_type) a default handler for unknown DW_TAG_* DIEs, so that the patch does not crash in such a case. I will do this with the next rebase.

In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

It should be as easy as classifying DW_TAG_thrown_type as a type die right?

In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

So it sounds like there's a list of attributes that may be type attributes - is it possible to detect type references by the type of the referenced DIE instead of the attribute used to reference them? (or both?) or would DW_AT_sibling break that by causing a sibling attribute to be treated differently just because the sibling was a type?

avl added a comment.Jul 29 2022, 2:46 AM
In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

So it sounds like there's a list of attributes that may be type attributes - is it possible to detect type references by the type of the referenced DIE instead of the attribute used to reference them? (or both?) or would DW_AT_sibling break that by causing a sibling attribute to be treated differently just because the sibling was a type?

This exact case is not about how to detect whether an attribute references a type DIE or not.
(Though attributes referencing type DIEs are currently implemented as a predefined list: DW_AT_type, DW_AT_containing_type, DW_AT_specification, DW_AT_abstract_origin, DW_AT_import.)
This case is about how to decide which DIEs referencing type DIEs should be deduplicated.

E.g., the patch deduplicates this kind of DIE:

DW_TAG_pointer_type 
   DW_AT_type -> ref to type DIE.

To be able to deduplicate the above DW_TAG_pointer_type, the patch creates a type descriptor
for the DW_TAG_pointer_type DIE. Any reference to the above pointer will be resolved to a
reference to this type descriptor.

There are some DIEs referencing type DIEs that should not be deduplicated,
e.g. DW_TAG_call_site or DW_TAG_call_site_parameter, since they are used in
subroutines that are not moved into the "artificial type unit".

Currently there is a list of such DIEs inside SyntheticTypeNameBuilder::analyzeDieForTypeName().
DIEs handled in analyzeDieForTypeName() will be assigned a name and may be deduplicated later.
DW_TAG_thrown_type was missed there. Probably, instead of a fixed list of DIEs that might be deduplicated,
it is possible to create a more general rule, e.g. if some DIE is a child of another DIE that is already known
to be deduplicated, then such a DIE should be considered a deduplication candidate as well. This rule would
cover the DW_TAG_thrown_type case.

Speaking of determining attributes referencing type DIEs: a solution that detects type references by the type of the referenced DIE might be too expensive from a performance point of view. Resolving DIE references is quite an expensive operation.
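
To make the proposed rule above concrete, here is a minimal, hypothetical sketch (an invented DIE representation, not the patch's code): any child of an already-known deduplication candidate is marked as a candidate too, which would cover tags such as DW_TAG_thrown_type without a fixed list.

  // Hypothetical sketch: propagate "deduplication candidate" from parents
  // to children instead of keeping a fixed list of tags.
  #include <cstdio>
  #include <string>
  #include <vector>

  struct DIE {
    std::string Tag;
    bool IsCandidate = false;
    std::vector<DIE> Children;
  };

  // Recursively mark every child of a candidate DIE as a candidate as well.
  void markCandidates(DIE &Die, bool ParentIsCandidate) {
    if (ParentIsCandidate)
      Die.IsCandidate = true; // inherit from an already-known candidate
    for (DIE &Child : Die.Children)
      markCandidates(Child, Die.IsCandidate);
  }

  int main() {
    // Assume the subprogram was already classified as a candidate.
    DIE Subprogram{"DW_TAG_subprogram", /*IsCandidate=*/true,
                   {{"DW_TAG_thrown_type"}, {"DW_TAG_formal_parameter"}}};
    markCandidates(Subprogram, /*ParentIsCandidate=*/false);
    for (const DIE &Child : Subprogram.Children)
      std::printf("%s candidate=%d\n", Child.Tag.c_str(), Child.IsCandidate);
    return 0;
  }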

avl added a comment.Jul 29 2022, 2:50 AM
In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

It should be as easy as classifying DW_TAG_thrown_type as a type die right?

Yes, we can classify DW_TAG_thrown_type as a deduplication candidate and this concrete case should be cured.

Additionally, we can extend the rule for deduplication candidates in the following way: all children of a known deduplication candidate should also be considered deduplication candidates. That avoids having to maintain a fixed list.

In D96035#3686890, @avl wrote:
In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

So it sounds like there's a list of attributes that may be type attributes - is it possible to detect type references by the type of the referenced DIE instead of the attribute used to reference them? (or both?) or would DW_AT_sibling break that by causing a sibling attribute to be treated differently just because the sibling was a type?

This exact case is not about how to detect whether an attribute references type DIE or not.
(Though attributes referencing type die currently implemented as the predefined list: DW_AT_type, DW_AT_containing_type, DW_AT_specification, DW_AT_abstract_origin, DW_AT_import.)
This case is about how to decide which DIE referencing type DIE should be deduplicated.

f.e. The patch deduplicates this kind of DIE:

DW_TAG_pointer_type 
   DW_AT_type -> ref to type DIE.

To be able to deduplicate above DW_TAG_pointer_type, the patch creates a type descriptor
for the DIE DW_TAG_pointer_type. Any reference to the above pointer will be resolved to the
reference to this type descriptor.

There are some DIEs referencing type DIEs that should not be deduplicated.
f.e. DW_TAG_call_site or DW_TAG_call_site_parameter. Since they are used in
subroutines that are not moved into the "artificial type unit".

Currently there is a list of such DIEs inside SyntheticTypeNameBuilder::analyzeDieForTypeName().
DIEs handled in analyzeDieForTypeName() will be assigned a name and maybe deduplicated later.
DW_TAG_thrown_type was missed there. Probably, instead of just a list of DIEs that might be deduplicated,
it is possible to create a more general rule. f.e. if some DIE is a child of another DIE that is already known
to be deduplicated then such DIE should be considered as a deduplication candidate also. This rule should
cover the DW_TAG_thrown_type case.

Speaking of determining attributes referencing type DIEs: solution detecting type references by the type of the referenced DIE might be too expensive from the performance point of view. Resolving DIE references is quite expensive operation.

Ah, OK - I'm mostly following. Thanks for helping me understand.

avl updated this revision to Diff 450799.Aug 8 2022, 7:23 AM

rebased, changed use-dlnext into --linker=[Apple,LLVM], added handling DW_TAG_thrown_type, added handling default case.

Matt added a subscriber: Matt.Jun 5 2023, 12:23 PM