

[dsymutil][DWARFlinker] implement separate multi-thread processing for compile units.
Accepted · Public

Authored by avl on Feb 4 2021, 7:11 AM.

Details

Summary

The idea of this refactoring is to speed up DWARFLinker by utilizing parallel execution and minimizing memory usage (in the case where the object file contains multiple compilation units). This patch creates a new library, DWARFLinkerNext, which is called from the standard dsymutil via the --use-dlnext command-line option. This is the second version of the patch; it additionally does type merging and produces a predictable result (the generated .dSYM file is the same between similar runs). The best configuration produces a .debug_info table that is 68% smaller and works two times faster than the current upstream dsymutil.

The overall linking process looks like this:

parallel_for_each(ObjectFile) {
  for_each (Compile Unit) {
    1. load Clang modules.
  }

  parallel_for_each(Compile Unit) {
    1. load input DWARF for Compile Unit.
    2. report warnings for Clang modules.
    3. analyze live DIEs.
    4. analyze types if ODR deduplication is requested.
    5. clone DIEs (generate output DIEs and resulting DWARF tables).
       The result is in OutDebugInfoBytes, which is an ELF file containing
       the DWARF tables corresponding to the current compile unit.
    6. cleanup Input and Output DIEs.
  }

  deallocate loaded Object file.
}

if (ODR deduplication is requested)
  generate the artificial compilation unit "Type Table" (it reuses DIEs partially generated during the clone stage).

for_each (ObjectFile) {
  for_each (Compile Unit) {
    1. assign offsets to the compile unit's DWARF tables.
    2. sort offsets/attributes/patches so that the result is deterministic.
    3. patch size/offset fields.
    4. generate index tables.
    5. move the DWARF tables of the compile unit into the resulting file.
  }
}

That is, every compile unit is processed separately and visited only once
(except when inter-CU references exist), and its data is freed once the
compile unit has been processed. The resulting file is assembled from the
debug tables generated for the separate compile units.
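
To make the shape of that loop concrete, below is a minimal C++ sketch of the first, parallel pass. It only illustrates the control flow described above; the struct members and helper names are illustrative placeholders rather than the actual interfaces of this patch (llvm::parallelForEach itself is the existing LLVM facility).

#include "llvm/Support/Parallel.h"

#include <vector>

// Illustrative per-compile-unit state; the real linker keeps much more
// (input/output DIE trees, string pools, per-CU DWARF table buffers, ...).
struct CompileUnitState {
  std::vector<char> OutDebugInfoBytes; // generated DWARF tables for this CU
};

struct ObjectFileState {
  std::vector<CompileUnitState> CUs;
};

void linkAllObjects(std::vector<ObjectFileState> &Objects) {
  llvm::parallelForEach(Objects, [](ObjectFileState &Obj) {
    // Clang modules are loaded sequentially per object file (see step 1).
    llvm::parallelForEach(Obj.CUs, [](CompileUnitState &CU) {
      (void)CU; // Steps 1-6 of the pseudocode would run here: load the
                // CU's input DWARF, analyze live DIEs and types, clone
                // output DIEs into CU.OutDebugInfoBytes, then free the
                // input and output DIEs so that memory stays proportional
                // to a single compile unit.
    });
    // The loaded object file can be deallocated here.
  });
}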

Handling inter-CU references: inter-CU references are hard to process
in a single pass. For example, if CU1 references CU100 and CU100 references
CU1, we cannot finish handling CU1 until we have finished CU100. Thus we
either need to load all CUs into memory or load CUs several times.
This patch discovers dependencies during the first pass, so that interdependent
compile units stay loaded in memory; then, during the second pass, it handles
the interconnected compile units.

That approach works well when the number of inter-CU references is low, since it avoids loading all CUs into memory at once.
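
A rough sketch of that bookkeeping, assuming a per-CU flag that is set while cloning DIEs (the names below are illustrative only, not the patch's real interfaces):

#include <vector>

struct CUState {
  bool HasInterCUReferences = false; // discovered while cloning DIEs (pass 1)
  bool InputDataFreed = false;
};

// First pass: process each CU independently. A CU that references (or is
// referenced by) another CU is kept loaded; all other CUs are freed at once.
void firstPass(std::vector<CUState> &CUs,
               std::vector<CUState *> &KeptLoaded) {
  for (CUState &CU : CUs) {
    // ... load, analyze and clone the CU, noticing cross-CU references ...
    if (CU.HasInterCUReferences)
      KeptLoaded.push_back(&CU); // must survive until the second pass
    else
      CU.InputDataFreed = true;  // memory is released right away
  }
}

// Second pass: only the interconnected CUs (ideally a small subset) are
// still in memory, so references between them can now be resolved.
void secondPass(std::vector<CUState *> &KeptLoaded) {
  for (CUState *CU : KeptLoaded) {
    // ... patch references into the other kept units, then free this CU ...
    CU->InputDataFreed = true;
  }
}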

Changes from the current implementation (these make the output of DWARFLinkerNext binary-incompatible with that of the current DWARFLinker):

a) No common abbreviation table. Each compile unit has its own abbreviation
   table. Generating a common abbreviation table slows down parallel execution
   (it is a shared resource accessed many times from many threads). The
   abbreviation table does not take much space, so having separate abbreviation
   tables looks cheap. Later, this might be optimized a bit (by deduplicating
   identical abbreviation tables).
   
b) .debug_frame: already generated CIE records are not reused between object files.

c) ODR type deduplication uses a different approach than the current dsymutil. All types are moved into an artificial compilation unit, and references to the types are rewritten so that they point to the types in that artificial compilation unit. Declarations and partial declarations of the same type are merged into a single definition/declaration located in the artificial compilation unit.
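
As an illustration of the merging idea only (these are not the actual data structures of this patch), the deduplication can be thought of as a map from a type's fully qualified name to the DIE already emitted into the artificial unit:

#include "llvm/ADT/StringMap.h"

#include <string>

struct OutputDIE {};    // stand-in for a cloned DIE in the artificial unit
struct InputTypeDIE {}; // stand-in for a type DIE read from an object file

class TypeTableUnit {
  llvm::StringMap<OutputDIE *> EmittedTypes; // qualified name -> merged DIE

public:
  // Returns the single merged DIE for a type, cloning it into the artificial
  // unit the first time its qualified name is seen. Later declarations and
  // definitions of the same type reuse that DIE, and callers rewrite their
  // references to point here instead of into the original compile unit.
  OutputDIE *getOrCreate(const std::string &QualifiedName,
                         const InputTypeDIE &Input) {
    auto It = EmittedTypes.find(QualifiedName);
    if (It != EmittedTypes.end())
      return It->second; // duplicate type: reuse the existing DIE
    OutputDIE *Cloned = cloneIntoTypeTable(Input);
    EmittedTypes[QualifiedName] = Cloned;
    return Cloned;
  }

private:
  // Placeholder: the real linker would recursively clone the DIE and its
  // children into the artificial "Type Table" unit (allocated in an arena,
  // not leaked as in this toy sketch).
  OutputDIE *cloneIntoTypeTable(const InputTypeDIE &) { return new OutputDIE(); }
};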

Some DWARF features are not supported yet (support could be added later): DWARF64, Split DWARF, DWARF 5.

Performance results for this patch for the clang binary (Darwin, 24 cores, 64 GB RAM):

  1. clang binary, ODR deduplication is ON:
|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  | DWARF(*)|exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   161   |   16.5   |   486   |   204   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   2   |   101   |   17.9   |   486   |   118   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   4   |   101   |   17.9   |   486   |    75   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|   8   |   101   |   17.9   |   486   |    51   |   12.0   |   157   |
|-------|------------------------------|------------------------------|
|  16   |   101   |   17.9   |   486   |    42   |   12.2   |   157   |
|---------------------------------------------------------------------|

(*) DWARF is the size of .debug_info section.

  2. clang binary, ODR deduplication is OFF (--no-odr):
|----------------------------------------------------------------------
|       |           dsymutil           |     dsymutil --use-dlnext    |
|-------|------------------------------|------------------------------|
|       |exec time|  memory  |  DWARF  |exec time|  memory  |  DWARF  |
|       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
|-------|------------------------------|------------------------------|
|threads|         |          |         |         |          |         |
|-------|------------------------------|------------------------------|
|   1   |   227   |   16.3   |   1460  |   227   |   16.2   |   1470  |
|-------|------------------------------|------------------------------|
|   2   |   218   |   17.9   |   1460  |   132   |   16.2   |   1470  |
|-------|------------------------------|------------------------------|
|   4   |   218   |   17.9   |   1460  |    83   |   16.5   |   1470  |
|-------|------------------------------|------------------------------|
|   8   |   218   |   17.9   |   1460  |    58   |   16.5   |   1470  |
|-------|------------------------------|------------------------------|
|  16   |   218   |   17.9   |   1460  |    47   |   16.9   |   1470  |
|---------------------------------------------------------------------|

The effective CPU utilization ratio for this implementation is 52%.

There is still room for improvement: the run-time performance might be sped up by a further 10-20%, and the run-time memory usage might be decreased by up to 10%.


Event Timeline

jhenderson added inline comments. Jan 5 2022, 6:18 AM
llvm/tools/llvm-dwarfdump/llvm-dwarfdump.cpp
252 (On Diff #396957)

The reason why the content of an anonymous namespace (as well as types defined inside subprograms) is moved into the type table is the handling of parameterized types.
If a type defined inside an anonymous namespace is not moved into the type table, and this type is used as a parameter of another class, then we cannot move that other class into the type table either. That other class might be used in other places and have other dependencies. In the end, the type from the anonymous namespace might prevent moving a lot of classes into the type table (the current design decision is that the type table has no references into other compile units). Please check the following example:

You can use cross-unit references (DW_FORM_sec_offset) to have one type/DIE in one unit refer to a type/DIE in another unit. So they could still be moved to/handled in a common way, but split up into multiple units to avoid pathological cases in DWARF consumers that might tend to load whole units at a time. Would that be feasible? (not sure if such consumers also then have common pathologies where they load all units that a unit references via DW_FORM_sec_offset... if they do, then the solution becomes more difficult and would involve either putting every type in a separate unit or finding groups of related types to put together, rather than arbitrary grouping - @clayborg do you know if lldb has that sort of behavior? (where it'd try to load all referenced units from a unit it's loading))

avl added a comment. Jan 5 2022, 1:36 PM

You can use cross-unit references (DW_FORM_sec_offset) to have one type/DIE in one unit refer to a type/DIE in another unit. So they could still be moved to/handled in a common way, but split up into multiple units to avoid pathological cases in DWARF consumers that might tend to load whole units at a time. Would that be feasible? (not sure if such consumers also then have common pathologies where they load all units that a unit references via DW_FORM_sec_offset... if they do, then the solution becomes more difficult and would involve either putting every type in a separate unit or finding groups of related types to put together, rather than arbitrary grouping - @clayborg do you know if lldb has that sort of behavior? (where it'd try to load all referenced units from a unit it's loading))

If I understood correctly, you are talking here more about splitting the current global type table between several artificial type tables.
That is an additional problem, but not what I was talking about. I tried to describe the situation where a class ("A") contains information for both the global
namespace and an anonymous namespace:

DW_TAG_structure_type                                               | global&anonymous
  DW_AT_calling_convention        (DW_CC_pass_by_value)             | global&anonymous
  DW_AT_name      ("A")                                             | global&anonymous
  DW_AT_byte_size (0x01)                                            | global&anonymous
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               | global&anonymous
  DW_AT_decl_line (1)                                               | global&anonymous

  DW_TAG_subprogram                                                 | anonymous
    DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     | anonymous
    DW_AT_name    ("foo<(anonymous namespace)::B>")                 | anonymous
    DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       | anonymous
    DW_AT_decl_line       (3)                                       | anonymous
    DW_AT_type    (0x0000016d "(anonymous namespace)::B")           | anonymous
    DW_AT_declaration     (true)                                    | anonymous

    DW_TAG_template_type_parameter                                  | anonymous
      DW_AT_type  (0x0000016d "(anonymous namespace)::B")           | anonymous
      DW_AT_name  ("T")                                             | anonymous

    DW_TAG_formal_parameter                                         | anonymous
      DW_AT_type  (0x000001b3 "A *")                                | anonymous
      DW_AT_artificial    (true)                                    | anonymous

    NULL                                                            | anonymous
    
  DW_TAG_subprogram                                                 | global
    DW_AT_linkage_name    ("another member of A")                   | global
    DW_AT_name    ("another member of A")                           | global
    
  NULL                                                              | global&anonymous
Thus it should be split into two parts. One should go to the global type table:
DW_TAG_structure_type                                               
  DW_AT_calling_convention        (DW_CC_pass_by_value)             
  DW_AT_name      ("A")                                             
  DW_AT_byte_size (0x01)                                           
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")              
  DW_AT_decl_line (1)                                              

  DW_TAG_subprogram                                                
    DW_AT_linkage_name    ("another member of A")                  
    DW_AT_name    ("another member of A")                          
    
  NULL
The other part should be left in the owning compilation unit.
DW_TAG_structure_type                                             
DW_AT_calling_convention        (DW_CC_pass_by_value)             
DW_AT_name      ("A")                                             
DW_AT_byte_size (0x01)                                            
DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               
DW_AT_decl_line (1)                                               

DW_TAG_subprogram                                                 
  DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     
  DW_AT_name    ("foo<(anonymous namespace)::B>")                 
  DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       
  DW_AT_decl_line       (3)                                       
  DW_AT_type    (0x0000016d "(anonymous namespace)::B")           
  DW_AT_declaration     (true)                                    

  DW_TAG_template_type_parameter                                  
    DW_AT_type  (0x0000016d "(anonymous namespace)::B")           
    DW_AT_name  ("T")                                             

  DW_TAG_formal_parameter                                         
    DW_AT_type  (0x000001b3 "A *")                                
    DW_AT_artificial    (true)                                    

  NULL                                                            
  
NULL

That kind of splitting is not implemented by the current patch.
Also, that solution would require more space since it duplicates the structure description.

Or are you suggesting another solution for that case?

avl added a comment. Jan 5 2022, 3:03 PM

One issue I do see with this approach as well is the first "type_table" is about 69MB for a dSYM file for LLDB.framework in LLDB. Any types that are accessed from debug info in LLDB will end up causing this entire DW_TAG_compile_unit to be parsed which can be expensive as LLDB parses DWARF partially and only parses what it needs to when loading DWARF. Is there any way to split up the "type_table" into multiple DW_TAG_compile_unit entries whose DW_AT_name matches the DW_AT_decl_file of the root type that is being uniqued? That would allow smaller compile units to be parsed and keep memory footprint down in debuggers.

I think it could be done in general, though I am not sure about the concrete scheme. The following are the reasons why I implemented it as a single type unit:

  1. Fragmentation. The clang binary has ~1600 compilation units. If every unit had a separate type unit, then the unit header, unit DIE, namespace DIEs, base types, and line table headers would be duplicated. Types distributed between units would also need to be duplicated (if they have a template parameter from another unit, similar to the above example for an anonymous type).
  2. Cross-CU references. The current type table does not have outside references, which allows processing the table in one pass. If we allowed outside references, we would need to accumulate such interconnected units in memory first and then process them. I do not know how that is currently implemented in lldb - probably there is some efficient scheme to handle interconnected units.

Probably we could create some scheme with several type units, e.g. similar to what David described: "a basic heuristic would be X types per unit, but a more advanced one would depend on the size of the types (& even more advanced would group related types together to benefit from more local references - but I doubt that's worth it)."

In D96035#3223521, @avl wrote:

You can use cross-unit references (DW_FORM_sec_offset) to have one type/DIE in one unit refer to a type/DIE in another unit. So they could still be moved to/handled in a common way, but split up into multiple units to avoid pathological cases in DWARF consumers that might tend to load whole units at a time. Would that be feasible? (not sure if such consumers also then have common pathologies where they load all units that a unit references via DW_FORM_sec_offset... if they do, then the solution becomes more difficult and would involve either putting every type in a separate unit or finding groups of related types to put together, rather than arbitrary grouping - @clayborg do you know if lldb has that sort of behavior? (where it'd try to load all referenced units from a unit it's loading))

If I understood correctly - you are talking here more about splitting the current global type table between several artificial type tables.
Which is an additional problem, but not what I was talking about. I tried to describe situation when a class ("A") contains information for global
namespace and for anonymous namespace:

DW_TAG_structure_type                                               | global&anonymous
  DW_AT_calling_convention        (DW_CC_pass_by_value)             | global&anonymous
  DW_AT_name      ("A")                                             | global&anonymous
  DW_AT_byte_size (0x01)                                            | global&anonymous
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               | global&anonymous
  DW_AT_decl_line (1)                                               | global&anonymous

  DW_TAG_subprogram                                                 | anonymous
    DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     | anonymous
    DW_AT_name    ("foo<(anonymous namespace)::B>")                 | anonymous
    DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       | anonymous
    DW_AT_decl_line       (3)                                       | anonymous
    DW_AT_type    (0x0000016d "(anonymous namespace)::B")           | anonymous
    DW_AT_declaration     (true)                                    | anonymous

    DW_TAG_template_type_parameter                                  | anonymous
      DW_AT_type  (0x0000016d "(anonymous namespace)::B")           | anonymous
      DW_AT_name  ("T")                                             | anonymous

    DW_TAG_formal_parameter                                         | anonymous
      DW_AT_type  (0x000001b3 "A *")                                | anonymous
      DW_AT_artificial    (true)                                    | anonymous

    NULL                                                            | anonymous
    
  DW_TAG_subprogram                                                 | global
    DW_AT_linkage_name    ("another member of A")                   | global
    DW_AT_name    ("another member of A")                           | global
    
  NULL                                                              | global&anonymous
Thus it should be split into two parts. One should go to the global type table:
DW_TAG_structure_type                                               
  DW_AT_calling_convention        (DW_CC_pass_by_value)             
  DW_AT_name      ("A")                                             
  DW_AT_byte_size (0x01)                                           
  DW_AT_decl_file ("/home/avl/test_templates1/./a.h")              
  DW_AT_decl_line (1)                                              

  DW_TAG_subprogram                                                
    DW_AT_linkage_name    ("another member of A")                  
    DW_AT_name    ("another member of A")                          
    
  NULL
Another part should be left into the owning compilation unit.
DW_TAG_structure_type                                             
DW_AT_calling_convention        (DW_CC_pass_by_value)             
DW_AT_name      ("A")                                             
DW_AT_byte_size (0x01)                                            
DW_AT_decl_file ("/home/avl/test_templates1/./a.h")               
DW_AT_decl_line (1)                                               

DW_TAG_subprogram                                                 
  DW_AT_linkage_name    ("_ZN1A3fooIN12_GLOBAL__N_11BEEET_v")     
  DW_AT_name    ("foo<(anonymous namespace)::B>")                 
  DW_AT_decl_file       ("/home/avl/test_templates1/./a.h")       
  DW_AT_decl_line       (3)                                       
  DW_AT_type    (0x0000016d "(anonymous namespace)::B")           
  DW_AT_declaration     (true)                                    

  DW_TAG_template_type_parameter                                  
    DW_AT_type  (0x0000016d "(anonymous namespace)::B")           
    DW_AT_name  ("T")                                             

  DW_TAG_formal_parameter                                         
    DW_AT_type  (0x000001b3 "A *")                                
    DW_AT_artificial    (true)                                    

  NULL                                                            
  
NULL

That kind of splitting is not implemented by current patch.
Also that solution would require more space since it duplicates structure description.

Or are you suggesting another solution for that case?

Oh, sorry, I wasn't discussing that case/didn't understand that's what was being discussed. I think for that case it's fine to move the translation-unit-local/private linkage/anonymous-ish part of the type into the global type table along with the rest of the type.

I think @clayborg's point was mostly for a purely translation-unit-local entity it could stay in the translation unit that defined it - reducing the need for cross-CU referencing to refer to the type and for work needed to move the type around, etc.

In D96035#3223734, @avl wrote:

One issue I do see with this approach as well is the first "type_table" is about 69MB for a dSYM file for LLDB.framework in LLDB. Any types that are accessed from debug info in LLDB will end up causing this entire DW_TAG_compile_unit to be parsed which can be expensive as LLDB parses DWARF partially and only parses what it needs to when loading DWARF. Is there any way to split up the "type_table" into multiple DW_TAG_compile_unit entries whose DW_AT_name matches the DW_AT_decl_file of the root type that is being uniqued? That would allow smaller compile units to be parsed and keep memory footprint down in debuggers.

I think that it could be done in general. Though I am not sure about the concrete scheme. Following are reasons why I implemented it as a single type unit:

  1. Fragmentation. clang binary has ~1600 compilation units. If every unit would have a separate type unit, then unit header, unit die, namespace dies, base types, line table headers would be duplicated. Types distributed between units should also be duplicated(if they have a template parameter from another unit, similar to the above example for an anonymous type).

Yeah, I think one type per unit would be more expensive than would be desirable - like type units which have a lot of overhead for this sort of reason.

I don't understand the second part "Types distributed between units should also be duplicated" - types can refer to types in other units (using DW_FORM_sec_offset the same as the way that types will be referenced from the non-type contexts where the type needs to be referred to (eg: a compilation unit with a subprogram that needs to refer to the return type in the global type table/type unit/whatever)), I wouldn't advocate for any duplication.

  1. Cross CU references. The current type table does not have outside references. That allows processing the table by one pass. If we allow outside references then we need to accumulate such inter-connected units in memory first and then process them. I do not know how that is currently implemented in lldb - Probably there is some efficient scheme to handle inter-connected units.

If the types can all be generated in one pass today - they could be generated in one pass over multiple units as well. I don't follow why that'd require multiple passes. (laying out one unit with multiple types requires laying out all the DIEs to know the values of unit-relative offsets - putting those DIEs in several units doesn't make that situation worse in terms of the amount of data that needs to be kept around - it'd mean laying out multiple units to determine where the offset values are across those units)

Probably, we might create some scheme with several type units. F.e. similar to what David described: "a basic heuristic would be X types per unit, but a more advanced one would depend on the size of the types (& even more advanced would group related types together to benefit from more local references - but I doubt that's worth it)."

Yep - X types per unit would be a good first approximation. Doing "X bytes or DIEs per grouping" probably wouldn't be too expensive either.

avl added a comment. Jan 6 2022, 4:51 AM

If the types can all be generated in one pass today - they could be generated in one pass over multiple units as well. I don't follow why that'd require multiple passes. (laying out one unit with multiple types requires laying out all the DIEs to know the values of unit-relative offsets - putting those DIEs in several units doesn't make that situation worse in terms of the amount of data that needs to be kept around - it'd mean laying out multiple units to determine where the offset values are across those units)

Oh sorry, I did not write that clearly. My statement is not about generation; it is about reading the generated DWARF. If the resulting DWARF were closely coupled by inter-CU references, then it would be hard to read and process. It would also be hard to implement parallel processing of separate compile units (because the units depend on each other). To process the DWARF effectively, it would be necessary to load nearly all compile and type units into memory. An extreme scenario of such usage is applying DWARFLinker to already-linked DWARF.

With the current implementation, only the "type_table" unit needs to stay loaded; other units can be loaded/unloaded independently (because the "type table" cannot reference other compile and type units).
If we generated DWARF in which type units reference other type units and ordinary compile units, then in the worst case we would need to load nearly all of them to process the data (and since the dependencies are not known up front, we would need one pass to discover the dependencies and then another pass that loads the units and processes them).

Probably lldb does not do any processing that requires loading all dependent units at once, and thus this is not so important for lldb.

But from the point of view of a general DWARF reader, it is better to have less coupled DWARF. That is why the current design and implementation of the patch assume that the "type table" does not reference other units.

At first glance, we could divide the types into independent groups, so that there would still be no references between type units
and smaller groups could be loaded at once.

avl added a comment. Jan 6 2022, 2:10 PM

I don't understand the second part "Types distributed between units should also be duplicated" - types can refer to types in other units (using DW_FORM_sec_offset the same as the way that types will be referenced from the non-type contexts where the type needs to be referred to (eg: a compilation unit with a subprogram that needs to refer to the return type in the global type table/type unit/whatever)), I wouldn't advocate for any duplication.

The case of a parameterized member function does not have a good solution if it is split between compilation units.
It either requires duplicating part of the description (that is the solution I was talking about, which tries to avoid inter-CU references),
or it introduces an inter-CU reference (that is the solution you are talking about: DW_FORM_sec_offset/DW_FORM_ref_addr).
I think it would be good to use a solution that does not require duplicating data AND does not
require using inter-CU references (other than non-typeCU->typeCU):

cat main.cpp

char foo1 (void);
long foo2 (void);

int main ( void ) {
  return foo1() + foo2();
}

cat a.h

struct A {
  template <class T>
  T foo () {
    return T();
  }

  int createi () {
    return 0;
   }

  float createf ( ) {
    return 0.0;
  }
};

cat foo1.cpp

#include "a.h"

char foo1 (void) {
  return A().foo<char>(); 
}

cat foo2.cpp

#include "a.h"

namespace name1 {

struct B {
  long m1;
};

B internal (void) {
  return A().foo<B>();
}

}

long foo2 (void ) {
  return name1::internal().m1;
}

int foo3 () {

 A var;

 return var.createi() + var.createf();
}

If struct "A" and struct B are placed into the different compilation units, then we have two alternatives on how to represent struct "A".
First alternative assumes duplication of struct "A" definition while second alternative assumes inter-CU reference:

  1. The first compilation unit ("foo1.cpp") does not have a definition for "foo<name1::B>". The definition of "foo<name1::B>" is located in the second compilation unit ("foo2.cpp"). In this case we have a duplicated definition of struct "A".
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type   <<<<<
    DW_AT_name "A"        <<<<<

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_structure_type    <<<<<<
    DW_AT_name "A"         <<<<<<
    
    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"

  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type
    DW_AT_name "B"

  NULL
  2. All definitions are inside the first unit ("foo1.cpp"), but we have a cross-CU reference from the first unit to the second.
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type
    DW_AT_name "A"

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"   <<<<<<

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type   <<<<<<
    DW_AT_name "B"

  NULL

The solution in this patch solves both problems by putting all the data in a single compilation unit.
Thus we do not need to duplicate data, and we create only non-typeCU->typeCU inter-CU references.
If we want to split the "type table" into a set of smaller units to minimize peak memory usage, then we could probably
group types so that all dependent types end up in a single unit.
That way we could probably avoid both duplication and inter-CU references.
Avoiding inter-CU references might help to create a more effective implementation of a DWARF reader.

avl added a comment. Jan 10 2022, 8:41 AM

@dblaikie: I think @clayborg's point was mostly for a purely translation-unit-local entity it could stay in the translation unit that defined it - reducing the need for cross-CU referencing to refer to the type and for work needed to move the type around, etc.

Right, a purely translation-unit-local entity might be left in the original compile unit (not moved into the type table).
The current implementation (for simplicity) does not do this yet. It might be done as an improvement.
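
For illustration, a minimal sketch of the check such an improvement would need, assuming input types are modeled with a TU-local flag and a list of referenced types (both assumptions, not the patch's real representation): a type may only migrate to the type table if neither it nor anything it transitively references is translation-unit-local.

#include <set>
#include <vector>

struct TypeDesc {
  bool IsTULocal = false; // e.g. defined inside an anonymous namespace
  std::vector<const TypeDesc *> ReferencedTypes; // template args, members, ...
};

// Returns true if the type and everything it transitively references have
// linkage that is valid outside the defining translation unit, so the type
// may be merged into the global type table; otherwise it must stay in its
// original compile unit.
bool canMoveToTypeTable(const TypeDesc &T,
                        std::set<const TypeDesc *> &Visited) {
  if (T.IsTULocal)
    return false;
  if (!Visited.insert(&T).second)
    return true; // already being checked; avoid infinite recursion on cycles
  for (const TypeDesc *Ref : T.ReferencedTypes)
    if (Ref && !canMoveToTypeTable(*Ref, Visited))
      return false;
  return true;
}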

@clayborg: These imported declarations amounted to 3% of overall CU size for a single compile unit that I tested. I know that the stock dsymutil whitelists all of these which causes these and all the types the reference to be kept, so if we can get rid of any DW_TAG_imported_declaration DIEs that are not referenced, we might also be able to get rid of the type that they reference as well as all types from a C++ header file that have corresponding C header files like "cstdlib" -> "stdlib.h", they will have this issue and waste space.

@dblaikie: So imported modules are never referenced from other DWARF - if you want any of them to persist you have to treat them all as roots, you can't GC them.
@dblaikie: And for imported declarations, at least those generated from Clang you have to do the same until that bug is fixed in Clang to actually reference the using declarations.

@dblaikie: In any case I assume this issue is mostly orthogonal to this proposal (ie: I expect that's an existing issue in dsymutil, not a new regression introduced by this patch), yeah?

This patch should leave imported declarations in place, i.e. they are placed similarly to how the current dsymutil places them.

In D96035#3226125, @avl wrote:

I don't understand the second part "Types distributed between units should also be duplicated" - types can refer to types in other units (using DW_FORM_sec_offset the same as the way that types will be referenced from the non-type contexts where the type needs to be referred to (eg: a compilation unit with a subprogram that needs to refer to the return type in the global type table/type unit/whatever)), I wouldn't advocate for any duplication.

The case with parametrized function member does not have a good solution if separated between compilation units.
It either requires to duplicate part of the description(that is a solution which I was talking about trying to avoid inter-CU references),
either introduces inter-CU reference(that is solution which you are talking about(DW_FORM_sec_offset/DW_FORM_ref_addr)).
I think it would be good if we will use a solution that does not require duplicating data AND does not
require using inter-CU reference(other that non-typeCU->typeCU):

cat main.cpp

char foo1 (void);
long foo2 (void);

int main ( void ) {
  return foo1() + foo2();
}

cat a.h

struct A {
  template <class T>
  T foo () {
    return T();
  }

  int createi () {
    return 0;
   }

  float createf ( ) {
    return 0.0;
  }
};

cat foo1.cpp

#include "a.h"

char foo1 (void) {
  return A().foo<char>(); 
}

cat foo2.cpp

#include "a.h"

namespace name1 {

struct B {
  long m1;
};

B internal (void) {
  return A().foo<B>();
}

}

long foo2 (void ) {
  return name1::internal().m1;
}

int foo3 () {

 A var;

 return var.createi() + var.createf();
}

If struct "A" and struct B are placed into the different compilation units, then we have two alternatives on how to represent struct "A".
First alternative assumes duplication of struct "A" definition while second alternative assumes inter-CU reference:

  1. The first compilation unit("foo1.cpp") does not have a definition for "foo<name1::B>". The definition for "foo<name1::B>" is located into the second compilation unit("foo2.cpp"). In this case we have a duplicated definition of struct "A".
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type   <<<<<
    DW_AT_name "A"        <<<<<

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_structure_type    <<<<<<
    DW_AT_name "A"         <<<<<<
    
    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"

  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type
    DW_AT_name "B"

  NULL
  1. All definitions are inside the first unit("foo1.cpp"), but we have a cross CU reference from the first unit to the second.
DW_TAG_compile_unit
  DW_AT_name "foo1.cpp"
  
  DW_TAG_structure_type
    DW_AT_name "A"

    DW_TAG_subprogram
      DW_AT_name "foo<char>"

    DW_TAG_subprogram
      DW_AT_name "createi"

    DW_TAG_subprogram
      DW_AT_name "createf"

    DW_TAG_subprogram
      DW_AT_name "foo<name1::B>"

        DW_TAG_template_type_parameter
          DW_AT_type "name1::B"   <<<<<<

  NULL
  
  
DW_TAG_compile_unit
  DW_AT_name "foo2.cpp"
  
  DW_TAG_namespace
    DW_AT_name "name1"

  DW_TAG_structure_type   <<<<<<
    DW_AT_name "B"

  NULL

The solution from this patch solves both problems by putting all data in a single compilation unit.
Thus we do not need to duplicate data and create only non-typeCU->typeCU inter-CU references.
If we want to split "type table" into a set of smaller units to minimize peak memory usage then we probably could
group types so that all dependent types would be located in a single unit.
That way we probably could avoid duplication and inter-CU references.
Avoiding inter-CU references might help to create more effective implementation of an DWARF reader.

This doesn't address the case where struct A is a local type (in an anonymous namespace). Moving it out of the translation/compile unit it's defined in will break the representation (because the name is only valid within that compile unit - so you might end up pulling two types with the same name into the type table unit, making it hard for a DWARF consumer to do correct name lookup/know which version of the type the user is talking about in a given context).

What we could do is a third option - not duplicating the whole definition of A, but only part of the definition.

Take a look at the DWARF Clang produces for something like this, for instance:

struct t1 {
  virtual void f1(); // home this type into another translation unit
  template<typename T>
  void f1() { }
};
namespace {
struct t2 { };
}
int main() {
  t1().f1<t2>();
}

The DWARF for this will include a declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration (& then an out of line definition that refers to that declaration) (in this case we could potentially also include a DW_AT_specification in the declaration that points to the definition in the type table, if that helps consumers significantly - so they don't have to do name lookup to figure out that they're the same thing).

Speaking of all that - how are member function definitions working in this current proposal? Do they use sec_offsets to refer to the declaration in the type table, or do they already do something like ^ with a type declaration in the same unit to house the member function declaration?

avl added a comment. Jan 11 2022, 5:04 AM

! In D96035#3232592, @dblaikie wrote:

This doesn't address the case where struct A is a local type (in an anonymous namespace). Moving it out of the translation/compile unit it's defined it will break the representation (because the name is only valid within that compile unit - so you might end up pulling two types with the same name into the type table unit, making it hard for a DWARF consumer to do correct name lookup/know which version of the type the user is talking about in a given context).

This patch puts such types into the type table separately. It creates two different namespaces with different DW_AT_decl_file attributes:

DW_TAG_compile_unit
  DW_AT_name "__type_table"
  
  DW_TAG_namespace
    DW_AT_decl_file "compile_unit1.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

  DW_TAG_namespace
    DW_AT_decl_file "compile_unit2.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

NULL

What we could do is a third option - not duplicating the whole definition of A, but only part of the definition.

Take a look at the DWARF Clang produces for something like this, for instance:

struct t1 {
  virtual void f1(); // home this type into another translation unit
  template<typename T>
  void f1() { }
};
namespace {
struct t2 { };
}
int main() {
  t1().f1<t2>();
}

The DWARF for this will include a declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration (& then an out of line definition that refers to that declaration) (in this case we could potentially also include a DW_AT_specification in the declaration that points to the definition in the type table, if that helps consumers significantly - so they don't have to do name lookup to figure out that they're the same thing).

Yes, I assumed a similar approach here - https://reviews.llvm.org/D96035#3222090 (the part about type splitting).

If type splitting were supported, this approach would allow leaving anonymous types inside the compilation unit at the cost of a little duplication and without unnecessary inter-CU references. It would also allow decreasing the overall size of the "type table". The current patch does not support type splitting; I suggest implementing it as an additional improvement.

Type splitting is necessary because the originally generated "declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration" might contain other members. So we need to move those other members into the type table and leave the "declaration of the f1<t2> member" in place. Otherwise (because the other members would remain non-deduplicated), the size of the resulting DWARF would be bigger.

Speaking of all that - how are member function definitions working in this current proposal? Do they use sec_offsets to refer to the declaration in the type table, or do they already do something like ^ with a type declaration in the same unit to house the member function declaration?

All member function definitions are merged into the type that lives in the type table. Later they are referenced (using DW_FORM_ref_addr) from the compile units:

DW_TAG_compile_unit
  DW_AT_name "__type_table"

  DW_TAG_structure_type
    DW_AT_name "t1"
      
    DW_TAG_subprogram
      DW_AT_name "t1"
    NULL

    DW_TAG_subprogram
      DW_AT_name "f1"
      
      DW_TAG_template_type_parameter
        DW_AT_type "__type_table::main.cpp::t2"  // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
    NULL
  NULL
  
  DW_TAG_namespace
    DW_AT_decl_file "main.cpp"
    
    DW_TAG_struct_type
      DW_AT_name "t2"
    NULL
  NULL

NULL

DW_TAG_compile_unit
  DW_AT_name "main.cpp"
  
  DW_TAG_subprogram
    DW_AT_name "main"
  NULL

  DW_TAG_subprogram
    DW_AT_name "t1"
    DW_AT_specification "__type_table::t1::t1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

  DW_TAG_subprogram
    DW_AT_name "f1"
    DW_AT_specification "__type_table::t1::f1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

NULL
In D96035#3234040, @avl wrote:

! In D96035#3232592, @dblaikie wrote:

This doesn't address the case where struct A is a local type (in an anonymous namespace). Moving it out of the translation/compile unit it's defined it will break the representation (because the name is only valid within that compile unit - so you might end up pulling two types with the same name into the type table unit, making it hard for a DWARF consumer to do correct name lookup/know which version of the type the user is talking about in a given context).

This patch put such types into the type table separately. It creates two different namespaces with different DW_AT_decl_file attribute:

DW_TAG_compile_unit
  DW_AT_name "__type_table"
  
  DW_TAG_namespace
    DW_AT_decl_file "compile_unit1.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

  DW_TAG_namespace
    DW_AT_decl_file "compile_unit2.cpp"
    
    DW_TAG_class_type
      DW_AT_name "B"
    NULL
  NULL

NULL

Yeah, I don't think that's enough to address the scoping issue - which file they're written in doesn't define their scope of name lookup, unfortunately. (eg: you could #include a file into another file - the file the code is written in doesn't tell you whether another entity defined in a different file can see that name (because the files could be included into the same translation unit)) - local types, I think, have to stay in the compile unit that defined them - I don't think there's any other solution to the language scoping.

What we could do is a third option - not duplicating the whole definition of A, but only part of the definition.

Take a look at the DWARF Clang produces for something like this, for instance:

struct t1 {
  virtual void f1(); // home this type into another translation unit
  template<typename T>
  void f1() { }
};
namespace {
struct t2 { };
}
int main() {
  t1().f1<t2>();
}

The DWARF for this will include a declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration (& then an out of line definition that refers to that declaration) (in this case we could potentially also include a DW_AT_specification in the declaration that points to the definition in the type table, if that helps consumers significantly - so they don't have to do name lookup to figure out that they're the same thing).

yes. I assumed the similar approach here - https://reviews.llvm.org/D96035#3222090 (part about types splitting).

If type splitting were supported then this approach allows leaving anonymous types inside the compilation unit at the cost of little duplication and without unnecessary inter-CU references. It also allows decreasing the overall size of "type table". The current patch does not support type splitting. I suggest implementing it as an additional improvement.

Types splitting is necessary because originally generated "declaration of t1 (so, some duplication - but not all the members, etc) with a declaration of the f1<t2> member in that t1 declaration" might contain other members. So we need to move these other members into the type table and leave "declaration of the f1<t2> member" in place. Otherwise, (because other members will stay not deduplicated) the size of resulting DWARF would be bigger.

Yep, that sounds right to me. To the best of my knowledge, the sort of things that may need to get "left behind" in the compilation unit/not moved to the type table, would be member templates like the one we've been discussing, and nested types (eg: you could have a pimpl idiom - struct t1 { struct t2; }; ... namespace { struct t3 { }; } struct t1::t2 { t3 m; }; so the whole definition of t2 would have to stay in the compilation unit with t3, I think - because it needs to reference t3. The type table could include a declaration of t2, since t1 might include pointers to t2 or other references that only require a declaration)

The nice thing about this particular way of splitting - is that there's loads of existing precedent for it. This sort of DWARF (declarations with members) is generated by clang all the time for homed debug info. (though, that said, lldb isn't /super/ fond of this sort of debug info - so it wouldn't surprise me if it'd need some fixes to handle this sort of thing in a dsym, but we might be lucky there & it already works because at least all the types are still defined within the single dsym)

Speaking of all that - how are member function definitions working in this current proposal? Do they use sec_offsets to refer to the declaration in the type table, or do they already do something like ^ with a type declaration in the same unit to house the member function declaration?

All member function definitions are merged into the type which is in type table. Later they are referenced(using DW_FORM_ref_addr) from compile units:

DW_TAG_compile_unit
  DW_AT_name "__type_table"

  DW_TAG_structure_type
    DW_AT_name "t1"
      
    DW_TAG_subprogram
      DW_AT_name "t1"
    NULL

    DW_TAG_subprogram
      DW_AT_name "f1"
      
      DW_TAG_template_type_parameter
        DW_AT_type "__type_table::main.cpp::t2"  // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
    NULL
  NULL
  
  DW_TAG_namespace
    DW_AT_decl_file "main.cpp"
    
    DW_TAG_struct_type
      DW_AT_name "t2"
    NULL
  NULL

NULL

DW_TAG_compile_unit
  DW_AT_name "main.cpp"
  
  DW_TAG_subprogram
    DW_AT_name "main"
  NULL

  DW_TAG_subprogram
    DW_AT_name "t1"
    DW_AT_specification "__type_table::t1::t1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

  DW_TAG_subprogram
    DW_AT_name "f1"
    DW_AT_specification "__type_table::t1::f1" // I used names instead offsets, in fact this is DW_FORM_ref_addr referencing into "__type_table"
  NULL

NULL

Ah, OK - thanks for the details!

(I should say, whether or not it's necessary/prerequisite to handle the anonymous types more accurately, through type splitting - I don't have an opinion on - dsymutil isn't something I use/have any investment in - I'm just discussing the DWARF representational tools and techniques that might be used to address some of these limitations. Folks who own/maintain/ship dsymutil are the ones to make any final choices about what is/isn't acceptable here)

avl updated this revision to Diff 400601. Jan 17 2022, 10:16 AM

rebased. removed usages of NonRelocatableStringpool.

ormris removed a subscriber: ormris. Jan 18 2022, 9:38 AM
avl updated this revision to Diff 404549. Jan 31 2022, 8:40 AM

rebased.
refactored.
implemented performance improvements (it is 1.3x slower in single-thread mode, 2.4x faster in multi-thread mode).
implemented type splitting.

avl edited the summary of this revision. Jan 31 2022, 8:42 AM
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single thread mode, 2.4x faster im multi-thread mode).

1.3x slower means what took 1second before now takes 2.3 seconds?

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs? Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single thread mode, 2.4x faster im multi-thread mode).

1.3x slower means what took 1second before now takes 2.3 seconds?

(https://randomascii.wordpress.com/2018/02/04/what-we-talk-about-when-we-talk-about-performance/ The ambiguity made me start to use "1.3x as fast" )

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs? Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

avl added a comment (edited). Jan 31 2022, 11:13 AM
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single thread mode, 2.4x faster im multi-thread mode).

1.3x slower means what took 1second before now takes 2.3 seconds?

No, sorry, I meant what took 1second before now takes 1*1.3 = 1.3sec.

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs?

yes. anonymous/non-linkage types are correctly kept in their original CUs. These are tests for that functionality:

odr-nested-types1.test
odr-types-in-subprogram1.test

Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

Splitting up types into chunks is not implemented yet. I think it would take some time; I would prefer to implement it as a separate improvement to this patch.

In D96035#3284921, @avl wrote:
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single thread mode, 2.4x faster im multi-thread mode).

1.3x slower means what took 1second before now takes 2.3 seconds?

No, sorry, I meant what took 1second before now takes 1*1.3 = 1.3sec.

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs?

yes. anonymous/non-linkage types are correctly kept in their original CUs. These are tests for that functionality:

odr-nested-types1.test
odr-types-in-subprogram1.test

Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

splitting up types into chunks is not implemented yet. I think it would take some time. I would prefer to implement it as an separate improvement for this patch.

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

In D96035#3284921, @avl wrote:
In D96035#3284247, @avl wrote:

rebased.
refactored.
implemented performance improvements(it is 1.3x slower in single thread mode, 2.4x faster im multi-thread mode).

1.3x slower means what took 1second before now takes 2.3 seconds?

No, sorry, I meant what took 1second before now takes 1*1.3 = 1.3sec.

implemented types splitting.

This means that anonymous/non-linkage types are correctly kept in their original CUs?

yes. anonymous/non-linkage types are correctly kept in their original CUs. These are tests for that functionality:

odr-nested-types1.test
odr-types-in-subprogram1.test

Does it mean other things too/instead? (splitting up types into chunks to avoid exceptionally large units containing all types?)

splitting up types into chunks is not implemented yet. I think it would take some time. I would prefer to implement it as an separate improvement for this patch.

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be.

I think the major reason it'd be resisted would be the cost - increasing dsymutil time significantly would be a hard sell. I think it's pretty understandable that the sort of parallel processing @avl's experimenting with here wouldn't be able to do that efficiently.

Arbitrarily grouping types into smallish buckets to reduce the overhead for DWARF consumers seems more feasible, though, I think.

avl added a comment. Feb 1 2022, 6:58 AM

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

There are some disadvantages to creating an additional compilation unit for each source compile unit:

  1. Fragmentation and duplication. It would be necessary to duplicate the unit header, unit DIE, namespace DIEs, base types, line table headers, line table files, and abbreviation table. clang has approximately 1600 compilation units, so we would need to duplicate all of the above information for each of them. In the end, we might lose some of the DWARF size savings.
  2. Closely coupled references. If all types were placed in separate compilation units matched to the original unit of declaration, then the types would reference each other. As a result, it would be hard to process such units in parallel (independently), which limits the acceleration that can be achieved by parallelization. This patch tries to avoid cross-CU references; only one kind is allowed: non-type-CU -> type-CU.

What about the following solution: the current type table unit (let's say 60MB) would be divided into several buckets (let's say 16) of independent types. Each bucket is placed in a separate artificial compilation unit. That way there would be no references between units and not much duplicated information. The size of each separate type unit would be around 4MB (which would help lldb avoid parsing too much). Could this be a good solution? It looks like it keeps the benefits (small final size of the overall DWARF file, simple references, small size of each compile unit). It would probably also help to speed up multi-thread execution of DWARFLinker (if all type units were generated in parallel), but I am afraid it would slow down single-thread execution.
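
One way such independent buckets could be found is by treating the deduplicated types as a graph whose edges are type-to-type references and taking its connected components. The sketch below is only an illustration of that idea under assumed names (TypeId, bucketTypes, the edge list); it is not code from this patch or from DWARFLinker.

    // Hypothetical sketch: partition deduplicated types into independent buckets
    // by computing connected components of the type-reference graph (union-find).
    #include <cstdint>
    #include <numeric>
    #include <utility>
    #include <vector>

    using TypeId = uint32_t;

    struct Buckets {
      std::vector<TypeId> Parent;
      explicit Buckets(size_t N) : Parent(N) {
        std::iota(Parent.begin(), Parent.end(), 0);
      }
      TypeId find(TypeId T) {
        while (Parent[T] != T)
          T = Parent[T] = Parent[Parent[T]]; // path halving
        return T;
      }
      void unite(TypeId A, TypeId B) { Parent[find(A)] = find(B); }
    };

    // Edges are (referencing type, referenced type) pairs collected while
    // analyzing type references (DW_AT_type, DW_AT_containing_type, ...).
    std::vector<TypeId> bucketTypes(TypeId NumTypes,
                                    const std::vector<std::pair<TypeId, TypeId>> &Edges) {
      Buckets B(NumTypes);
      for (const auto &E : Edges)
        B.unite(E.first, E.second);
      std::vector<TypeId> BucketOf(NumTypes);
      for (TypeId T = 0; T != NumTypes; ++T)
        BucketOf[T] = B.find(T); // types with the same root share a type CU
      return BucketOf;
    }

Types that end up with the same root never reference anything outside their bucket, so each bucket could be emitted (and later consumed) as a self-contained type CU; small components could additionally be merged up to a size budget such as the 4MB mentioned above.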

In D96035#3287393, @avl wrote:

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

There are some disadvantages with creating additional compilation units for each source compile unit:

  1. Fragmentation and duplication. It would be necessary to duplicate: unit header, unit DIE, namespace DIEs, base types, line table headers, line table files, abbreviation table. clang has approx. 1600 compilation units, so we would need to duplicate all of the above information for each of them. In the end, we might lose some of the DWARF size gains.

Oh, yeah, too many units sounds unfortunate for sure. Currently we'd have one CU per .cpp file, and with "putting types in a synthetic unit that matches their decl_file" we'd have an additional one CU per .h file - which, admittedly, is probably a bit more than doubling the number of units. (template instantiations, for instance, might all be grouped together in the same unit, since they're defined in the same header)

I don't know that this makes the DWARF 'the best it can be' - it's not clear to me what expressive power is provided by the types being in units that match source file names (indeed the information is provided in the DWARF either way - a consumer can treat types defined in a certain file in a certain way even if they're all in one big CU).

  2. Closely coupled references. If all types were placed in separate compilation units matched with the original unit of declaration, then types would reference each other. As a result, it would be hard to process such units in a parallel manner (independently). This limits the acceleration that can be achieved by parallelization. This patch tries to avoid cross-CU references; only one kind is allowed: non-type-CU -> type-CU.

Yeah.

What about the following solution: the current type table unit (let's say 60MB) would be divided into several buckets (let's say 16) of independent types. Each bucket is placed in a separate artificial compilation unit. That way there would be no references between units and not much duplicated information. The size of each separate type unit would be around 4MB (which would help lldb avoid parsing too much). Could this be a good solution? It looks like it keeps the benefits (small final size of the overall DWARF file, simple references, small size of each compile unit). It would probably also help to speed up multi-thread execution of DWARFLinker (if all type units were generated in parallel), but I am afraid it would slow down single-thread execution.

Finding independent buckets of types sounds difficult/algorithmically complicated. But maybe that's feasible? I'm not sure. I was thinking more "emit all the types the same way you do currently, except into multiple unit "Chunks"" - ie: the code already handles type-to-type references within the single type-CU, so I don't understand (maybe I'm missing something) why it would be difficult to treat that "type-CU" as actually being "multiple type CUs" with arbitrary cross-referencing within that collection of type CUs. Then the chunks/buckets are chosen arbitrarily - admittedly that means longer encoding (sec_offset references are longer than the unit-relative references) than if you can group the types together into isolated/only-self-referencing groups - so maybe the extra space savings is worth the work to create those isolated groups? Naively, I would not have expected it to be worth that much.

avl added a comment.Feb 2 2022, 5:50 AM

Finding independent buckets of types sounds difficult/algorithmically complicated. But maybe that's feasible? I'm not sure.

I do not have a ready implementation, so I do not know whether an implementation would be acceptable, or even whether it is possible to divide the whole type table into 16 independent parts. That needs some investigation.

I was thinking more "emit all the types the same way you do currently, except into multiple unit "Chunks"" - ie: the code already handles type-to-type references within the single type-CU, so I don't understand (maybe I'm missing something) why it would be difficult to treat that "type-CU" as actually being "multiple type CUs" with arbitrary cross-referencing within that collection of type CUs. Then the chunks/buckets are chosen arbitrarily - admittedly that means longer encoding (sec_offset references are longer than the unit-relative references) than if you can group the types together into isolated/only-self-referencing groups - so maybe the extra space savings is worth the work to create those isolated groups? Naively, I would not have expected it to be worth that much.

The current implementation deals with one big type unit and loads all types into memory. A similar implementation could be done for "multiple type CUs". But if we had several independent type CUs, we might be able to avoid loading all of them into memory at once and instead process them separately and discard them afterwards. This would reduce memory requirements.

So yes, we might change the current implementation to generate several type CUs referencing each other. It would have the same properties as the current implementation, except that lldb would be able to load it in smaller chunks. If we were able to divide the whole type unit into several independent parts, then we might use a more efficient implementation (one which uses less memory, for example).

avl added a comment.Feb 2 2022, 5:58 AM

I've got the following message in my mailbox from @clayborg. I believe it was sent here but did not appear because of some error, so I am duplicating it here.

On Feb 1, 2022, at 9:26 AM, David Blaikie via Phabricator <reviews@reviews.llvm.org> wrote:

dblaikie added a comment.

In D96035#3287393 https://reviews.llvm.org/D96035#3287393, @avl wrote:

In D96035#3285620 https://reviews.llvm.org/D96035#3285620, @clayborg wrote:

I do believe that splitting types up into a compile unit that matches the DW_AT_decl_file would make this patch really hard to resist as it then makes the DWARF the best it can be. The nice thing is that if this is done it makes it very easy to tell where a type should be defined. So if the type's DW_AT_decl_file matches the current CU or if this is an anonymous namespace, then the type stays where it is. If it doesn't match, then it gets moved to a new compile unit. I don't know exactly how complex this would be, but it seems like it shouldn't be too hard. The huge type unit has the ability to greatly impact debugger performance as the code stands now because as soon as the debugger needs any type, it will have to parse all of the DIEs in the type compile unit. LLDB parses DWARF lazily and only pulls in what it needs, but with these binaries we would need to parse some 60MB of type DIEs as soon as anyone needs a type.

There are some disadvantages with creating additional compilation units for each source compile unit:

  1. Fragmentation and duplication. It would be necessary to duplicate: unit header, unit die, namespace dies, base types, line table headers, line table files, abbreviation table. clang has approx 1600 compilation units. So we need to duplicate all the above information for each of them. At the end of all, we might lose some DWARF size achievements.

base types wouldn't need to be duplicated right? We can put any types with no decl file into the current type unit you created and have everyone use absolute links to them? I was not a fan of the abbreviation tables being split up, I believe the llvm dsymutil has done this for a while now, as it leads to duplication as you already mention. I would prefer one big .debug_abbrev section if possible so all compile units can re-use it, but I understand that this would slow things down.

Oh, yeah, too many units sounds unfortunate for sure. Currently we'd have one CU per .cpp file, and with "putting types in a synthetic unit that matches their decl_file" we'd have an additional one CU per .h file - which, admittedly, is probably a bit more than doubling the number of units. (template instantiations, for instance, might all be grouped together in the same unit, since they're defined in the same header)

So 11 bytes for a compile unit plus a few for the main DIE doesn't sound too bad to me when everything would be perfectly organized in the DWARF. Since all types would have matching DW_AT_decl_file attributes, the line table should be very small, with hopefully just a few files. I would love to see a file done like this just so we can compare the sizes of an actual file to see if we are using up that much more space.

I don't know that this makes the DWARF 'the best it can be' - it's not clear to me what expressive power is provided by the types being in units that match source file names (indeed the information is provided in the DWARF either way - a consumer can treat types defined in a certain file in a certain way even if they're all in one big CU).

DWARF has always been a "shove a bunch of stuff into one big blob per compile unit" with no regard to how it will be used in the linking phase since DWARF was created. Linkers are expected to concatenate and relocate and call it a day. This leaves a ton of debug info in the linked executable that isn't useful, like functions that should have been dead stripped, but weren't because of the limitations of the DWARF format and what can be done quickly by linkers that don't understand the format. It is not easy for do anything with DWARF and it isn't a great format for just accessing a thing or two here and there since it is a serialized stream of conciousness of what is in compile unit with no real organization other than "I just put it all in one bucket".

I love the idea of having the DWARF be really well organized from a consumer standpoint. Not many producers are thinking about how this information is consumed, just how quickly we can make it or link it. There is definitely value to doing things quickly, don't get me wrong, so for the compile/edit/debug cycle we need to keep things fast. But for when you want to create a build and archive it, I would rather spend a few more cycles getting the DWARF into the best shape it can be in so that it can be consumed efficiently by symbolication servers that want to use it, or by debuggers debugging a released build, than save a few seconds or bytes in the file.

  2. Closely coupled references. If all types were placed in separate compilation units matched with the original unit of declaration, then types would reference each other. As a result, it would be hard to process such units in a parallel manner (independently). This limits the acceleration that can be achieved by parallelization. This patch tries to avoid cross-CU references; only one kind is allowed: non-type-CU -> type-CU.

Currently you have one compile unit that all threads must access and register their types with right? I don't see how splitting this up would cause more delays than all threads vying for access to the one and only type unit? Seems like it would actually reduce lock contention, but I haven't fully looked at your current solution, so forgive my saying so if this is way off.

Why are we trying to avoid cross-CU references with DW_FORM_ref_addr? This is placing the emphasis on linking speed once again over how this information will be accessed by consumers. And one might say that bad DWARF organization can cause more delays when these DWARF files are used for debugging or symbolication, by having to parse all type DIEs for all types just because it was faster to create the DWARF file in the first place.

Yeah.

What about the following solution: Current type table unit(let`s say 60MB) would be divided into several buckets(let`s say 16) of independent types. Each bucket is placed in separate artificial compilation unit. So that there would not be references between units, there would not be a lot of duplicated information. The size of each separate type unit would be around 4MB(it would help to lldb to not parse much). Can this be a good solution? It looks like it allows to keep benefits(small final size of overall DWARF file, simple references, small size of each compile unit). It also would probably help to speed up multi-thread execution of DWARFLinker(if all type units would be generated in parallel) but I am afraid it would slow down single-thread execution.

Does anyone care about single-thread execution time? I personally don't.

Finding independent buckets of types sounds difficult/algorithmically complicated. But maybe that's feasible? I'm not sure. I was thinking more "emit all the types the same way you do currently, except into multiple unit "Chunks"" - ie: the code already handles type-to-type references within the single type-CU, so I don't understand (maybe I'm missing something) why it would be difficult to treat that "type-CU" as actually being "multiple type CUs" with arbitrary cross-referencing within that collection of type CUs. Then the chunks/buckets are chosen arbitrarily - admittedly that means longer encoding (sec_offset references are longer than the unit-relative references) than if you can group the types together into isolated/only-self-referencing groups - so maybe the extra space savings is worth the work to create those isolated groups? Naively, I would not have expected it to be worth that much.

DW_FORM_ref_addr is 32 bits for DWARF32, and not many use DWARF64 as it has other size increases, so using these to do cross CU references is just as cheap as a CU relative reference. And I haven't seen any compilers take advantage of using DW_FORM_ref1 or DW_FORM_ref2, they just use DW_FORM_ref4 by default which is the same size as DW_FORM_ref_addr.

I would love to see the "each type gets put into their own compile unit based on the DW_AT_decl_file of each type", just to see what this does to the size of the DWARF compared to the currently solution. If the size increase is too much, I would drop my suggestion. Is trying this out hard to do and not worrying at all about performance?

One other solution would be to group types into common subdirectories based off of the DW_AT_decl_file. The idea being that all STL types could be in one common compile unit for anything with a DW_AT_decl_file that starts with a common directory prefix? This would allow the "vector", "list" and "map" types for the STL to all be in a specific type compile unit. So the idea would be to strip the file base name from the DW_AT_decl_file and group all types from all header files in that directory into the same type unit?

avl added a comment.Feb 2 2022, 7:37 AM

base types wouldn't need to be duplicated right? We can put any types with no decl file into the current type unit you created and have everyone use absolute links to them?

I am currently implementing an optimization where basic types are always placed at the start of the current CU, so that a DW_FORM_ref1 reference can be used to access them. If that optimization turns out to be useful, then it would be necessary to duplicate them. But without such an optimization, there is no need to duplicate basic types.
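
For context on why placing basic types at the start of the unit enables this: DW_FORM_ref1/ref2/ref4 encode an offset from the beginning of the containing compilation unit in 1/2/4 bytes, so only DIEs within the first 256 bytes of the unit are reachable with the 1-byte form. The helper below is a made-up illustration of picking the smallest form by offset (DWARF64 and DW_FORM_ref8 are ignored); it is not code from the patch.

    #include <cstdint>

    enum class RefForm { Ref1, Ref2, Ref4 };

    // UnitRelativeOffset is the offset of the referenced DIE from the start
    // of its compilation unit.
    RefForm smallestRefForm(uint64_t UnitRelativeOffset) {
      if (UnitRelativeOffset <= UINT8_MAX)
        return RefForm::Ref1; // only DIEs in the first 256 bytes qualify
      if (UnitRelativeOffset <= UINT16_MAX)
        return RefForm::Ref2;
      return RefForm::Ref4;
    }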

I was not a fan of the abbreviation tables being split up, I believe the llvm dsymutil has done this for a while now, as it leads to duplication as you already mention. I would prefer on big .debug_abbrev section if possible so all compile units can re-use, but I understand that this would slow things down.

The problem with the abbreviation table is not that it affects performance.
I think it is possible to implement a fast, lock-free global abbreviation table.
The problem is deterministic output. If several threads write into a
single global abbreviation table, then the order of elements inside the table
becomes non-deterministic. Thus, I currently do not see how parallel processing
of compilation units could be done using a global abbreviation table.

Without the requirement of deterministic output, a global abbreviation table could be implemented efficiently.
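
To make the determinism problem concrete, here is a minimal sketch (illustrative names only, not DWARFLinker code). Abbreviation codes are handed out in first-use order: with one table per CU the order depends only on that CU's DIEs, while with one table shared by all worker threads it depends on thread scheduling, so the codes written into .debug_info can differ between runs.

    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>

    struct AbbrevTable {
      std::map<std::string, uint32_t> Codes; // abbreviation declaration -> code
      uint32_t NextCode = 1;
      uint32_t getOrAssign(const std::string &AbbrevDecl) {
        auto Res = Codes.emplace(AbbrevDecl, NextCode);
        if (Res.second)
          ++NextCode; // first use: a new code was assigned
        return Res.first->second;
      }
    };

    // Per-CU table: deterministic, no synchronization needed.
    struct PerUnitEmitter {
      AbbrevTable Abbrevs;
      uint32_t codeFor(const std::string &Decl) { return Abbrevs.getOrAssign(Decl); }
    };

    // Shared table: needs a lock, and the code a given abbreviation receives
    // depends on which thread reached the table first.
    struct SharedAbbrevTable {
      AbbrevTable Table;
      std::mutex M;
      uint32_t codeFor(const std::string &Decl) {
        std::lock_guard<std::mutex> Lock(M);
        return Table.getOrAssign(Decl);
      }
    };

With the shared variant, the same abbreviation declaration can receive code 1 in one run and code 7 in another, which changes the bytes written into .debug_info even though the DWARF is semantically identical.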

So 11 bytes for a compile unit plus a few for the main DIE doesn't sound too bad to me when everything would be perfectly organized in the DWARF. Since all types would have matching DW_AT_decl_file attributes, the line table should be very small, with hopefully just a few files. I would love to see a file done like this just so we can compare the sizes of an actual file to see if we are using up that much more space.

I would love to see the "each type gets put into their own compile unit based on the DW_AT_decl_file of each type", just to see what this does to the size of the DWARF compared to the currently solution. If the size increase is too much, I would drop my suggestion. Is trying this out hard to do and not worrying at all about performance?

Also several namespace DIEs, the line table header, the abbreviation table, and duplicated info inside the line table - all of them together might take a significant amount of space. I would try to model that case and see the real numbers.

Currently you have one compile unit that all threads must access and register their types with right? I don't see how splitting this up would cause more delays than all threads vying for access to the one and only type unit? Seems like it would actually reduce lock contention, but I haven't fully looked at your current solution, so forgive my saying so if this is way off.

It's not about delays. I am trying to make compile units independent.

Why are we trying to avoid cross-CU references with DW_FORM_ref_addr? This is placing the emphasis on linking speed once again over how this information will be accessed by consumers. And one might say that bad DWARF organization can cause more delays when these DWARF files are used for debugging or symbolication, by having to parse all type DIEs for all types just because it was faster to create the DWARF file in the first place.

Generally speaking, cross-CU references make CUs depend on each other. That complicates parallel processing. If a CU does not depend on others, then it can be processed and thrown out. Otherwise, if it depends on other CUs, we need to load all dependent CUs into memory and process them together.

If we can separate the monolithic type CU into several independent units, it would help debugging or symbolication to not parse all type DIEs, and it would help other tools (like dsymutil) to process incoming DWARF quickly with minimal RAM.

Does anyone care about single-thread execution time? I personally don't.

If we do not then things are easier :-)

One other solution would be to group types into common subdirectories based off of the DW_AT_decl_file. The idea being that all STL types could be in one common compile unit for anything with a DW_AT_decl_file that starts with a common directory prefix? This would allow the "vector", "list" and "map" types for the STL to all be in a specific type compile unit. So the idea would be to strip the file base name from the DW_AT_decl_file and group all types from all header files in that directory into the same type unit?

My current attempt is to try to avoid cross-CU references. Let me do some investigation of possible strategies.
Probably, that suggestion (to group types on a common directory prefix) would be close to what I am trying to achieve.
Which size of type unit is acceptable? If all STL types require, for example, 4MB - is that OK?

Does anyone care about single-thread execution time? I personally don't.

If we do not then things are easier :-)

I think @JDevlieghere and @aprantl probably care about it just from a regression perspective - but I could be wrong. (I thought they'd expressed some concerns about that in this/other reviews previously).

Apple folks ^ have you got a sense of if/how much of a single threaded regression you'd want to accept in this work, for some multithreaded gains?

avl added a comment.Feb 11 2022, 7:37 AM

@clayborg I've done some research and have a couple of questions. Would you mind looking at them, please?

  1. The first thing is that I tried to model the case with separate compilation units for each declaration file, and it seems that the overall impact is about 0.78% of the debug_info+debug_abbrev size of the clang binary. The debug_info+debug_abbrev size is 158.7MB. The additional size required for multiple compilation units is 1.23MB (2051*(100+500)). 2051 is the number of files used in decl_file attributes, 100 bytes covers the size of the compilation unit header, compilation unit DIE, namespace DIEs, and line table header, and 500 is the size of a separate abbreviation table (see the arithmetic sketch after this list).
  2. The second thing is that I divided all types into independent buckets. The current type table for the clang binary takes approx. 40MB. The maximal size of a bucket containing dependent types is 1.6MB. So, if the original type table were divided into multiple compilation units based on type dependencies, then each type unit would be 1.6MB or less.

    Dividing types into independent buckets has several advantages: since they do not depend on each other, it is possible to handle them in a parallel/independent way. Another thing is that it potentially minimizes the number of loads which need to be done, because all dependent types are loaded at once. If types are not divided into independent buckets, we would need to load all dependencies via several loads (i.e., whenever a cross-CU reference is encountered).
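
The arithmetic behind the 0.78% figure in item 1, written out for verification (the per-unit and per-abbreviation-table byte counts are the rough estimates quoted above, not measured values):

    #include <cstdio>

    int main() {
      const double NumDeclFiles = 2051;   // files used in DW_AT_decl_file attributes
      const double PerUnitBytes = 100;    // CU header + CU DIE + namespace DIEs + line table header
      const double AbbrevBytes = 500;     // separate abbreviation table per unit
      const double TotalBytes = 158.7e6;  // current .debug_info + .debug_abbrev of clang

      const double Overhead = NumDeclFiles * (PerUnitBytes + AbbrevBytes); // ~1.23e6 bytes
      std::printf("overhead: %.2f MB (%.2f%% of %.1f MB)\n", Overhead / 1e6,
                  100.0 * Overhead / TotalBytes, TotalBytes / 1e6);
      // prints: overhead: 1.23 MB (0.78% of 158.7 MB)
      return 0;
    }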

Assuming we decided to split the current global type table on a "decl_file" basis as you suggested, what do you think about the following questions:

  1. How should the following situation be handled?

Source DWARF:

      DW_TAG_compile_unit
        DW_AT_name "cu1"
     
0x100:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
       
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
         
          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
        NULL
     
        DW_TAG_import 0x100   <<<<<<<<<<<<<<<<<<<
      NULL

Result DWARF:

      DW_TAG_compile_unit
        DW_AT_name "type_table_file1"

0x200:  DW_TAG_namespace  <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
          
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
          
        NULL

      DW_TAG_compile_unit
        DW_AT_name "type_table_file2"

0x300:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"

          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
          
        NULL

     DW_TAG_compile_unit
        DW_AT_name "cu1"

        DW_TAG_import 0x200 or 0x300 ?    <<<<<<<<<<<<<<<<<<<

Which offset corresponding to the "namespace1" should be used? Any of them?

  2. Would it be OK to split DW_TAG_module?

Source DWARF:

DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_module             <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"    <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
   
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL

  NULL
  
NULL

Result DWARF:

DW_TAG_compile_unit
  DW_AT_name "type_table_file1"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<<
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
    
  NULL
  
NULL
   
DW_TAG_compile_unit
  DW_AT_name "type_table_file2"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL
    
  NULL
  
NULL

Is it OK that DW_TAG_module would be split?

  3. Only root types should be moved into the compile unit for the corresponding "decl_file", right?
DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_class_type
    DW_AT_name "class1" 
    DW_AT_decl_file "file1"
    
    DW_TAG_subroutine
      DW_AT_name "method1"
      DW_AT_decl_file "file1"    <<<<<<<<<<<<<<<<<<<<<
    NULL
    
    DW_TAG_subroutine
      DW_AT_name "method2"
      DW_AT_decl_file "file2"     <<<<<<<<<<<<<<<<<<<<<
    NULL          
  NULL
NULL

i.e. "method1" and "method2" both should be placed into the compile unit for "file1", right?

  4. What do you think: would it be good to split the current monolithic type table not on a "decl_file" basis but on a "buckets of dependent types" basis?
In D96035#3314256, @avl wrote:

@clayborg I`ve done some research and have a couple of questions. Would you mind looking at them, please?

  1. The first thing is that I tried to model the case with separate compilation units for each declaration file, and it seems that the overall impact is about 0.78% of the debug_info+debug_abbrev size of the clang binary. The debug_info+debug_abbrev size is 158.7MB. The additional size required for multiple compilation units is 1.23MB (2051*(100+500)). 2051 is the number of files used in decl_file attributes, 100 bytes covers the size of the compilation unit header, compilation unit DIE, namespace DIEs, and line table header, and 500 is the size of a separate abbreviation table.

Thanks for trying this!

Under 1% (0.78%), I would be ok with this overhead for really clean DWARF. To clarify: was this all types for a specific DW_AT_decl_file in its own compile unit? Or was this for all files in the same directory?

  2. The second thing is that I divided all types into independent buckets. The current type table for the clang binary takes approx. 40MB. The maximal size of a bucket containing dependent types is 1.6MB. So, if the original type table were divided into multiple compilation units based on type dependencies, then each type unit would be 1.6MB or less.

    Dividing types into independent buckets has several advantages: since they are not dependent on each other it is possible to handle them in a parallel/independent way. Another thing is that it potentially minimizes the number of loads which should be done. It loads all dependent types at once. In case types are not divided into independent buckets we would need to load all dependencies by several loads(i.e. whenever cross-CU reference is encountered).

Assuming we decided to split the current global type table on the "decl_file" basis as you suggested, What do you think of the following questions:

  1. How the following situation should be handled:

Source DWARF:

      DW_TAG_compile_unit
        DW_AT_name "cu1"
     
0x100:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
       
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
         
          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
        NULL
     
        DW_TAG_import 0x100   <<<<<<<<<<<<<<<<<<<
      NULL

Result DWARF:

      DW_TAG_compile_unit
        DW_AT_name "type_table_file1"

0x200:  DW_TAG_namespace  <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"
          
          DW_TAG_structure
            DW_AT_name "S1"
            DW_AT_decl_file "file1"
          NULL
          
        NULL

      DW_TAG_compile_unit
        DW_AT_name "type_table_file2"

0x300:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "namespace1"

          DW_TAG_structure
            DW_AT_name "S2"
            DW_AT_decl_file "file2"
          NULL
          
        NULL

     DW_TAG_compile_unit
        DW_AT_name "cu1"

        DW_TAG_import 0x200 or 0x300 ?    <<<<<<<<<<<<<<<<<<<

We might have to include them both, right? Do we actually see DWARF like this? My initial guess is we would need them both if clients needed access to "S1" or "S2". Or, if we can track what that DWARF accesses within the DW_TAG_import, maybe only the ones that are needed.

Which offset corresponding to the "namespace1" should be used? Any of them?

2. Would it be OK to split DW_TAG_module ?


Source DWARF:
DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_module             <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"    <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
   
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL

  NULL
  
NULL
Result DWARF:
DW_TAG_compile_unit
  DW_AT_name "type_table_file1"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<<
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S1"
      DW_AT_decl_file "file1"
    NULL
    
  NULL
  
NULL
   
DW_TAG_compile_unit
  DW_AT_name "type_table_file2"

  DW_TAG_module            <<<<<<<<<<<<<<<<<<< 
    DW_AT_name "module1"   <<<<<<<<<<<<<<<<<<<
    
    DW_TAG_structure
      DW_AT_name "S2"
      DW_AT_decl_file "file2"
    NULL
    
  NULL
  
NULL
Is it OK, that DW_TAG_module would be split?

I believe it would be ok to split the module. I don't believe anyone refers to the module directly, do they? Only things within it, and as long as we fix up those direct references then that should be ok. I have to admit I am not familiar with DW_TAG_module as we don't use it much in LLDB AFAIK.

  3. Only root types should be moved into the compile unit for the corresponding "decl_file", right?
DW_TAG_compile_unit
  DW_AT_name "cu1"
  
  DW_TAG_class_type
    DW_AT_name "class1" 
    DW_AT_decl_file "file1"
    
    DW_TAG_subroutine
      DW_AT_name "method1"
      DW_AT_decl_file "file1"    <<<<<<<<<<<<<<<<<<<<<
    NULL
    
    DW_TAG_subroutine
      DW_AT_name "method2"
      DW_AT_decl_file "file2"     <<<<<<<<<<<<<<<<<<<<<
    NULL          
  NULL
NULL

i.e. "method1" and "method2" both should be placed into the compile unit for "file1", right?

Yes, the entire contents of a class should go into the DW_AT_decl_file unit for the class itself. DW_TAG_subprogram entries that belong to the class should live with the class.

  4. What do you think: would it be good to split the current monolithic type table not on a "decl_file" basis but on a "buckets of dependent types" basis?

If every type can go into its own compile unit for only an extra 0.78% cost, I find this totally acceptable and I would vote for this!

I am very excited about this feature and I will push for adoption of this within Facebook ASAP if this is viable, faster and also much better organized! The one thing to think about is seeing if we can make this work with the "--update" option to dsymutil (which reparses the DWARF and re-emits it and the accelerator tables). This --update option could then be used to post-process debug info before it gets permanently archived to disk, so the smaller and better organized the better! So if we can get this working we will have a huge set of code to run this through and test it with!

avl added a comment.Feb 14 2022, 8:17 AM

Thanks for trying this!

Under 1% (0.78%), I would be ok with this overhead for really clean DWARF. To clarify: was this all types for a specific DW_AT_decl_file in its own compile unit? Or was this for all files in the same directory?

That was all types for a specific DW_AT_decl_file in its own compile unit. I did not take into account the duplication of file names in the line table, but I think the overall size would still be within 1% (if there were no other duplicated things).

Things might be much worse if we needed to duplicate this kind of DIE for each compilation unit:

DW_TAG_const_type 
                DW_AT_type (0x000000000000004a "llvm::Twine")

DW_TAG_pointer_type
                DW_AT_type (0x000000000003c122 "const char")

We might have to include them both right? Do we actually see DWARF like this? But my initial guess is we would need them both if clients needed access to "S1" or "S2". Or if we can track what that DWARF accesses within the DW_TAG_import, maybe only the ones that are needed.

If we are going to include them both, then (in real examples) it means inserting dozens of DW_TAG_imported_module DIEs (for each used type).

We see DWARF like this very often:

using namespace llvm;

SmallSet<int, 3> Set;
SmallVector<const int *, 8> Ptrs;

Originally it looks like this:

      DW_TAG_compile_unit
        DW_AT_name "driver_cc1.cpp"
        
0x100:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "llvm"
       
          DW_TAG_class_type
            DW_AT_name "SmallSet<int, 3>"
            DW_AT_decl_file "SmallSet.h"
          NULL
         
          DW_TAG_class_type
            DW_AT_name "SmallVector<const int *, 8>"
            DW_AT_decl_file "SmallVector.h"
          NULL
        NULL
     
        DW_TAG_imported_module
          DW_AT_import 0x100   <<<<<<<<<<<<<<<<<<<
      NULL

with new scheme it would be:

      DW_TAG_compile_unit
        DW_AT_name "type_table_SmallSet"

0x200:  DW_TAG_namespace  <<<<<<<<<<<<<<<<<<<
          DW_AT_name "llvm"
          
          DW_TAG_structure
            DW_AT_name "SmallSet<int, 3>"
            DW_AT_decl_file "SmallSet.h"
          NULL
          
        NULL

      DW_TAG_compile_unit
        DW_AT_name "type_table_SmallVector"

0x300:  DW_TAG_namespace   <<<<<<<<<<<<<<<<<<<
          DW_AT_name "llvm"

          DW_TAG_structure
            DW_AT_name "SmallVector<const int *, 8>"
            DW_AT_decl_file "SmallVector.h"
          NULL
          
        NULL

     DW_TAG_compile_unit
        DW_AT_name "driver_cc1.cpp"

        DW_TAG_imported_module   <<<<<<<<<<<<<<<<<<<
          DW_AT_import 0x200           <<<<<<<<<<<<<<<<<<<
        DW_TAG_imported_module    <<<<<<<<<<<<<<<<<<<
          DW_AT_import 0x300            <<<<<<<<<<<<<<<<<<<

It looks really overweight. These DW_TAG_imported_module DIEs will noticeably increase the added size.

I am very excited about this feature and I will push for adoption of this within Facebook ASAP if this is viable, faster and also much better organized!

Thanks!

One thing which I do not understand yet - why is "splitting on a decl_file basis" so good? I understand the problem with a big "type_table": we do not want to parse a lot of DWARF if we lazily use only part of it. But if a less fragmented solution exists, would it be good to use it instead of "splitting on a decl_file basis"?
Or is "splitting on a decl_file basis" still better for some reason?

The one thing to think about is seeing if we can make this work with "--update" option to dsymutil (which reparses the DWARF and re-emits it and the accelerator tables. This --update option could then be used to post produce debug innfo before it gets permanently archived to disk, so the smaller and better organized the better! So if we can get this working we will have a huge set of code to run this through and test it with!

Yes, I think we could make it possible that "--update" (or another replacement option) would reorganize DWARF. Though it would be good if we did that as an additional change after this patch.

Does the patch in this diff already contain the changes where each type is in a separate compile unit that matches the DW_AT_decl_file? I would love to start running some tests; we have some dSYM files that are too large and are exceeding the 4GB boundary that I would like to test this patch on.

Herald added a project: Restricted Project. Mar 9 2022, 10:31 AM
avl added a comment.Mar 9 2022, 2:00 PM

Does the patch in this diff already contain the changes where each type is in a separate compile unit that matches the DW_AT_decl_file? I would love to start running some tests; we have some dSYM files that are too large and are exceeding the 4GB boundary that I would like to test this patch on.

No, it does not yet. Implementing this needs quite serious refactoring and would require some time (a month or two).

Another thing is that I am not sure that this is the right way to go. This change seems like it would introduce significant fragmentation. My initial estimation (1%) was based on the assumption that a small number of DIEs would be duplicated. The more I think about it, the more sources of duplication become clear. For example, it would be necessary to insert a lot of DW_TAG_imported_module DIEs (see https://reviews.llvm.org/D96035#3314256 for the example). Also, the implementation would have to be significantly complicated to add this kind of fragmentation. As a result, the size of the final DWARF would be increased and the performance results for the patch would be worse (especially for single-thread mode, which still seems to be important).

I currently do not understand why "splitting on a decl_file basis" is so good (so good that it is worth the degradation of the performance and size results).

As an alternative, I am thinking of splitting the type_table on a namespace basis. It would avoid the necessity of inserting multiple DW_TAG_imported_module DIEs. In clang, it would result in three compile units of 17MB, 16MB, and 7MB instead of one of 40MB. What do you think of that idea - splitting the current type table into smaller units on a namespace basis?

avl updated this revision to Diff 420072.Apr 3 2022, 12:19 PM

rebased. supported .debug_names.

avl edited the summary of this revision. (Show Details)Apr 3 2022, 12:21 PM

Does anyone care about single-thread execution time? I personally don't.

If we do not then things are easier :-)

I think @JDevlieghere and @aprantl probably care about it just from a regression perspective - but I could be wrong. (I thought they'd expressed some concerns about that in this/other reviews previously).

Apple folks ^ have you got a sense of if/how much of a single threaded regression you'd want to accept in this work, for some multithreaded gains?

If it's purely time, I don't care too much about single-threaded performance. When the lookForDiesToKeep algorithm was recursive we would occasionally run out of stack space because std::threads (underlying LLVM's thread pool) have a ridiculously small stack and no way to increase it. So we had a workaround in place for a while: if you specified -j1 it would avoid the thread pool altogether. But since I converted that algorithm to a work list/state machine that's not really a concern anymore. The thing I did express concerns about was an earlier revision of this change that caused non-determinism in multithreaded mode, which is not really negotiable, but IIUC that's no longer the case with the current patch.
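
For readers unfamiliar with the recursion-to-worklist conversion mentioned here, the following is a generic illustration only (it is not the actual lookForDiesToKeep code, and the Die type is made up): traversal depth moves from call-stack frames onto a heap-allocated vector, so deep DIE trees cannot overflow the small stacks of std::thread workers.

    #include <vector>

    struct Die {
      bool Keep = false;
      std::vector<Die *> Children;
    };

    void markDiesToKeep(Die *Root) {
      std::vector<Die *> Worklist{Root}; // explicit stack on the heap
      while (!Worklist.empty()) {
        Die *D = Worklist.back();
        Worklist.pop_back();
        D->Keep = true; // placeholder for the real liveness decision
        for (Die *Child : D->Children)
          Worklist.push_back(Child);
      }
    }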

I'm (still) very excited about this. I'm really happy that this lives behind a flag which addresses most of my original concerns. Making this the default in dsymutil would require a bunch of qualification but this approach allows me to do that gradually. One thing I suggested earlier I think was to run the lldb test suite with the new flag and see if that breaks anything. Did you ever get around to that? That should give us an initial level of confidence. Jim expressed some concerns about trying to squeeze everything in the same compilation unit. Greg, do you share that concern? Do we expect this will negatively impact lldb's performance? It would probably be worthwhile to do a few performance experiments here as well to make sure this doesn't regress debugging performance.

I'm also very excited that Sony is considering upstreaming llvm-dva (based on Paul's comment in D124082). In the past we've often stuck to binary compatibility because there's no great way to qualify big changes to the algorithm. But as long as all of this lives behind a flag that's not a dealbreaker.

If we're happy with the approach I think we should land this and iterate on smaller changes that will be much easier to review. One reason I've been so quiet in this review is that it takes a long time to page in everything.

avl added a comment.May 6 2022, 11:30 AM

I'm (still) very excited about this. I'm really happy that this lives behind a flag which addresses most of my original concerns. Making this the default in dsymutil would require a bunch of qualification but this approach allows me to do that gradually. One thing I suggested earlier I think was to run the lldb test suite with the new flag and see if that breaks anything. Did you ever get around to that?

yes, I did. This is the set of tests passed by this patch:

It passed llvm/test/tools/dsymutil lit tests.
It passed check-llvm, check-clang, check-lldb.
It passed llvm-dwarfdump --verify for all llvm&clang binaries.
It passed llvm-dwarfdump --types-compat-dump for all llvm&clang binaries.

Jim expressed some concerns about trying to squeeze everything in the same compilation unit. Greg, do you share that concern? Do we expect this will negatively impact lldb's performance? It would probably be worthwhile to do a few performance experiments here as well to make sure this doesn't regress debugging performance.

I will check the performance results and post them here. Though I think the performance degradation would not be significant.

Anyway, I think it might be a good idea to split the single type unit into several units. The question is - by which criteria? @clayborg suggested dividing it on a declaration-file basis, so that a separate unit would be created for every declaration file. I think that solution creates significant fragmentation. That fragmentation would lead to an increased resulting DWARF size and a more complicated implementation.

I would prefer to integrate the current version of the patch and add functionality for separate type units later. Doing things that way has the advantage of iterative development. If the patch were integrated earlier, it would be tested/probed by more people. Also, it might be improved faster by several people. Another thing is that we would have more time to create a good concept for separated type units.

If we're happy with the approach I think we should land this and iterate on smaller changes that will be much easier to review. One reason I've been so quiet in this review is that it takes a long time to page in everything.

If we're happy with the current approach I am ready to start creating the separate patches.
I suggest integrating this version of the patch and not waiting for the "several type units" implementation.
The "several type units" implementation might be integrated as an addition to this patch.

avl added a comment.May 9 2022, 10:12 AM

It would probably be worthwhile to do a few performance experiments here as well to make sure this doesn't regress debugging performance.

Please consider the run-time/memory results for lldb used with clang built with various dsymutil options:

                        |    time, sec  |  Memory, Gbytes | .debug_info, bytes |
------------------------|---------------|-----------------|--------------------|
clang-original-noodr    |     8.58      |     1.82        |     1461723785     |
------------------------|---------------|-----------------|--------------------|
clang-original-odr      |     8.41      |     1.76        |      484472706     |
------------------------|---------------|-----------------|--------------------|
clang-dlnext-odr        |     8.75      |     1.98        |      155118011     |   <<<< single type unit size is 40495379 bytes.
------------------------|---------------|-----------------|--------------------|

tested lldb version:

lldb --version
lldb version 15.0.0git (https://github.com/llvm/llvm-project.git revision d9cea8d3a8fff86672174780312674871729578c)
  clang revision d9cea8d3a8fff86672174780312674871729578c
  llvm revision d9cea8d3a8fff86672174780312674871729578c

lldb run command:

/usr/bin/time -l lldb clang --source lldb-commands

set of commands passed to lldb through --source option:

b main
r
n
n
n
fr v
exit

If the above set of commands is not very relevant for testing lldb performance, then I will retest with another set.

My main concern with creating one large unit for all types is the performance of the debugger when consuming this, but we don't need to fix this in the first iteration and can work on improving this later after we prove it is a problem. As long as the DWARF is in good shape and we are properly uniquing types in a much better way than before I am good.

Can we name the option something other than "--use-dlnext"? This option name won't mean much to users. Not sure what a good option name would be though but something that doesn't talk about internal classes that will be used would be better IMHO.

JDevlieghere accepted this revision.May 11 2022, 3:41 AM

Can we name the option something other than "--use-dlnext"? This option name won't mean much to users. Not sure what a good option name would be though but something that doesn't talk about internal classes that will be used would be better IMHO.

Agreed. How about a flag to specify the dwarf linker: --dwarf-linker={dsymutil,dwarfutil}. The respective tools would have the respective linkers as their default and we could add more options later (e.g. dwarfutil with separate type units).

With the flag renames this LGTM.

This revision is now accepted and ready to land.May 11 2022, 3:41 AM
avl added a comment.May 11 2022, 4:42 AM

Can we name the option something other than "--use-dlnext"? This option name won't mean much to users. Not sure what a good option name would be though but something that doesn't talk about internal classes that will be used would be better IMHO.

Agreed. How about a flag to specify the dwarf linker: --dwarf-linker={dsymutil,dwarfutil}. The respective tools would have the respective linkers as their default and we could add more options later (e.g. dwarfutil with separate type units).

With the flag renames this LGTM.

Probably, we would like to use both dwarf linkers with dsymutil and dwarfutil. Thus, we do not want to name the tools in the option. How about these variants:

--dwarf-linker={classic,mt}  (mt is multi-thread)

--dwarf-linker={classic,with-joined-types}

--dwarf-linker={classic,experimental}

?

Also, to check my understanding, we are not going to integrate this patch as a single huge change. Instead, I will separate it into a series of reviews and we are going to integrate them one by one, correct?

Probably, we would like to use both dwarf linkers with dsymutil and dwarfutil. Thus, we do not want to name the tools in the option. How about these variants:

--dwarf-linker={classic,mt}  (mt is multi-thread)

--dwarf-linker={classic,with-joined-types}

--dwarf-linker={classic,experimental}

How about:

--linker=apple
--linker=llvm

The "apple" would be the classic Apple linker with all of its compatability
The new "llvm" would be the new linker with the better type uniquing + other features that will get added. We would need these described in the option help text.

Any multi-threaded options should be separate options if they are not already covered by existing options.

Do we have the ability to change how types are emitted? Like, can we use the "classic" linker with the new type uniquing options ("with-joined-types" or "experimental")? I am trying to figure out the three options you specified above.

avl added a comment.May 11 2022, 10:32 PM

How about:

--linker=apple
--linker=llvm

The "apple" would be the classic Apple linker with all of its compatability
The new "llvm" would be the new linker with the better type uniquing + other features that will get added. We would need these described in the option help text.

These are fine to me.

Any multi-threaded options should be separate options if they are not already covered by existing options.

Do we have the ability to change how types are emitted? Like, can we use the "classic" linker with new type uniquing options ("with-joined-types" or "experimental")? I am trying to figure out the three options you specified above

No, we do not have the ability to change how types are emitted. The "classic"/"apple" linker generates it one way; the "experimental"/"llvm" linker generates it another way.

These variants: "mt", "with-joined-types", "experimental" are different suggestions for the name of new linker. There should be used only one of them.

"mt" - supposed to note that new linker utilizes multi-thread execution heavier.
"with-joined-types" - supposed to note that new linker uses other approach to type deduplication.
"experimental" - supposed to note that new linker is an experiment yet.

dreampiggy added a comment.EditedJul 27 2022, 4:24 AM

Hi.

I tried to pick up the code for testing the performance. However, I found that the new dwarflinker-next will crash on some specific code paths.

See crash stacktrace here:

I tried to debug it and found the reason:

During the analysis stage, NamesGenerator.assignName(CU, DieIdx) cannot assign a valid name for some unhandled DIE entry tags, like

+ DW_TAG_thrown_type

This causes the final CU.TypeInfo.TypeName to be null, and a crash in TypePool.insert during the ODR stage.

Any solution for this? Should we just mark it as a bad type with the raw hex in the default case (currently not done)? Or add some more handled cases?

avl added a comment.EditedJul 27 2022, 6:05 AM

Hi.

I try to pick the code for testing the performance. However, I found the new dwarflinker-next, will crash at some specify code path

Hi @dreampiggy,

The patch is a bit outdated. The upstream code has been changed since I built it last time. I will rebase it.

See crash stacktrace here:

I try to debug and found the reason:

during analysis stage, NamesGenerator.assignName(CU, DieIdx) can not assign a valid name for some unhandled DIEEntry tag, like

+ DW_TAG_thrown_type

This will cause the final CU.TypeInfo.TypeName be null, and crash at the TypePool.insert during ODR stage

Any solution for this ? Should we just mark it as bad type with the raw hex with the default case (current not)? Or add some more handle cases ?

We need to handle more cases. DW_TAG_thrown_type is currently not handled by the new DWARFLinker (nor by the old DWARFLinker). It should be handled at least inside SyntheticTypeNameBuilder::analyzeDieForTypeName().
I will rebase and add handling of DW_TAG_thrown_type in a week.
Thanks for the analysis.

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But it seems the old DWARFLinker's DIECloner does not hit any assert or crash. It seems the new "artificial compile unit" based on type names will fail in this case?

avl added a comment.Jul 28 2022, 10:42 AM

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

Yes, it will fail. Thus, it is necessary to add (besides specific handling for DW_TAG_thrown_type) a default handler for unknown DW_TAG_* DIEs, so that the patch does not crash in such a case. Will do this with the next rebase.
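
A rough sketch of what such a default handler could look like (hypothetical names; this is not the actual NamesGenerator/SyntheticTypeNameBuilder code): instead of leaving the name unset for an unknown tag, fall back to a stable synthesized name so that TypePool.insert never sees a null name.

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Hypothetical fallback: produce a deterministic, non-empty name for a DIE
    // whose tag has no dedicated handling yet, instead of leaving it null.
    std::string fallbackTypeName(uint16_t Tag) {
      // e.g. an unhandled DW_TAG_thrown_type would get a name derived from its
      // raw tag value.
      char Buf[32];
      std::snprintf(Buf, sizeof(Buf), "unknown_tag_0x%x", Tag);
      return std::string(Buf);
    }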

In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

It should be as easy as classifying DW_TAG_thrown_type as a type DIE, right?

In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

So it sounds like there's a list of attributes that may be type attributes - is it possible to detect type references by the type of the referenced DIE instead of the attribute used to reference them? (or both?) or would DW_AT_sibling break that by causing a sibling attribute to be treated differently just because the sibling was a type?

avl added a comment.Jul 29 2022, 2:46 AM
In D96035#3685472, @avl wrote:

DW_TAG_thrown_type is currently not handled by new DWARFLinker(as well as by old DWARFLinker)

Yes. But seems the old DWARFLinker DIECloner does not hit any assert and crash. Seems the new "artificial compile unit" based on types name will fail on this case ?

yes, it will fail. Thus, it is necessary to add(other than adding specific handling for DW_TAG_thrown_type) default handler for unknown DW_TAG_* die. So that the patch do not crash in such a case. Will do this with nearest rebasing.

So it sounds like there's a list of attributes that may be type attributes - is it possible to detect type references by the type of the referenced DIE instead of the attribute used to reference them? (or both?) or would DW_AT_sibling break that by causing a sibling attribute to be treated differently just because the sibling was a type?

This exact case is not about how to detect whether an attribute references a type DIE or not.
(Though attributes referencing type DIEs are currently implemented as a predefined list: DW_AT_type, DW_AT_containing_type, DW_AT_specification, DW_AT_abstract_origin, DW_AT_import.)
This case is about how to decide which DIEs referencing type DIEs should be deduplicated.

f.e. The patch deduplicates this kind of DIE:

DW_TAG_pointer_type 
   DW_AT_type -> ref to type DIE.

To be able to deduplicate above DW_TAG_pointer_type, the patch creates a type descriptor
for the DIE DW_TAG_pointer_type. Any reference to the above pointer will be resolved to the
reference to this type descriptor.

There are some DIEs referencing type DIEs that should not be deduplicated,
e.g. DW_TAG_call_site or DW_TAG_call_site_parameter, since they are used in
subroutines that are not moved into the "artificial type unit".

Currently there is a list of such DIEs inside SyntheticTypeNameBuilder::analyzeDieForTypeName().
DIEs handled in analyzeDieForTypeName() will be assigned a name and maybe deduplicated later.
DW_TAG_thrown_type was missed there. Probably, instead of just a list of DIEs that might be deduplicated,
it is possible to create a more general rule, e.g. if some DIE is a child of another DIE that is already known
to be deduplicated then such a DIE should be considered a deduplication candidate as well. This rule would
cover the DW_TAG_thrown_type case.
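
As a rough illustration of that more general rule (the helper names and the tag list below are made up for the example and are not the SyntheticTypeNameBuilder API):

    #include "llvm/BinaryFormat/Dwarf.h"

    static bool isKnownTypeTag(llvm::dwarf::Tag T) {
      switch (T) {
      case llvm::dwarf::DW_TAG_structure_type:
      case llvm::dwarf::DW_TAG_class_type:
      case llvm::dwarf::DW_TAG_pointer_type:
      case llvm::dwarf::DW_TAG_subroutine_type:
      case llvm::dwarf::DW_TAG_typedef:
        return true; // deliberately abbreviated list
      default:
        return false;
      }
    }

    // A DIE becomes a deduplication candidate either because its tag is known,
    // or because its parent was already marked as a candidate.
    static bool isDeduplicationCandidate(llvm::dwarf::Tag Tag, bool ParentIsCandidate) {
      return ParentIsCandidate || isKnownTypeTag(Tag);
    }

Under such a rule, a DW_TAG_thrown_type child of an already-marked DW_TAG_subroutine_type would be picked up automatically, without having to extend a fixed tag list whenever a new tag shows up.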

Speaking of determining attributes that reference type DIEs: a solution which detects type references by the kind of the referenced DIE might be too expensive from a performance point of view. Resolving DIE references is quite an expensive operation.

avl added a comment.Jul 29 2022, 2:50 AM
It should be as easy as classifying DW_TAG_thrown_type as a type DIE, right?

Yes, we can classify DW_TAG_thrown_type as a deduplication candidate, and this concrete case should be fixed.

Additionally, we can extend the rule for deduplication candidates in the following way: all children of a known deduplication candidate should also be considered deduplication candidates. That avoids having to maintain a fixed list.

In D96035#3686890, @avl wrote:

Ah, OK - I'm mostly following. Thanks for helping me understand.

avl updated this revision to Diff 450799.Aug 8 2022, 7:23 AM

rebased, changed use-dlnext into --linker=[Apple,LLVM], added handling for DW_TAG_thrown_type, added handling for the default case.

Matt added a subscriber: Matt.Jun 5 2023, 12:23 PM