This is an archive of the discontinued LLVM Phabricator instance.

[DWARFLinkerParallel] Add limited functionality to DWARFLinkerParallel.
ClosedPublic

Authored by avl on Jun 19 2023, 5:21 AM.

Details

Summary

This patch is extracted from D96035. It adds support for the existing
DWARFLinker functionality. What is not supported yet:

  1. Type deduplication (--odr mode).
  2. Module deduplication.
  3. Generation of index tables.

Run-time performance and memory requirements for the clang binary with --num-threads 16:

----------------------------------------------------------------------------------
                                                      |  time, sec   |  mem, GB  |
----------------------------------------------------------------------------------
dsymutil --no-odr --accelerator None --linker llvm    |      44      |   18.0    |
----------------------------------------------------------------------------------
dsymutil --no-odr --accelerator None --linker apple   |     248      |   22.2    |
----------------------------------------------------------------------------------

Run-time performance and memory requirements for the clang binary with --num-threads 1:

----------------------------------------------------------------------------------
                                                      |  time, sec   |  mem, GB  |
----------------------------------------------------------------------------------
dsymutil --no-odr --accelerator None --linker llvm    |     242      |   17.2    |
----------------------------------------------------------------------------------
dsymutil --no-odr --accelerator None --linker apple   |     260      |   19.4    |
----------------------------------------------------------------------------------

The overall linking process looks like this:

parallel_for_each (ObjectFile) {
  for_each (Compile Unit) {
    1. Load Clang modules.
  }

  parallel_for_each (Compile Unit) {
    1. Load input DWARF for the Compile Unit.
    2. Report warnings for Clang modules.
    3. Analyze live DIEs.
    4. Clone DIEs (generate output DIEs and the resulting DWARF tables).
       The result is a set of sections corresponding to the current compile unit.
    5. Clean up input and output DIEs.
  }

  Deallocate the loaded object file.
}

for_each (ObjectFile) {
  for_each (Compile Unit) {
    1. Set offsets for the Compile Unit's DWARF sections.
    2. Sort offsets/attributes/patches to get a predictable result.
    3. Patch size/offset fields.
    4. Generate index tables.
    5. Move the Compile Unit's DWARF sections into the resulting file.
  }
}

Every compile unit is processed separately, visited only once
(except when inter-CU references exist), and the data it used is freed
after the compile unit is processed. The resulting file is glued together
from the generated debug tables that correspond to the separate compile units.
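
A rough C++ sketch of these two phases, using llvm::parallelForEach with hypothetical placeholder types and stub helpers (not the actual DWARFLinkerParallel classes):

#include "llvm/Support/Parallel.h"
#include <vector>

// Placeholder types, only to show the shape of the pipeline.
struct CompileUnit {};
struct ObjectFile { std::vector<CompileUnit> CUs; };

void loadClangModules(CompileUnit &) { /* load referenced Clang modules */ }
void linkCU(CompileUnit &) { /* load DWARF, analyze liveness, clone DIEs */ }
void assembleCU(CompileUnit &) { /* patch offsets, move sections to output */ }

void link(std::vector<ObjectFile> &Objects) {
  // Phase 1: per-object and per-CU work runs in parallel; each CU produces
  // its own set of output sections and then releases its input data.
  llvm::parallelForEach(Objects, [](ObjectFile &Obj) {
    for (CompileUnit &CU : Obj.CUs)
      loadClangModules(CU);
    llvm::parallelForEach(Obj.CUs, [](CompileUnit &CU) { linkCU(CU); });
    // The loaded object file would be deallocated here.
  });

  // Phase 2: sequentially fix up offsets and glue the per-CU sections
  // into the resulting file.
  for (ObjectFile &Obj : Objects)
    for (CompileUnit &CU : Obj.CUs)
      assembleCU(CU);
}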

Handling inter-CU references: inter-CU references are hard to process
in a single pass. For example, if CU1 references CU100 and CU100 references
CU1, we cannot finish handling CU1 until we have finished CU100.
Thus we either need to load all CUs into memory, or load some CUs several
times. This implementation loads inter-connected CUs into memory during the
first pass and processes them during a second pass.
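
A minimal sketch of that two-pass idea, with illustrative names only (not the patch's real classes):

#include <vector>

struct CompileUnit {
  bool HasInterCURefs = false; // discovered while analyzing the CU's DIEs
};

void cloneAndEmit(CompileUnit &) { /* clone DIEs, emit per-CU sections */ }

void processObjectFile(std::vector<CompileUnit> &CUs) {
  std::vector<CompileUnit *> Deferred;
  // First pass: self-contained CUs are handled and can free their input data
  // immediately; inter-connected CUs stay loaded.
  for (CompileUnit &CU : CUs) {
    if (CU.HasInterCURefs) {
      Deferred.push_back(&CU);
      continue;
    }
    cloneAndEmit(CU);
  }
  // Second pass: every CU a deferred unit points at is now in memory, so the
  // cross-CU references can be resolved.
  for (CompileUnit *CU : Deferred)
    cloneAndEmit(*CU);
}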

Changes from the current implementation (these make DWARFLinkerParallel binary-incompatible with the current DWARFLinker):

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). An abbreviation table does not take much space, so it looks cheap to have separate abbreviation tables. Later, this might be optimized a bit (by removing duplicate abbreviation tables).

b) .debug_frame: already generated CIE records are not reused between object files.

c) Location expressions containing type references use a fixed-length ULEB128 encoding, as they need to be patched after the body of the referenced type is generated.
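
For illustration, a fixed-width ULEB128 field can be produced with LLVM's existing LEB128 helpers; a minimal sketch (not code from this patch) of why such a field can be patched in place:

#include "llvm/Support/LEB128.h"
#include "llvm/Support/raw_ostream.h"
#include <cstdint>

int main() {
  uint8_t Buf[4];
  // Emit the reference as a ULEB128 padded to a fixed 4-byte width, using a
  // placeholder value of 0 for now.
  unsigned Size = llvm::encodeULEB128(/*Value=*/0, Buf, /*PadTo=*/4);
  // Later, when the referenced DIE's offset is known, the same 4 bytes are
  // rewritten in place; the field length never changes, so nothing shifts.
  llvm::encodeULEB128(/*Value=*/0x1234, Buf, /*PadTo=*/4);
  llvm::outs() << "field width: " << Size << " bytes\n"; // always 4
  return 0;
}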

d) The liveness-tracking algorithm does not depend on the order of DW_TAG_imported_module nodes and in some cases keeps more DIEs.

Diff Detail

Event Timeline

avl created this revision. Jun 19 2023, 5:21 AM
Herald added a project: Restricted Project. Jun 19 2023, 5:21 AM
avl requested review of this revision. Jun 19 2023, 5:21 AM
avl edited the summary of this revision. Jun 19 2023, 5:23 AM
avl edited the summary of this revision. Jun 19 2023, 5:59 AM
avl edited the summary of this revision.
avl edited the summary of this revision. Jun 19 2023, 6:09 AM

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). An abbreviation table does not take much space, so it looks cheap to have separate abbreviation tables. Later, this might be optimized a bit (by removing duplicate abbreviation tables).

Can't we just let each compile unit generate its own abbrev table and then unique them into a single abbrev table when we write out the .debug_info data? That way compile units can stay fast in parallel execution, and since we need to emit the compile units serially when generating the .debug_info, we can just have a map that maps CU-specific abbrev codes to the global and final abbrev list. Then we end up with just one .debug_abbrev table in the end.
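
Roughly what I have in mind, as a sketch with made-up names (not code from this patch):

#include <cstdint>
#include <map>
#include <string>

// Each CU keeps its local abbrev codes; while CUs are emitted serially, every
// local abbreviation is looked up by its serialized declaration and assigned
// a stable global code, producing a single .debug_abbrev table.
class GlobalAbbrevTable {
  std::map<std::string, uint64_t> CodeBySerializedDecl;
  uint64_t NextCode = 1;

public:
  uint64_t getGlobalCode(const std::string &SerializedAbbrevDecl) {
    auto [It, Inserted] =
        CodeBySerializedDecl.try_emplace(SerializedAbbrevDecl, NextCode);
    if (Inserted)
      ++NextCode;
    return It->second; // equal declarations from different CUs share one code
  }
};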

avl added a comment. Jun 20 2023, 2:44 PM

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). An abbreviation table does not take much space, so it looks cheap to have separate abbreviation tables. Later, this might be optimized a bit (by removing duplicate abbreviation tables).

Can't we just let each compile unit generate its own abbrev table and then unique them into a single abbrev table when we write out the .debug_info data? That way compile units can stay fast in parallel execution, and since we need to emit the compile units serially when generating the .debug_info, we can just have a map that maps CU-specific abbrev codes to the global and final abbrev list. Then we end up with just one .debug_abbrev table in the end.

The problem with this solution is that when a local abbreviation number is replaced with a global number (which is in ULEB128 format), the DIE offsets change (because the size of the global number may differ from the size of the local number). That would break all sizes and references.
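
To illustrate (a stand-alone example, not code from the patch): the encoded width of a ULEB128 abbreviation code depends on its value, so renumbering changes DIE sizes and shifts every following offset:

#include "llvm/Support/LEB128.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // A local abbrev code of 10 fits in one byte, but a global code assigned
  // after merging many CU tables may need two or three bytes...
  llvm::outs() << llvm::getULEB128Size(10) << "\n";    // 1
  llvm::outs() << llvm::getULEB128Size(300) << "\n";   // 2
  llvm::outs() << llvm::getULEB128Size(20000) << "\n"; // 3
  // ...so every DIE after the renumbered one moves, invalidating the
  // precomputed sizes and reference offsets.
  return 0;
}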

Also, currently a CU is emitted immediately (when we write out the .debug_info data). The global abbreviation table could only be finalized at the very end, after all CUs are emitted. That means we would need to parse the already emitted CUs and update their abbreviation numbers, which would probably slow down execution.

The size of the abbreviation table for a clang binary with a 1.7 GB .debug_info: 12.5 KB for DWARFLinker and 7.2 MB for DWARFLinkerParallel, i.e. the new abbreviation table is approximately 0.5% of the overall .debug_info size. That looks relatively cheap. Still, any solution making .debug_abbrev shorter would certainly be good. We could probably experiment with fixed-length abbreviation numbers...

avl added a comment. Jun 20 2023, 2:56 PM

Another solution might be to reuse the abbreviation table of an already generated CU for the next CU. At any given moment the current CU would own an abbreviation table, but in the end a single abbreviation table could be shared between several CUs. There would still be several tables, but some of them would be shared, which would reduce the current size of .debug_abbrev.

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). An abbreviation table does not take much space, so it looks cheap to have separate abbreviation tables. Later, this might be optimized a bit (by removing duplicate abbreviation tables).

Can't we just let each compile unit generate its own abbrev table and then unique them into a single abbrev table when we write out the .debug_info data? That way compile units can stay fast in parallel execution, and since we need to emit the compile units serially when generating the .debug_info, we can just have a map that maps CU-specific abbrev codes to the global and final abbrev list. Then we end up with just one .debug_abbrev table in the end.

The problem with this solution is that when a local abbreviation number is replaced with a global number (which is in ULEB128 format), the DIE offsets change (because the size of the global number may differ from the size of the local number). That would break all sizes and references.

We make up DIEs to emit, but _as_ we emit the DIEs from a CU, can't we globalize the abbrevs from the local CU as they are being emitted? Each abbrev could cache its global value once it has been emitted from a CU. I am thinking that DIEs would have references to other DIEs using pointers when they are stored in intermediate form, but this might not be true, come to think of it. If references would need to be updated then this wouldn't work, as you mentioned.

Also, currently a CU is emitted immediately (when we write out the .debug_info data). The global abbreviation table could only be finalized at the very end, after all CUs are emitted. That means we would need to parse the already emitted CUs and update their abbreviation numbers, which would probably slow down execution.

Yeah, if DIE refs are done with hard-coded offsets up front then this wouldn't work. A DWARF generator I created in Python has references pointing to a DIE object, so that it can ask that object for its offset when a reference is needed; if the offset isn't available yet, it gets added to a fixup list for forward refs, and those get fixed up in a post-CU output phase. This is nice because I never have to worry about reference offsets as they "just work". I admit I am not as familiar with how the LLVM DWARFLinker does things.

The size of the abbreviation table for a clang binary with a 1.7 GB .debug_info: 12.5 KB for DWARFLinker and 7.2 MB for DWARFLinkerParallel, i.e. the new abbreviation table is approximately 0.5% of the overall .debug_info size. That looks relatively cheap. Still, any solution making .debug_abbrev shorter would certainly be good. We could probably experiment with fixed-length abbreviation numbers...

No real worry if the size isn't too big.

Another solution might be to reuse the abbreviation table of an already generated CU for the next CU. At any given moment the current CU would own an abbreviation table, but in the end a single abbreviation table could be shared between several CUs. There would still be several tables, but some of them would be shared, which would reduce the current size of .debug_abbrev.

I like this idea. Not needed for this patch, but sounds like a good idea.

avl added a comment. Jun 21 2023, 7:08 AM

a) No common abbreviation table. Each compile unit has its own abbreviation table. Generating a common abbreviation table slows down parallel execution (it is a resource that is accessed many times from many threads). An abbreviation table does not take much space, so it looks cheap to have separate abbreviation tables. Later, this might be optimized a bit (by removing duplicate abbreviation tables).

Can't we just let each compile unit generate its own abbrev table and then unique them into a single abbrev table when we write out the .debug_info data? That way compile units can stay fast in parallel execution, and since we need to emit the compile units serially when generating the .debug_info, we can just have a map that maps CU-specific abbrev codes to the global and final abbrev list. Then we end up with just one .debug_abbrev table in the end.

The problem with this solution is that when a local abbreviation number is replaced with a global number (which is in ULEB128 format), the DIE offsets change (because the size of the global number may differ from the size of the local number). That would break all sizes and references.

We make up DIEs to emit, but _as_ we emit the DIEs from a CU, can't we globalize the abbrevs from the local CU as they are being emitted?

We can, but after the abbrev numbers are globalized we would need to update all size and reference fields and rewrite the emitted DWARF.

Each abbrev could cache its global value once it has been emitted from a CU. I am thinking that DIEs would have references to other DIEs using pointers when they are stored in intermediate form, but this might not be true, come to think of it. If references would need to be updated then this wouldn't work, as you mentioned.

This patch uses the CodeGen DIE class. It does not use pointers for references/sizes/offsets; these are absolute values:

DIE::setOffset()   // the DIE offset is stored as an absolute value
DIE::setSize()     // the DIE size is stored as an absolute value
DIE::addValue(DIEAlloc, dwarf::Attribute(AttrSpec.Attr), dwarf::DW_FORM_ref_addr,
              DIEInteger(0x0))  // reference emitted as a placeholder, patched later

Also, currently a CU is emitted immediately (when we write out the .debug_info data). The global abbreviation table could only be finalized at the very end, after all CUs are emitted. That means we would need to parse the already emitted CUs and update their abbreviation numbers, which would probably slow down execution.

Yeah, if DIE refs are done with hard-coded offsets up front then this wouldn't work. A DWARF generator I created in Python has references pointing to a DIE object, so that it can ask that object for its offset when a reference is needed; if the offset isn't available yet, it gets added to a fixup list for forward refs, and those get fixed up in a post-CU output phase. This is nice because I never have to worry about reference offsets as they "just work". I admit I am not as familiar with how the LLVM DWARFLinker does things.

Yes, that is the case - DIEs are emitted before the global abbreviations are known. This patch does similar fixup handling for references (it writes the proper values later, when they are known). But that fixup works for fixed-size fields. To patch variable-size fields (like LEB128) it would be necessary to rewrite the already generated DWARF (we would need to shift data because of the changed field length).
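
For illustration, this is roughly what a fixed-size fixup looks like (made-up data structures, not the ones in the patch):

#include "llvm/Support/Endian.h"
#include <cstdint>
#include <vector>

struct Fixup {
  uint64_t Offset; // position of a 4-byte reference placeholder
};

int main() {
  std::vector<uint8_t> Section(16, 0); // stand-in for emitted .debug_info bytes
  std::vector<Fixup> Fixups;

  // While emitting: write a fixed-size placeholder and remember where it is.
  Fixups.push_back({8});
  llvm::support::endian::write32le(Section.data() + 8, 0);

  // Later, when the referenced DIE's final offset is known, overwrite the same
  // bytes in place; the field size never changes, so nothing has to shift.
  uint32_t FinalDieOffset = 0x1234; // hypothetical value
  for (const Fixup &F : Fixups)
    llvm::support::endian::write32le(Section.data() + F.Offset, FinalDieOffset);
  return 0;
}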

The size of the abbreviation table for a clang binary with a 1.7 GB .debug_info: 12.5 KB for DWARFLinker and 7.2 MB for DWARFLinkerParallel, i.e. the new abbreviation table is approximately 0.5% of the overall .debug_info size. That looks relatively cheap. Still, any solution making .debug_abbrev shorter would certainly be good. We could probably experiment with fixed-length abbreviation numbers...

No real worry if the size isn't too big.

avl added a comment (edited). Jun 21 2023, 7:18 AM

Another solution might be to reuse the abbreviation table of an already generated CU for the next CU. At any given moment the current CU would own an abbreviation table, but in the end a single abbreviation table could be shared between several CUs. There would still be several tables, but some of them would be shared, which would reduce the current size of .debug_abbrev.

I like this idea. Not needed for this patch, but sounds like a good idea.

Though there is some difficulty with this solution: the binary content would be non-deterministic, as it cannot be guaranteed which CUs share which abbreviation table.

Do we need the binary representation to be deterministic (byte-for-byte equal between runs)?

Perhaps we could use a special kind of dumping which would be deterministic, like the dumping from https://reviews.llvm.org/D124082. Even though the resulting files would not be byte-for-byte equal, the contents of the dump would be.

I don't have all of dsymutil paged in to do a really thorough/meaningful review of all the details. Even if I did I'm sure a bunch of stuff would slip through that would only be caught by real world qualification.

At a high level I think the approach makes sense. I've spent quite a bit of time thinking of alternatives and I can't think of anything more efficient given the constraints that we have (i.e. not being able to keep all the input DWARF concurrently in memory).

As for the implementation itself:

  • I like that you factored out things like the attribute cloning and I wonder if we could hoist this up and make it work with the non-parallel linker as well. I don't think we should go to extreme lengths to unify the two implementations, but even if we were to switch over to the new (parallel) implementation tomorrow, I expect the old one to be around for at least a few more years and I'd hate to have to maintain both.
  • I wonder if we could make the stages more explicit. If I'm reading the code right, the data (i.e. the DIE) and the logic (i.e. the live analysis) are shared in the same class and we have to maintain state. I would prefer a design where the two are decoupled and the data moves through the different stages. The way I'm imagining this is to have a class for every stage, possibly with its own shared context, and then have the DIEs move through them one after the other. It's fine for the DIE to know what stage we're in, but it would avoid things like init and reset iteration.

I left some inline comments here and there but I definitely need to do another pass.

llvm/lib/DWARFLinkerParallel/DIEAttributeCloner.h
47
67

Let's make these Doxygen style comments as well.

llvm/lib/DWARFLinkerParallel/DWARFLinkerCompileUnit.h
26

This should be singular: i.e. Stage.

96

How about clear or reset?

262

We need the reinterpret_cast because the OutDieOffsetArray stores uint64_ts instead of std::atomic<uint64_t>. Why can't we store the latter?

avl added a comment (edited). Jun 22 2023, 7:17 AM

I don't have all of dsymutil paged in to do a really thorough/meaningful review of all the details. Even if I did I'm sure a bunch of stuff would slip through that would only be caught by real world qualification.

At a high level I think the approach makes sense. I've spent quite a bit of time thinking of alternatives and I can't think of anything more efficient given the constraints that we have (i.e. not being able to keep all the input DWARF concurrently in memory).

As for the implementation itself:

  • I like that you factored out things like the attribute cloning and I wonder if we could hoist this up and make it work with the non-parallel linker as well. I don't think we should go to extreme lengths to unify the two implementations, but even if we were to switch over to the new (parallel) implementation tomorrow, I expect the old one to be around for at least a few more years and I'd hate to have to maintain both.

Yes, it would be good to have a single implementation, but I do not see a simple way to do it.
Actually, there are other places using similar code, e.g.:

llvm/unittests/DebugInfo/DWARF/DwarfGenerator.h
https://reviews.llvm.org/D130315

Thus, it would be useful to have some kind of DWARFLinkerBase library containing the general code for cloning, creating and emitting DIEs.
That library might be used by DWARFLinker, DWARFLinkerParallel, the DebugInfo unittests, and BOLT.
Though, I think there are a lot of details which would make implementing that library a difficult task.

Anyway, if we want to have a single implementation, then the best way would probably be to think about creating a DWARFLinkerBase library.
That library should have base implementations for compile units and for cloning, creating and emitting DIEs, and it should not depend on AsmPrinter.
What do you think?

  • I wonder if we could make the stages more explicit. If I'm reading the code right, the data (i.e. the DIE) and the logic (i.e. the live analysis) are shared in the same class and we have to maintain state. I would prefer a design where the two are decoupled and the data moves through the different stages. The way I'm imagining this is to have a class for every stage, possibly with its own shared context, and then have the DIEs move through them one after the other. It's fine for the DIE to know what stage we're in, but it would avoid things like init and reset iteration.

If I understood correctly, the current picture is quite close to that description:

The CompileUnit contains data shared between stages, and compile units know the current stage.
There is a separate class doing the liveness analysis: DependencyTracker.
This class uses data kept by the CompileUnit (this data is the shared context).
The result of the liveness analysis is used by other stages, which is why the compile unit keeps the data.

We need the "init" and "reset" functions because we need to erase data kept by the compile unit when we return to previous stages,
i.e. when we return to a previous stage we need to adapt the shared context.
Perhaps instead of "init" and "reset" we should name the methods "sync shared context with the current stage"?

llvm/lib/DWARFLinkerParallel/DWARFLinkerCompileUnit.h
262

std::atomic<> does not have a copy constructor or copy-assignment operator. Because of this it cannot be put into a SmallVector.
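
For illustration, a small stand-alone example of the constraint (not the patch's actual types):

#include <atomic>
#include <cstdint>
#include <vector>

int main() {
  // std::atomic is neither copyable nor movable, so a growable container
  // cannot relocate its elements:
  //   std::vector<std::atomic<uint64_t>> V;
  //   V.push_back(std::atomic<uint64_t>{0}); // error: deleted copy/move ctor
  //
  // Constructing the container with its final size up front works, because
  // no element is ever copied or moved afterwards:
  std::vector<std::atomic<uint64_t>> Offsets(16); // value-initialized to 0
  Offsets[3].fetch_add(42, std::memory_order_relaxed);
  return static_cast<int>(Offsets[3].load());
}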

avl updated this revision to Diff 533936. Jun 23 2023, 5:54 AM

rebased. addressed comments.

avl updated this revision to Diff 536448. Jun 30 2023, 3:00 PM

rebased.

avl updated this revision to Diff 538391. Jul 8 2023, 2:51 PM

rebased.

avl added a comment. Jul 17 2023, 7:10 AM

friendly ping.

avl added a comment. Jul 25 2023, 6:26 AM

friendly ping.

avl updated this revision to Diff 546089. Aug 1 2023, 9:12 AM

rebased.

avl added a comment. Aug 1 2023, 9:15 AM

@JDevlieghere @aprantl

Does this patch require any further changes?

The previous comment asks about refactoring the attribute cloning. Should we create some kind of DWARFLinkerBase library for this?

JDevlieghere accepted this revision. Aug 1 2023, 10:59 AM

@JDevlieghere @aprantl

Does this patch require any further changes?

The previous comment asks about refactoring the attribute cloning. Should we create some kind of DWARFLinkerBase library for this?

No major concerns that block landing this. I still feel it would be great to unify stuff but we can do that incrementally after the patch has landed.

This revision is now accepted and ready to land. Aug 1 2023, 10:59 AM
avl added a comment. Aug 15 2023, 7:24 AM

Thank you for the review!

avl updated this revision to Diff 550326. Aug 15 2023, 7:27 AM

rebased.

avl updated this revision to Diff 550792. Aug 16 2023, 9:49 AM

rebased.

JDevlieghere accepted this revision. Aug 16 2023, 12:55 PM
nikic added a subscriber: nikic. Aug 21 2023, 1:40 AM

Reverted due to test failures on s390x, see e.g. https://lab.llvm.org/buildbot/#/builders/94/builds/16146.

llvm/unittests/DWARFLinkerParallel/StringPoolTest.cpp