This patch is the implementation for removing obsolete debug info:
http://lists.llvm.org/pipermail/llvm-dev/2019-September/135473.html
It uses DWARFLinker which contains code refactored out from dsymutil.
Differential D74169
[WIP][LLD][ELF][DebugInfo] Remove obsolete debug info. avl on Feb 6 2020, 2:17 PM. Authored by Tokens
Details This patch is the implementation for removing obsolete debug info: http://lists.llvm.org/pipermail/llvm-dev/2019-September/135473.html It uses DWARFLinker which contains code refactored out from dsymutil.
Diff Detail
Unit Tests
Event TimelineThere are a very large number of changes, so older changes are hidden. Show Older Changes
Comment Actions
Hi, following is the approximate list of changes, which assumed to match with corresponding patches :
void DWARFUnit::extractDIEsIfNeeded(bool CUDieOnly) { if (Error e = tryExtractDIEsIfNeeded(CUDieOnly)) WithColor::error() << toString(std::move(e)); } Thus, it is necessary to have a possibility to replace ErrorHandler here.
for (InputFile *file : objectFiles) { std::unique_ptr<ObjectFile> ObjectFile = createELFObjectFile(file.mb); std::unique_ptr<DWARFContext> DWARFContext::create (ObjectFile, nullptr, ErrorHandler)); }
Comment Actions
Hi George, I follow the idea is that small patches are easier to review and easier to integrate. That concrete patch does not have a functionality but it includes DWARLinker library, creates option and creates placeholders for the functionality. Which looks like a small amount of things that could be reviewed. Creating full set of patches which could be any time applied to the upstream and to the each other is quite time consuming thing because of time for synchronization/merging/rebasing. As an alternative we could discuss main points of future patches[0]. And iteratively apply patches step by step. If that plan is not good - then OK, I would start to work on preparing full set of patches for that feature. [0] following is the approximate list of changes, which assumed to match with corresponding patches :
void DWARFUnit::extractDIEsIfNeeded(bool CUDieOnly) { if (Error e = tryExtractDIEsIfNeeded(CUDieOnly)) WithColor::error() << toString(std::move(e)); } Thus, it is necessary to have a possibility to replace ErrorHandler here.
for (InputFile *file : objectFiles) { std::unique_ptr<ObjectFile> ObjectFile = createELFObjectFile(file.mb); std::unique_ptr<DWARFContext> DWARFContext::create (ObjectFile, nullptr, ErrorHandler)); }
Comment Actions A few comments from me.
Comment Actions I do not think them needed to be in state "any time applied to the upstream". Perhaps we could do something like the following: create a single large patch with all of features which can be applied. And upload it somewhere here as [not for commit]. Comment Actions That is OK. The only thing is that I planned to integrate small fixes as soon as they are ready. So I do not have a full implementation at the moment. I will create a single large patch with all of the features sometime later as soon as it would be fully ready. Comment Actions I agree with
(One thing I plan to work on is .debug_names accelerator table, which is similar to but more power than .gdb_index.) On one side, I've heard thoughts that: (1) if we use -gsplit-dwarf, is .debug* garbage collection still valuable? (2) DWARFLinker can significantly increase link time. Can we move it to a post-link tool (dsymutil but for ELF)? I am reluctant to let individual patches in before we can see the benefits, because they could easily turn out to be insufficient and cause churns. There is still possibility that we find --gc-debuginfo not suitable on the linker side and may want to do it in a post-link tool. Comment Actions For users using -gsplit-dwarf, linker garbage collection provides zero value. But not everyone does, there are reasons to use one or the other in different situations.
Presumably that'd take significantly longer and use more memory and disk space overall - but, yes, it would get you a running executable sooner. (/somewhat/ like the tradeoff of using .dwo files directly, compared to linking them into a .dwp - dwo files are ready sooner, but take up more space, etc, but dwp takes time to make - good for archival purposes) But (2) I think is unclear to me - I'd like to know how much more time/memory it takes, if it does take more (it probably will take more) - given that it should produce significantly less output (& at Google at least, output is memory usage - since we write to a memory mapped file (though we use Split DWARF in most places where size/memory usage matters, so this solution isn't immediately relevant there). (& this also might be reusable/good foundation for DWARF aware linking in llvm-dwp, for what it's worth (again, not clear what the tradeoffs will be like in terms of memory/cpu overhead, output size savings, etc))
Comment Actions
also, as an theoretical alternative - probably we might want to implement -gsplit-dwarf so that it include only used part of debug info. i.e. that which referenced after garbage collection. As the result size of all .dwo files would be smaller.
I am preparing the full patch for that feature, so that it would be possible to estimate these values: how much more time/memory it takes. Comment Actions That's not possible - the .dwo files are produced during compilation time, before the linker has done any garbage collection. (currently even when building the dwp file from dwo files it's not possible to strip the dead debug info - since there's no feedback from the linker that reaches the dwp tool (dwp and linking might be done in parallel on different machines, for instance) - but it'd be possible to implement something more like dsymutil where the linker generates a side-file address mapping/details of the garbage collection and then that is fed into the dwp tool (this would delay running the dwp tool though)) Comment Actions
Right. That is not possible in current scheme. Though we could probably change the scheme(if it would be considered beneficial). One alternative is what you were talking about - dsymutil/dwp tool which would take linker report and .dwo files and produce optimized .dwo files. Another alternative is that .dwo files could be optimized during linking time using the same scheme like current proposal. This proposal would take unoptimized .debug* sections of binary and generate optimized contents of destination .debug* sections during linking. Generally, It is possible to take unoptimized .dwo files and make them optimized. I do not suggest to do it now. That is just possible alternative which could probably be considered if current proposal would show good results. Another thing, is that this patch could be useful for -gsplit-dwarf even without optimizing .dwo files. 0x00000014: DW_TAG_skeleton_unit DW_AT_dwo_name ("split1.dwo") DW_AT_low_pc (0x0000000000000000) <<< not assigned because of deleted code DW_AT_high_pc (0x0000000000000008) DW_AT_addr_base (0x00000008) 0x00000034: DW_TAG_skeleton_unit DW_AT_dwo_name ("split2.dwo") DW_AT_low_pc (0x0000000000000000) <<< not assigned because of deleted code DW_AT_high_pc (0x0000000000000008) DW_AT_addr_base (0x00000008) Comment Actions I created dsymutil when at Apple, which has now been taken over by llvm-dsymutil. So a few thoughts on general DWARF linking and optimization: As dsymutil links the DWARF it sees, it does indeed have a linker map of all final addresses for any addresses in each .o file. dsymutil will go through all DIEs within a CU and mark the ones that should be kept by starting with the compile unit DIE, then each DIE referred to the by the final link map (DW_TAG_subprogram, global or static DW_TAG_variable), that had a mapping in the linked executable will be marked as needing to be kept along with all child DIEs and all DIEs that these child DIEs refer to. You will run into many DIEs over and over, but as soon as you run into one that is already marked as needing to be kept, you can avoid recursing into that DIE again, so it never really gets out of hand to do this pass. One thing to be careful of: enumerations. A lot of code uses enums, but they don't use the enum type when that enumeration defines bit masks, so you might end up with code that doesn't refer to an enum type with any variables (they just use a signed or unsigned integer as the variable type), but it is sometimes still a good idea to keep these types around. We had issue where we were dead stripping these types and the enum weren't available in the debug info. The other thing that dsymutil does to reduce file size, and the most valuable part of size reduction, is it uses ODR to only emit types one time when possible. If you have 100 .o files with std::string inside of them, you can often emit a single copy and change any future .o files references to their local copy, to use a DW_FORM_ref_addr form to refer to the first definition that was emitted. Enabling this allows us to trim debug info size by 75%. Dsymutil uses a method where it figures out the decl file and line for a type and the DW_AT_byte_size, along with a few other checks, to quickly unique types in C++ code. There is some trickiness here where compiler generated constructors and destructors may appear in one type and not in another, so there is some code to make sure there is a DIE within the type refer to, but overall this works really well and is pretty cheap to implement. Line tables are fully relinked as well and all line entries for addresses that were stripped are removed. This can be tricky because often times the line tables are emitted as one large stream with no DW_LNE_end_sequence opcode at the end of each function address range. New DW_LNE_end_sequence opcodes need to be inserted when optimizing line tables so that you don't end up with an invalid line table where addresses don't increase within a sequence. Adding the new DW_LNE_end_sequence opcode terminates the sequence and allows the next line entries to change addresses to a lower address than the previous DW_LNE_end_sequence line entry. dsymutil adds accelerator tables, in DWARF5 format or the Apple format, that are calculated by traversing the new DWARF once it is created. This means you don't end up with 100 accelerators tables concatenated together like linkers do now. I am keenly interested in this feature so we can prove that having the accelerator tables helps debugging speeds. Proving the accelerator tables help will help us get toolchains to integrate the accelerator tables into their builds. Having it start off as a separate post production step is a great way to help people figure out if they want to adopt them. Having all of this re-linking code in a DWARFLinker library llvm/lib/DebugInfo/DWARF would be great. We have many tools at Facebook that are relinking DWARF, and not doing a very good job. If we had a library that would allow people to re-link their DWARF, either linking .o file DWARF into a main executable, or having an optimization pass that runs on fully linked libraries that might do function outlining or block re-use optimizations, then we can ensure the final DWARF is in great shape. One last point is to have a flag that enabled DWARF verification. This can be run before the GC operation to ensure the incoming DWARF is in good shape and again after any modifications, to ensure the DWARF is still valid. Comment Actions lld already supports gdb_index & there's plans to add debug_names support in the near future (~6 months).
I believe all the functionality in llvm-dsymutil related to DWARF linking that implements the features you described above (llvm-dsymutil was initially implemented with bug-for-bug compatibility with Apple dsymutil for safety and since then I think a few of those bugs have been fixed/improvements have been made) & that code is now/already in the llvm/lib/DWARFLinker library so it can be reused in lld for this effort (apologies if I'm preaching to the choir/being redundant here) & any others you might have in mind. No doubt there's probably some refactoring to do to improve its reusability as the integration with lld & other situations is done. Comment Actions Does it support adding the GDB index after the fact on an already linked binary, or just at normal link time? I want to add a GSYM option to LLD at some point too. Can replace -gline-tables-only in many cases.
That is great to hear! Sorry I haven't looked around at that code in a while, but glad to hear it is all ready to go. Comment Actions I don't think there's been any effort to add it as a post-processing step, no. I believe it can be built either from parsed DWARF or from debug_gnu_pubnames - in either case, at link time. I mean, it'd be probably quite practical to extend it to be able to be done as a standalone/post-processing step. For the sake of object and linked executable size, I suspect we might end up trying to move the indexing out of the object files/linked executable - perhaps putting the names either in a third file, or in the .dwo files and generating an index from them instead of from the objects. Not sure. Some further experimentation expected there. Comment Actions is debug_gnu_pubnames a new section that is now complete? The old .debug_pub* sections only contained globally visible things which means all debuggers would ignore this section. If accelerator tables were made from incomplete sources that will mean debuggers will have to question if .debug_names is complete or not.
That would be great. As accelerator tables evolve and get updated, it would be nice to have a path to update older accelerator tables to newer more complete versions.
That is fine too. We make a dwarf linking tool that can link all of the .dwo files back into a normal DWARF file like with dsymutil and .o files on mac. That allows people to get the benefits of the .dwo performance improvements for the edit/compile /debug cycle and then link it into a minimal and optimized DWARF, using contribution from this patch and others to come that optimize the DWARF, for permanent storage on build servers. Comment Actions Note that such post-processing is probably more suited to llvm-objcopy than lld, though of course they can use the same underlying LLVM library. Comment Actions It's "complete" in the sense that it's whatever GDB wanted and created a contract that GCC (& Clang/LLVM as best we know/have done so far) agree too, so GDB can rely on things being present/not present in some sense. That's why debug_gnu_pubnames existed - because of, as you say, the underspecified nature of (or unreliable implementation of) debug_pubnames. I believe debug_names semantics have been more strictly specified regarding what names will be there and what names will not.
That's already done - DWP ( https://gcc.gnu.org/wiki/DebugFissionDWP ). But for an interactive scenario over a distributed filesystem shipping the whole linked DWP file back to the user for access by the debugger is a bit unfortunate/slow. Good for archival purposes, as you say - but it's still a bit of an open question as to what to do for interactive users if object file/linked executable size is a major concern - either generating a third file with just the index parts (so building the full index can be done in parallel/distributed separately from the link) or doing an index-only link of the .dwo files (not super great - still means a bottleneck of shipping all the .dwo files to one build machine, etc)...
Comment Actions Perfect, that is what I was hoping for.
Yes, it indeed fully specifies what is needed. Thanks for the info.
For the linking, I was more thinking of converting the DWP format back into vanilla DWARF with no DWO/DWP artifacts (skeleton DIEs etc) or any other redirections (string table offsets, address tables, etc). Not sure if this would reduce the DWARF size of all of the DWO/DWP instrstructure was all removed? Comment Actions
Ah, sorry, I see what you mean. Well, the offsets systems in some cases save total size even once linked (eg: having a single 64 bit address with two short uleb indexes compared to two 64 bit addresses - even once the binary is linked and there's no extra bytes of relocation records for that address/addresses) & these features are no longer Split DWARF specific - Clang/LLVM generates almost (entirely, maybe?) all these constructs the same way with split or unsplit DWARF when emitting DWARFv5. Might save a bit - certainly something someone could experiment with as improvements to DWARFLinker - for dsymutil, lld, and maybe dwp in the future. (one of the cheap ones that'd be neat to see would be debug_rnglists optimization - wouldn't need to parse DWARF DIEs, etc, would be an entirely local optimization - may be useful for PROPELLER (~= -fbasic-block-sections) - because of the indirection in debug_rnglists (though the rnglist header's offset table) you could totally rewrite range lists to coalesce any contiguous ranges that end up together & rewriting the offsets in the offset table - though this optimization would benefit from being in both lld and dsymutil I think - not applicable to dwp unless address information is passed from the linker like dsymutil) Comment Actions While working on "remove obsolete debug info in lld" patch --- a/lld/ELF/CMakeLists.txt +++ b/lld/ELF/CMakeLists.txt @@ -59,6 +59,10 @@ add_lld_library(lldELF Option Passes Support + AllTargetsCodeGens + AllTargetsDescs + AllTargetsInfos + AsmPrinter Would it be OK to add these libraries to lld code ? If that is not OK, then alternative solution would be to refactor Comment Actions Honestly I just do not know the answer, but it sounds to me that we can add them, and do Comment Actions Honestly I don't know, either. I echo most of comments on https://reviews.llvm.org/D74169#1880922 For -gsplit-dwarf or -gsplit-dwarf=single users, garbage collection of debug info can provide hardly any benefit. For non -gsplit-dwarf use cases, I am concerned whether the pros (avoid an intermediate file/avoid a post-processing step) overweight cons (time/memory/lld size). The link time/memory may increase significantly. I wonder whether the peak memory usage can increase to 2x given our experience https://reviews.llvm.org/D74773#1920751 I am also curious to learn how much + AllTargetsCodeGens + AllTargetsDescs + AllTargetsInfos + AsmPrinter can increase the size of lld executable in a -DBUILD_SHARED_LIBS=off build. Doing a full DWARF rewriting can save intermediate files. But if the memory/time trade-off is not that appealing, a standalone tool (which can have more configuration options) may be more suitable. That said, we may need some prototype to verify these theories. How much do we sacrifice in link speed/peak memory usage/lld executable size in return for X% of .debug_* size saving? Comment Actions
Currently, it has zero effect because of --gc-sections. I asked about using AsmPrinter in lld to check whether there is a fundamental limitation for that.
I am currently working on that prototype.
When the prototype is ready, it would be possible to evaluate memory/time results. I did another prototype(not based on dsymutil, though with a very similar idea) D67469. Comment Actions This version is a full patch implementing "remove obsolete debug info from lld" functionality. Limitations: it does not support DWARF5, modules, -fdebug-types-section, type units, .debug_types, multiple .debug_info sections. Testing: it passes check-all lit testing. It is able to compile llvm code base. llvm-dwarfdump --verify shows zero errors for clang built by lld from this patch. Comment Actions Not so much reason to support type units ("-fdebug-types-section/type units/.debug_types section/multiple .debug_info sections" all essentially describe the same thing). Since type units add object size overhead to reduced linked size in the face of a DWARF-agnostic/ignorant linker. If you have a DWARF aware linker, you'd probably avoid using type units so you could have smaller object files too. Does it support multipe compile units in a single section in an object file (ala LTO/ThinLTO)?
Do you have memory usage/compile time numbers for linking various programs with/without this feature? Comment Actions Size/Performance/Memory results: lld binary size with this patch(SHARED_LIBS=OFF) - 49925kb. Performance results for building whole LLVM: -------------------------------------------------------------------------- | | CL options | LLVM bin size | LLVM build time | -------------------------------------------------------------------------- | A | (default set of options*) | 100.0%(18.0GB)| 100.0%(75m15s) | | | | | | | B | A +gc-debuginfo(**) +NoOdr=false| 54%(9,7GB) | 124.0%(93m33s) | | | | | | | C | A +gc-debuginfo +NoOdr=false | 43%(7,7GB) | 107.0%(80m26s) | | | +fdebug-types-section(***) | | | -------------------------------------------------------------------------- Performance results of lld for building single clang binary: --------------------------------------------------------------------------------- | | CL options | Clang size | link time | run-time memory | --------------------------------------------------------------------------------- | A | (default set of options*) | 100.0%(1,4GB)| 100%(49s) | (100%)9.3GB | | | | | | | | B | A +gc-debuginfo +NoOdr=false | 56.0%(0.78GB)| 296%(145s)| (153%)14.2GB | | | | | | | | C | A +gc-debuginfo +NoOdr=true | 86.0%(1.2GB) | 380%(186s)| (161%)15GB | | | | | | | | D | A +gc-debuginfo +NoOdr=false | 42.0%(0.59GB)| 133%(65s) | (113%)10.5GB | | | +fdebug-types-section(***) | | | | --------------------------------------------------------------------------------- (*) (**)
(***) fdebug-types-section is not supported by dsymutil/DWARFLinker. (****) HW configuration: OS Ubuntu 18.04 Comment Actions
It seems to me type units are useful for the current scheme. They increase the size of type reference(8 bytes vs 4 bytes) and introduce type unit header, but they avoid duplication. All references inside the object file point to the single type description - thus, object files have only one type's description instead of multiple copies. The effect of de-duplication is higher than the introduced increase. DWARF aware linker does not reduce the size of object files. It reduces the size of binary, whereas type units allow decreasing object file. I think DWARF aware linker could help to get even more size reduction. If instead of type units, there would be another structure - something like "bag of DWARF"(bag of types). You mentioned this idea in http://lists.llvm.org/pipermail/llvm-dev/2019-September/135262.html. "bag of DWARF" would allow to have single type description in object file for type and to have 4-bytes references. DWARF aware linker could merge "bag of DWARF" from various object files. (similar to the current process when linker merges comdat sections. But joining "bag of DWARF" is a more complicated process than selecting one section from the set of comdat sections)
it should but I did not check it yet. I would check and post results here. Comment Actions To the best of my knowledge (as the person who implemented type unit support in LLVM... ) type units never reduce the size of object files - they only increase it. There is no type duplication, so far as I know, in LLVM's object files even without type units - one definition of the type is emitted and all the rest of the DWARF references that type. With type units that type is just the skeleton (DW_AT_declaration and DW_AT_signature), without type units that type is the full definition, without the indirection to the type unit. Are there particular examples of type duplication within a single object file you have in mind? I'd be super curious to see them - might be some bugs that need fixing.
Bag of DWARF doesn't seem like it'd help in a situation where you know you're going to use a DWARF-aware linker (just put all the DWARF in the CU as normal & know the DWARF-aware linker will introduce cross-CU references as needed as things are deduplicated). The idea for Bag of DWARF was with a DWARF agnostic linker you could reference more than one DIE from in a type unit (or a compile unit) to reduce a bunch of the duplication/overhead that type units introduce (eg: when defining a member function of a type unit, you currently have to duplicate the member function's declaration in the CU that references the TU, because you can't refer to that subprogram declaration DIE in the TU - only the type itself).
Comment Actions Nice! Another thing I have been seeing a lot of is when functions are coalesced and many .o files have the same function that gets linked to the same location. We end up with many duplicate DWARF DIEs in the final executable since the linker will just link all of the DWARF for each function over and over (.debug_info and .debug_line content). I found this when making GSYM files from the DWARF and noticing that I had hundreds of duplicate entries that matched exactly. This would be a great follow up patch for --gc-debuginfo Comment Actions
Sorry, sort of duplicating the other thread of conversation we're having on this review - I don't believe type units would help dsymutil/DWARFLinker - nor that this data would be representative of potential improvements. Could you describe how these numbers would be representative of potential improvements? What benefit would type units provide to DWARF-aware linking? Comment Actions Given this is built on top of the DWARFLinker used for llvm-dsymutil which, I believe, already strips out those duplicate descriptions - hopefully that's already included in the functionality under review? But good to check/ask - @avl does this patch already do the good things for duplicate (or dropped) function sections? (eg: "inline void f1() { } void f2() { f1(); }" + "inline void f1() { } void f2(); int main() { f1(); f2(); }" compiled and linked at -O0 the DWARF would usually contain two descriptions of 'f1', but ideally with a DWARF aware linker it'd contain only one description of 'f1'. Similarly: "void f1() { } void f2() { }" + "void f2(); int main() { f2(); }" compiled and linked with -ffunction-sections -Wl,-gc-sections would normally contain a DWARF description of "f1" but no code for f1, and a DWARF aware linker would strip out the DWARF description of 'f1' here) Comment Actions .debug_types could be much more efficient if we stop duplicating the decl context just so we can refer to a type. Right now if you have a variable, it will point to a CU local type, which duplicates the containing namespaces and class/struct/union decl contexts. So the overhead of .debug_types is a bit higher. It would be great if we could just refer directly to the type from .debug_types in the DW_TAG_variable and other DIEs that have DW_AT_type attributes. Also, .debug_types is only there for struct, unions, classes and a few other types. When you have a typedef like "std::string::size_type", or a typedef inside of the std::string, it may or may not appear in the .debug_types definition, but no one can point to those types because the way .debug_types is currently defined, only one type can be referred to within a type unit. The results is that any code wanting to use of one these types must duplicate the entire decl context (namespace "std", class "basic_string<>", typedef "size_type") just so a variable can point to the DWARF. I mention this because I think letting --gc-debuginfo do its thing with ODR type uniquing leaves the DWARF in a better state: type info for one class isn't strewn across a bunch of .o files, or having large swaths of the type duplicated all over the place. So if we have type uniquing, I would keep .debug_types out of it. We could move to make .debug_types more useful. Right now a type unit has a single type signature that people can refer to and a single offset within the type unit for the type that people will extract. If we got a DWARF specification change in, we could make it so a single type unit has N signatures and N offsets to types within the type unit. This would allow people to directly reference contained types that are not class, structs or unions, like typedefs. But that requires a DWARF spec change.
Are object files not actually bigger when using .debug_types? All the duplicated decl context junk in the DWARF is not efficient. I would be surprised if type units make .o files smaller. There is nothing to unique within a single .o file and the type units should only make things bigger.
Right now the ODR type uniquing does this, but it doesn't split the types out into their own compile units. The first time you see a type, you emit it and remember it. All other uses of this type and up using a DW_FORM_ref_addr which is 32 bits to refer to the first emitted type. So the ODR already does this. What it doesn't do is split types out into unique compile units based on their DW_AT_decl_file. If we did, you can slowly combine all DWARF that was defined in header files into DW_TAG_compile_unit DIEs with only types from those files. Then the DWARF would be really clean. The main issue with this approach is that it means you have to make all DWARF in memory to that things can lazily be added to each of these bags (compile units with types in them) since you might end up adding more types to each compile unit. With the current ODR support we can parse the .o files one at a time, and refer to things we have already emitted easily and we don't need to keep everything around so we can add things to the "bags". The downside of the dsymutil ODR technique, is many times you types that have implicitly added methods (default constructor, assignment operators, move constructors) that are only present if the user used them in their source file. So you might end up with many definitions of "class A", but some have the copy constructor in the DWARF (it was created by the compiler for types where the compiler can create them) and others that don't. So we sometimes have to emit the type multiple times when you have a DW_TAG_subprogram that refers to these implicitly created functions. But overall the approach works well for how easy it is to implement. I am all for creating compile units and adding all of the types declared in that file to them and having the DWARF be even cleaner though! Comment Actions dsymutil's ODR does unique the types and only emit one copy. See my comment on implicit methods as one thing that can cause multiple copies of a type to be emitted. Comment Actions
right. It generates one copy: $cat test1.cpp inline void f1() { } void f2(); int main() { f1(); f2(); } $cat test2.cpp inline void f1() { } void f2() { f1(); } $ clang++ test1.cpp test2.cpp -O0 -g -ffunction-sections -fuse-ld=lld -Wl,--gc-sections; llvm-dwarfdump -a a.out | grep "\"f1\"" DW_AT_name ("f1") DW_AT_name ("f1") 0x00000098: "f1" $ clang++ test1.cpp test2.cpp -O0 -g -ffunction-sections -fuse-ld=lld -Wl,--gc-sections,--gc-debuginfo; llvm-dwarfdump -a a.out | grep "\"f1\"" DW_AT_name ("f1") 0x0000008b: "f1" Comment Actions
Sorry. I need to stop writing letters in the late evening. Mainly I am talking about two things: 1 -fdebug-types-section allows processing debug info faster. dsymutil/DWARFLinker performance heavily depends on the size of input DWARF. 296%(145s) Resulting DWARF is accurate: no type duplications, no extra bytes needed for type units. But it's processing time: 145s. Time to parse in case -fdebug-types-section for clang is: 133%(65s) This difference is because not necessary to parse type information. So separating type info from the whole DWARF allows to noticeable speedup the debug info linking. 2 Something like "bag of dies" could avoid increasing size of data(created by type units). This idea is not for the current state. This is a possible DWARF format change. The linker processes comdat sections with type units quickly because it does not parse contents. To reduce size of type units it is possible to implement "bag of dies" idea. DWARFLinker could be used for merging "bag of dies" instead of lld linker. It looks like this solution would allow having a minimal size and fast processing.
right, DWARF-aware linker doing de-duplication using ODR does not need "Bag of DWARF". It seems that Bag of DWARF used with DWARF agnostic linker would have a problem of type duplication.
yes. that is exactly what I am talking about. using type hash with elements of that type unit(bag of dies) would allow to speedup types de-duplication. Comment Actions Rather than replying point-by-point, since I think the conversation has got a bit jumbled up - and maybe it's all best left to a separate thread than in this review... but: The initial idea for bag-of-dwarf was to allow a type unit to expose more than one DIE (so CUs could reference nested types, member functions, parameters, etc). Also potentially to let CUs export entities that are known to have one home & that other DWARF might want to reference (this is a bit more of a stretch - but you could imagine a type with a strong vtable, no point putting it in a type unit, so you put it in the CU - but you could, if you wanted, still expose the type and its member functions as referenceable DIEs that could be referenced from other CUs rather than by having to redeclare (though not define) the entity in that other CU - inter-object-file-DIE references). I see your point somewhat about how type units could integrate with DWARF-aware linking, but perhaps only in a very specific way - At least as LLVM emits type units, the types should always be identical (the type unit should never contain implicit special members or member function template instantiations, etc) & the variable component of a type (those implicit special members, etc) only ever appear on the declaration of the type that references to the type unit. If you knew that was the kind of input you were getting, you could rely on every type unit being identical (well, with GCC you can rely on each type unit being identical, but the same ODR type might be in multiple distinct type units because it uses exact hashing of the type unit contents, and the contents varies with implicit special members, etc)... Anyway, if you could rely on the type units for a given ODR type being identical or at least equivalent (no implicit special members, etc) - you could just use the first one you saw, drop the rest, and any new members you see can just be attached to a type declaration the way Clang does them today... hmm - I guess that'd work with GCC's output too, just ignoring any duplicate type units even though they aren't quite equivalent - they'd be no worse than Clang's type units. (Clang's never contain any of the special members, GCC's contain the used one - so if your type unit ended up being an arbitrarily chosen one from GCC, it wouldn't be any worse than CLang's - it'd just have some, but not all, the special members, member function templates, etc) Hmm, I guess no reason you couldn't do this without type units though - maybe - depending on how debuggers handle it. Maybe they aren't good with member function declarations in type declarations if those type declarations aren't referencing type units.... eg: existing debug info with type units can look something like this: DW_TAG_compile_unit DW_TAG_structure_type // DIE: t1.1 DW_AT_name "t1" DW_AT_signature 0x42 DW_AT_declaration true DW_TAG_variable DW_AT_type // -> DIE t1.1 DW_TAG_compile_unit DW_TAG_structure_type // DIE: t1.2 DW_AT_name "t1" DW_AT_signature 0x42 DW_AT_declaration true DW_TAG_subprogram // DIE: t1::t1 DW_AT_name "t1" DW_AT_declaration DW_AT_artificial true DW_TAG_subprogram DW_AT_specification // -> DIE t1::t1 // etc... DW_TAG_variable DW_AT_type // -> DIE t1.2 type unit, signature 0x42, type offset -> DIE t1.tu DW_TAG_type_unit DW_TAG_structure_type // DIE: t1.tu DW_AT_name "t1" // line/file/column/size/etc... // let's say there are no explicit members, for the sake of simplicity Then, without type units you could emit something very similar DW_TAG_compile_unit DW_TAG_structure_type // DIE: t1.1 DW_AT_name "t1" // line/file/column/size/etc... DW_TAG_variable DW_AT_type // -> DIE t1.1 DW_TAG_compile_unit DW_TAG_structure_type // DIE: t1.2 DW_AT_name "t1" DW_AT_signature 0x42 DW_AT_declaration true DW_TAG_subprogram // DIE: t1::t1 DW_AT_name "t1" DW_AT_declaration DW_AT_artificial true DW_TAG_subprogram DW_AT_specification // -> DIE t1::t1 // etc... DW_TAG_variable DW_AT_type // -> DIE t1.2 ie: don't go back and try to make t1.1 complete/merge all the contents, just tack things on to CU-local declarations whenever you come across a new member that has a definition in this translation unit. (ie: when the linker keeps the code for t1::t1 alive, then keep the subprogram definition alive, which keeps the subprogram declaration alive, which keeps the t1.2 declaration alive - but otherwise skip it and refer to t1.1 directly (or, in a first approximation, just always keep the declaration - but I guess dsymutil already has the smarts to strip the local type definition out entirely (does it use a declaration to reduce encoding length if there's lots of references to a type? Probably not, I guess)) This sort of complex design discussion probably merits some in person or video chat discussions or at least lengthy design discussion on llvm-dev. Comment Actions One drive-by request/question: have you done any measurements of the impact on debuggers? In other words, does it improve/make worse the time it takes them to load the debug data? Comment Actions
Agreed. I think it would be better to dicuss it through email and extend it to llvm-dev later. I would start that email thread. Comment Actions
no, I have not done yet. I would do and post results here. Comment Actions I've put a few first general comments and suggestions inline. side note: I've noticed you have a build time 75m15s, what looks to me too much for your configuration. Do you know about "-DLLVM_OPTIMIZED_TABLEGEN=1"? It should give an instant significant boost.
Comment Actions Few more comments. They are mostly to give a feedback about the code style, I have not look deep into how it works yet.
Comment Actions addressed comments: added test for -flto case. I did not change isDebug bit yet. would make it with separate review. Comment Actions
loading times for debuggers: no gc-debuginfo case: gc-debuginfo case: Overall result: gc-debuginfo makes loading time shorter and reduces memory usage for debuggers, Comment Actions Might be worth comparing with -gdb-index too (though not sure if gdb-index will be immediately compatible with the DWARF aware linking, so a difficult comparison - but even comparing "gdb-index without DWARF-aware linking V DWARF-aware linking without gdb-index", etc) Comment Actions
here are the results for clang binary: ------------------------------------------------------------- | Options | Run-time, sec | Used Memory, kb | Size | ------------------------------------------------------------- | | 35.58 | 9403596 | 1,4G | ------------------------------------------------------------- | gc-debuginfo | 109.15 | 14275444 | 783M | ------------------------------------------------------------- | gc-debuginfo | 130.63 | 14394700 | 797M | | gdb-index | | | | ------------------------------------------------------------- | gdb-index | 52.88 | 9524976 | 1.5G | ------------------------------------------------------------- I did not check whether gdb-index is compatible with DWARF aware linking output. There probably exists the room for optimization of gdb-index(related to the strategy of DWARF context creation). Comment Actions Ah, sorry, I meant specifically related to the gdb startup time point you made earlier about DWARF-aware linking improving startup time. It'd be useful to measure startup time with a gdb-index for comparison. (I'm guessing the time in the above chart is linking time, rather than GDB startup time)
Comment Actions
right, these are linking times. here is start-up times: ------------------------------------------------------------------- Debugger | Options | load-time, sec | Used Memory, kb | ------------------------------------------------------------------- | gdb 8.1 | | 47,79 | 2168172 | ------------------------------------------------------------------- | gdb 8.1 | gdb-index | 3.04 | 555408 | ------------------------------------------------------------------- | gdb 8.1 | gc-debuginfo | 44,8 | 1740876 | ------------------------------------------------------------------- | gdb 8.1 | gdb-index | 3.7 | 555340 | | | gc-debuginfo | | | ------------------------------------------------------------------- | lldb 8 | | 6,29 | 1511052 | ------------------------------------------------------------------- | lldb 8 | gdb-index | 6,22 | 1523516 | ------------------------------------------------------------------- | lldb 8 | gc-debuginfo | 3,32 | 843600 | ------------------------------------------------------------------- | lldb 8 | gdb-index | 3.6 | 858532 | | | gc-debuginfo | | | ------------------------------------------------------------------- though gdb-index+gc-debuginfo times could be meaningless. Comment Actions Thanks! (yeah, the gdb-index + lldb times aren't relevant either, lldb uses a different kind of accelerated access (one that's only supported on the MachO debug info distribution model currently - but I believe someone's working on/plans to work on DWARFv5 debug_names support)) Comment Actions
dsymutil supports DWARFv5 debug_names table generation. So if --gc-debuginfo specified then debug_names would be generated. This patch has a minor bug with it. I would update it. Comment Actions added test for flto case.(which shows that -flto=thin does not work currently). Comment Actions This implementation does not work for -flto=thin case. That is the same problem like in D54747: https://reviews.llvm.org/D54747#1503720. Comment Actions Re-measured size/performance results: A: --function-sections --gc-sections LLVM code base: -------------------------------------------------------------- | Options | build time | bin size | lib size | -------------------------------------------------------------- | A | 66min(100%) | 18.0G(100%) | 16.0G(100.0%) | -------------------------------------------------------------- | B | 81min(123%) | 9.7G( 54%) | 13.0G( 81.0%) | -------------------------------------------------------------- | C | 58min( 87%) | 12.0G( 67%) | 15.0G( 93.5%) | -------------------------------------------------------------- | D | 58min( 87%) | 12.0G( 67%) | 8.7G( 54.0%) | -------------------------------------------------------------- | E | 68min(103%) | 7.6G( 42%) | 14.0G( 87.5%) | -------------------------------------------------------------- Clang binary: -------------------------------------------------------------- | Options | size | link time | used memory | -------------------------------------------------------------- | A | 1.50G(100%) | 26.7sec(100%) | 9191MB(100%) | -------------------------------------------------------------- | B | 0.76G( 50%) | 107.4sec(402%) | 13361MB(145%) | -------------------------------------------------------------- | C | 0.82G( 54%) | 18.3sec( 68%) | 8300MB( 90%) | -------------------------------------------------------------- | D | 0.96G( 64%) | 7.6sec( 28%) | 4255MB( 46%) | -------------------------------------------------------------- | E | 0.58G( 45%) | 58.5sec(219%) | 10317MB(112%) | -------------------------------------------------------------- lldb loading time: --------------------------------------------- | Options | time | used memory | --------------------------------------------- | A | 6.32sec(100%) | 1475MB(100%) | --------------------------------------------- | B | 3.49sec( 55%) | 825MB( 56%) | --------------------------------------------- | C | 3.65sec( 58%) | 875MB( 59%) | --------------------------------------------- | D | 4.12sec( 65%) | 1017MB( 69%) | --------------------------------------------- | E | 3.95sec( 63%) | 628MB( 43%) | --------------------------------------------- Comment Actions Just an excited observer, but wanted to drop some statistics of my own, in case they're helpful. I have a Rust project (Materialize) that compiles down into a ~800MB (!) binary by default. The weight is almost entirely debug info; .text is less than 5% of its size. This is evidently an artifact of how Rust codegen units are much, much larger than your average C/C++ codegen units and so the toolchain relies heavily on the linker for dead code elimination. Of course that dead code elimination doesn't currently apply to the debug info. This diff is exactly what's needed to solve the problem, as far as I can tell. (There's an upstream issue about this that led me here: https://github.com/rust-lang/rust/issues/56068.) Without --gc-debuginfo: # Link time: 10.486s benesch@gibbon:~/materialize$ bloaty target/release/materialized -n 10 FILE SIZE VM SIZE -------------- -------------- 24.5% 194Mi 0.0% 0 .debug_info 24.1% 191Mi 0.0% 0 .debug_loc 13.8% 109Mi 0.0% 0 .debug_pubtypes 10.1% 79.9Mi 0.0% 0 .debug_pubnames 8.8% 70.0Mi 0.0% 0 .debug_str 8.3% 66.3Mi 0.0% 0 .debug_ranges 4.4% 35.3Mi 0.0% 0 .debug_line 3.1% 24.8Mi 66.3% 24.8Mi .text 1.8% 14.4Mi 25.1% 9.39Mi [41 Others] 0.6% 4.79Mi 0.0% 0 .strtab 0.4% 3.22Mi 8.6% 3.22Mi .eh_frame 100.0% 793Mi 100.0% 37.4Mi TOTAL With --gc-debuginfo: # Link time: 249.855s benesch@gibbon:~/materialize$ bloaty target/release/materialized -n 10 FILE SIZE VM SIZE -------------- -------------- 37.9% 115Mi 0.0% 0 .debug_info 17.9% 54.7Mi 0.0% 0 .debug_str 12.0% 36.7Mi 0.0% 0 .debug_ranges 11.2% 34.4Mi 0.0% 0 .debug_loc 8.1% 24.8Mi 66.3% 24.8Mi .text 6.6% 20.1Mi 0.0% 0 .debug_line 2.7% 8.15Mi 19.3% 7.23Mi [35 Others] 1.6% 4.79Mi 0.0% 0 .strtab 1.1% 3.22Mi 8.6% 3.22Mi .eh_frame 1.0% 2.99Mi 0.0% 0 .symtab 0.0% 0 5.8% 2.16Mi .bss 100.0% 305Mi 100.0% 37.4Mi TOTAL So hugely successful in reducing the size of the debug info! The new binary is 40% of the size of the original. Some of that savings is from omitting debug_pubtypes and debug_pubnames, which is easy enough to strip without this patch, but there is also a substantial reduction in debug_info and debug_loc—a much larger reduction than dbz is able to provide. Unfortunately there is also a massive blowup in link times, from 10s to 250s. That's a 25x increase, much larger than the link time increase for Clang described above, which was about 4x. Comment Actions Out of curiosity is there any work in the direction of fragmenting these modules/objects further to reduce this & other negative side effects of that situation?
Comment Actions
I'm afraid I have no idea. Nothing I'm aware of, but I'm not a rustc dev. Perhaps @rocallahan knows. I'll ask on the Rust Zulip, too. Comment Actions @benesch Thank you for the interest in this patch and the statistics data.
Excuse me. It looks like my reported statistic is not very accurate. I would re-measure it and re-submit. /usr/bin/time -f "%E %M" bin/clang-11 /usr/bin/time -f "%E %M" bin/clang-11 --gc-debuginfo i.e. it is about 6.5x, not 25x. I would also check the numbers for the updated patch and for your materialize project. Probably, it might be quite close since .text section takes 7.5% of entire binary in clang.
The reason why debug_pubtypes and debug_pubnames were stripped, is that the current patch uses already existed dsymutil code, which does not generate debug_pubtypes and debug_pubnames. This patch changes debug info. Thus it removes other debug info tables, which it does not support(since they most probably would contain no relevant data). Supporting all these tables(as well as DWARF5) is a future task for the DWARFLinker. Comment Actions Re-measured size/performance results once more. These results are for 12-cores machine. Actually, the more cores the bigger difference between --gc-debuginfo/--no-gc-debuginfo cases. A: --function-sections --gc-sections LLVM code base: -------------------------------------------------------------- | Options | build time | bin size | lib size | -------------------------------------------------------------- | A | 54min(100%) | 19.0G(100%) | 15.0G(100.0%) | -------------------------------------------------------------- | B | 65min(120%) | 9.7G( 51%) | 12.0G( 80.0%) | -------------------------------------------------------------- | C | 53min( 98%) | 12.0G( 63%) | 15.0G(100.0%) | -------------------------------------------------------------- | D | 52min( 96%) | 12.0G( 63%) | 8.2G( 55.0%) | -------------------------------------------------------------- | E | 64min(118%) | 5.3G( 28%) | 12.0G( 80.0%) | -------------------------------------------------------------- | F | 57min(105%) | 7.6G( 40%) | 14.0G( 93.0%) | -------------------------------------------------------------- Clang binary: ------------------------------------------------------------- | Options | size | link time | used memory | ------------------------------------------------------------- | A | 1.50G(100%) | 9sec(100%) | 9307MB(100%) | ------------------------------------------------------------- | B | 0.76G( 50%) | 72sec(800%) | 15055MB(161%) | ------------------------------------------------------------- | C | 0.82G( 54%) | 8sec( 89%) | 8402MB( 90%) | ------------------------------------------------------------- | D | 0.96G( 64%) | 6sec( 67%) | 4273MB( 46%) | ------------------------------------------------------------- | E | 0.43G( 29%) | 80sec(889%) | 15000MB(161%) | ------------------------------------------------------------- | F | 0.58G( 39%) | 30sec(334%) | 10890MB(117%) | ------------------------------------------------------------- lldb loading time: -------------------------------------------- | Options | time | used memory | -------------------------------------------- | A | 6.4sec(100%) | 1495MB(100%) | -------------------------------------------- | B | 4.0sec( 63%) | 826MB( 55%) | -------------------------------------------- | C | 3.7sec( 58%) | 877MB( 59%) | -------------------------------------------- | D | 4.3sec( 67%) | 1023MB( 69%) | -------------------------------------------- | E | 2.1sec( 33%) | 478MB( 32%) | -------------------------------------------- | F | 2.7sec( 42%) | 631MB( 43%) | -------------------------------------------- Comment Actions @ruiu @grimar @MaskRay @dblaikie @clayborg What do you think about that patch ? Whether it is generally OK to start integration? If it is OK then I would divide the patch to smaller patches and start to integrate. I think it is worth integrating: it has a small impact on existing lld code(most implementation is in separate library DWARFLinker. Only lld/ELF/DWARF.* and lld/ELF/LLDDwarfLinker.* are noticeably affected). --gc-debuginfo is off by default so it would not affect most of the users. Only those who explicitly specified --gc-debuginfo would use that functionality. The patch noticeably reduces the size of debug info. It also resolves"overlapping address ranges" problem. Upcoming optimizations could minimize the performance reduction. Comment Actions @ayermolo Hi Alexander, I rebased on latest but did not check whether current size/performance numbers match with reported previously.
|
clang-format: please reformat the code