This is an archive of the discontinued LLVM Phabricator instance.

[DWARF] Create subprogram's DIE in the unit specified by its DISubprogram
AbandonedPublic

Authored by jmorse on Jan 19 2021, 9:12 AM.

Details

Summary

This is a fix for PR48790 [0]. Over in D70350, subprogram DIEs were permitted to be shared between CUs. However, as documented in the PR the creation of a subprogram DIE can be triggered early, from other CUs. The subprogram definition is then placed in that other CU incorrectly, breaking internal CU references attached to the subprogram later by things expecting it to be in the correct CU.

To fix this, I've added an extra test to DwarfUnit::getOrCreateSubprogramDIE that examines the "unit" field of the DISubprogram being created. If it isn't the current CU, the correct one is looked up and creation passed to that CU. The test also accepts being placed in the skeleton CU in split DWARF, happily there are a few tests looking for that.

This should (TM) prevent subprogram DIEs being placed in the wrong unit, but I have two worries:

  • I believe all subprograms definitions get units added to them, but could a declaration DISubprogram passed to DwarfUnit::getOrCreateSubprogramDIE with no unit lead to the same problem?
  • We assume that all call-sites attempting to create a subprogram DIE are prepared to handle a cross-CU reference (which is what this patch will start producing), which as we've seen in [0], isn't necessarily true.

I've also made DwarfCompileUnits lookable-uppable with a DICompileUnit by publishing getOrCreateDwarfCompileUnit. I don't think there's any other way around creating a DIE in a different unit without doing this.

Test added is the reproducer from [0], gets run through llvm-dwarfdump -verify and has its output examined to ensure "eee" is in the CU for 2.cpp.

[0] https://bugs.llvm.org/show_bug.cgi?id=48790

Diff Detail

Event Timeline

jmorse created this revision.Jan 19 2021, 9:12 AM
jmorse requested review of this revision.Jan 19 2021, 9:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 19 2021, 9:12 AM

(seems there is perhaps some nearby pretty problematic issues I'm just writing down here because I saw it: Generally there are two kinds of DISubprograms: definition DISubprograms are "distinct" and referenced/kept alive from an llvm::Function's subprogram attachment. Declaration DISubprograms are not distinct and referenced from the member list of a DICompositeType, for instance. That's the simple case - if IR Linking occurs, one copy of an llvm::Function definition is retained, along with its referenced DISubprogram, simple enough. But it seems we are keeping definition DISubprograms alive through a few other means (or declaration DISubprograms, perhaps) - such as from call site metadata, using declarations, and the scope chains for types like lambdas. These DISubprograms aren't properly deduplicated because they can't necessarily piggyback on the llvm::Function deduplication: They are referenced from multiple parts of the debug info metadata - this can lead to buggy/weird DWARF like this:

$ cat a.cpp 
namespace ns {
inline void f1() {
}
}
void f3() {
  ns::f1();
}
$ cat b.cpp
namespace ns {
inline void f1() {
}
}
void f2() { ns::f1(); }
using ns::f1;
void f3();
int main() {
  f3();
  f2();
}
$ ~/dev/llvm/build/default/bin/clang++ -fuse-ld=lld -g -flto a.cpp b.cpp && llvm-dwarfdump-tot a.out
a.out:  file format elf64-x86-64

.debug_info contents:
0x00000000: Compile Unit: length = 0x0000006e, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x00000072)

0x0000000b: DW_TAG_compile_unit
              DW_AT_producer    ("clang version 12.0.0 (git@github.com:llvm/llvm-project.git 05f846b97e222815b36efa815cbc67802c3e283c)")
              DW_AT_language    (DW_LANG_C_plus_plus_14)
              DW_AT_name        ("b.cpp")
              DW_AT_stmt_list   (0x00000000)
              DW_AT_comp_dir    ("/usr/local/google/home/blaikie/dev/scratch")
              DW_AT_low_pc      (0x0000000000000000)
              DW_AT_ranges      (0x00000000
                 [0x00000000002017c0, 0x00000000002017cb)
                 [0x00000000002017d0, 0x00000000002017e2))

0x0000002a:   DW_TAG_namespace
                DW_AT_name      ("ns")

0x0000002f:     DW_TAG_subprogram

0x00000030:     NULL

0x00000031:   DW_TAG_imported_declaration
                DW_AT_decl_file ("/usr/local/google/home/blaikie/dev/scratch/b.cpp")
                DW_AT_decl_line (6)
                DW_AT_import    (0x0000002f)

0x00000038:   DW_TAG_subprogram
                DW_AT_low_pc    (0x00000000002017c0)
                DW_AT_high_pc   (0x00000000002017cb)
                DW_AT_frame_base        (DW_OP_reg6 RBP)
                DW_AT_linkage_name      ("_Z2f2v")
                DW_AT_name      ("f2")
                DW_AT_decl_file ("/usr/local/google/home/blaikie/dev/scratch/b.cpp")
                DW_AT_decl_line (5)
                DW_AT_external  (true)

0x00000051:   DW_TAG_subprogram
                DW_AT_low_pc    (0x00000000002017d0)
                DW_AT_high_pc   (0x00000000002017e2)
                DW_AT_frame_base        (DW_OP_reg6 RBP)
                DW_AT_name      ("main")
                DW_AT_decl_file ("/usr/local/google/home/blaikie/dev/scratch/b.cpp")
                DW_AT_decl_line (8)
                DW_AT_type      (0x0000006a "int")
                DW_AT_external  (true)

0x0000006a:   DW_TAG_base_type
                DW_AT_name      ("int")
                DW_AT_encoding  (DW_ATE_signed)
                DW_AT_byte_size (0x04)

0x00000071:   NULL
0x00000072: Compile Unit: length = 0x0000005f, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x000000d5)

0x0000007d: DW_TAG_compile_unit
              DW_AT_producer    ("clang version 12.0.0 (git@github.com:llvm/llvm-project.git 05f846b97e222815b36efa815cbc67802c3e283c)")
              DW_AT_language    (DW_LANG_C_plus_plus_14)
              DW_AT_name        ("a.cpp")
              DW_AT_stmt_list   (0x00000059)
              DW_AT_comp_dir    ("/usr/local/google/home/blaikie/dev/scratch")
              DW_AT_low_pc      (0x0000000000000000)
              DW_AT_ranges      (0x00000030
                 [0x00000000002017a0, 0x00000000002017ab)
                 [0x00000000002017b0, 0x00000000002017b6))

0x0000009c:   DW_TAG_subprogram
                DW_AT_low_pc    (0x00000000002017a0)
                DW_AT_high_pc   (0x00000000002017ab)
                DW_AT_frame_base        (DW_OP_reg6 RBP)
                DW_AT_linkage_name      ("_Z2f3v")
                DW_AT_name      ("f3")
                DW_AT_decl_file ("/usr/local/google/home/blaikie/dev/scratch/a.cpp")
                DW_AT_decl_line (5)
                DW_AT_external  (true)

0x000000b5:   DW_TAG_namespace
                DW_AT_name      ("ns")

0x000000ba:     DW_TAG_subprogram
                  DW_AT_low_pc  (0x00000000002017b0)
                  DW_AT_high_pc (0x00000000002017b6)
                  DW_AT_frame_base      (DW_OP_reg6 RBP)
                  DW_AT_linkage_name    ("_ZN2ns2f1Ev")
                  DW_AT_name    ("f1")
                  DW_AT_decl_file       ("/usr/local/google/home/blaikie/dev/scratch/a.cpp")
                  DW_AT_decl_line       (2)
                  DW_AT_external        (true)

0x000000d3:     NULL

0x000000d4:   NULL
$ cat ab.ll | grep f1
$_ZN2ns2f1Ev = comdat any
  call void @_ZN2ns2f1Ev(), !dbg !18
define linkonce_odr dso_local void @_ZN2ns2f1Ev() #1 comdat !dbg !20 {
  call void @_ZN2ns2f1Ev(), !dbg !23
!7 = distinct !DISubprogram(name: "f1", linkageName: "_ZN2ns2f1Ev", scope: !8, file: !4, line: 2, type: !9, scopeLine: 2, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !3, retainedNodes: !2)
!20 = distinct !DISubprogram(name: "f1", linkageName: "_ZN2ns2f1Ev", scope: !8, file: !1, line: 2, type: !9, scopeLine: 2, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0, retainedNodes: !2)

Which is a bit problematic. Almost feels like the solution to that is to have the debug info metadata refer (weakly) to the llvm::Function instead of the distinct DISubprogram... 

I would expect this should show up for call site debug info too, which might catch more interest from the folks who've been implementing that more recently. Oh, maybe it doesn't show up there - precisely because that debug info does essentially refer to the llvm::Function (more specifically call site debug info isn't carried in the IR at all (except by adding DISubprograms to llvm::Function declarations, which isn't otherwise used/needed) - it's created at the backend).

Might be able to tickle something like this with the scoping issue you've got here - even with this bug fixed, if you had the sort of cross-CU referencing I've created in this example)
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
1083–1092

This seems a bit more complicated than it needs to be (does it need to check skeleton DIEs, etc?)

The code I'd expect would be something like DwarfDebug's constructSubprogramDefinitionDIE - perhaps updating the caller to call constructSubprogramDefinitionDIE rather than modifying getOrCreateSubprogramDIE would be suitable.

(I realize this leaves open the possibility that other callers might need the same treatment - an assertion here that DD->getOrCreateDwarfCompileUnit(SP) == this might be good to catch those & see if there's a more systemic issue to be addressed?)

llvm/test/DebugInfo/X86/subprogram-across-cus.ll
7–29

Would be good to simplify this down a bit as I did in the bug - it doesn't need a lambda (a named local type will do - without a need for any members, etc) or a while loop, or the extra 'xxx' type, 'ccc' can be a static member to avoid the need to describe a parameter.

jmorse updated this revision to Diff 318150.Jan 21 2021, 4:03 AM

Replace regression test with David's smaller test

jmorse updated this revision to Diff 318196.Jan 21 2021, 7:26 AM

Change interception of Subprogram creation to call back out to DwarfDebug instead of looking up CUs within DwarfUnit.

Responding to inline comment that's now sort-of-orphaned: calling back out to constructSubprogramDefinitionDIE sounds good, I've added that to DwarfUnit::getOrCreateContextDIE with a comment.

IMHO, getOrCreateContextDIE is a large part of the problem here, callers use it to create DIEs without actually caring what kind of DIE it is, and so they're not able to make decisions about what CU the DIE should be created in. In the call stack [0] where the fault occurs, it's the earliest point that we can identify we're going to create a Subprogram, and so might need to change unit.

(The skeleton faff was because the DebugInfo/X86/fission-inline.ll test failed otherwise, looks like constructSubprogramDefinitionDIE takes care of that now).

[0] https://gist.github.com/jmorse/bb283cd69b351a8ad568f26a50c5553c

llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
863

Changed this to call getOrCreateContextDIE so that it'll switch on whether the SP is a definition or a declaration.

dblaikie added inline comments.Jan 21 2021, 11:27 AM
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
863

Hmm, this doesn't make sense to me - my expectation would be that definitions never appear in the elements list of a class/structure type DI metadata. (I'd expect a definition DISubprogram to be attached to an llvm::Function, and for that DISubprogram to refer to a declaration DISubprogram in the structure type's elements/member list)

Do you have an example where there are definitions in the elements list?

(at least, even if this is valid - it doesn't seem to be covered by the test?)

llvm/test/DebugInfo/X86/subprogram-across-cus.ll
31

Probably good to include the compile commands here (though handy to have the bug reference for extra context).

jmorse updated this revision to Diff 318504.Jan 22 2021, 6:08 AM

Remove needless redirect of subprogram creation; add original compile commands to test case.

jmorse marked 4 inline comments as done.Jan 22 2021, 6:09 AM
jmorse added inline comments.
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
863

No example, I added this to be conservative, I was confident all the other call sites would be safe against definition/declaration differences except this. Now removed!

dblaikie added inline comments.Jan 22 2021, 12:20 PM
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
525–530

Are these ever declarations/non-definitions? (I can't think of a place where we'd create a declaration as a scope) Perhaps this could just assert that the SP is a definition?

jmorse marked 2 inline comments as done.Jan 26 2021, 11:04 AM
jmorse added inline comments.
llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp
525–530

Not to my knowledge -- again, this isn't my comfort zone. I stuck an assertion in and ran a stage2 RelWithDebInfo build and it didn't fire, that'd good assurance IMO.

jmorse updated this revision to Diff 319355.Jan 26 2021, 11:05 AM
jmorse marked an inline comment as done.

We don't create declaration DISubprograms as scopes, so don't handle it in getOrCreateContextDIE.

dblaikie accepted this revision.Jan 26 2021, 3:24 PM

Looks good - thanks!

This revision is now accepted and ready to land.Jan 26 2021, 3:24 PM
This revision was automatically updated to reflect the committed changes.
jmorse updated this revision to Diff 321115.Feb 3 2021, 9:21 AM

Re-opening this as D95622 reverted it, because using type units led to various crashes and assertions. My revision to this patch just disables the new behaviour (created subprograms in their preferred unit) when using type units.

Here's a reduced reproducer that crashes for me with clang -O3 -g -mllvm -generate-type-units -c.

enum a {};
inline a b() {
  auto c = [] {};
}
void d(int) { b; }

I believe the type for lambda "c" is placed in a type unit. The type unit creates the scope of the lambda (subprogram "b") , the DIE for which is placed in the normal compile unit with this patch applied. However, the type information added to the subprogram to complete the type unit comes from that type units bump allocator pool. The result is that when the type unit is disposed of, the subprogram in the normal compile unit has children in freed memory. This annoys asan, which reports a use-after-free.

Using the old behaviour for type units seems to work just fine, DwarfUnit::isShareableAcrossCUs already disallows subprogram sharing when type units are used, which my change to. I continue to worry that there's more hidden beneath the surface here :(

jmorse reopened this revision.Feb 3 2021, 9:21 AM
This revision is now accepted and ready to land.Feb 3 2021, 9:21 AM
jmorse requested review of this revision.Feb 3 2021, 9:21 AM
dblaikie accepted this revision.Feb 3 2021, 3:48 PM

Looks no worse than before at least.

Seems GCC doesn't include any scope information in the type units for function-local types. (though for namespace-scoped types GCC does include the namespaces in the type unit - but if the type is local to a namespace scoped function, none of the scoping is included in the type unit... ) - so might be worth changing our behavior to match tehre too (because the current behavior of a subprogram with no name, etc, isn't really useful - doesn't seem to get in gdb's way, but doesn't add any value either )

This revision is now accepted and ready to land.Feb 3 2021, 3:48 PM
This revision was automatically updated to reflect the committed changes.
MaskRay added a subscriber: MaskRay.Feb 4 2021, 1:21 PM

Heads-up: this patch seems to cause malformed .debug_gnu_pubnames with -fsanitize=cfi -gsplit-dwarf with some internal targets. I will report back when I have done more investigation.

% llvm-dwarfdump -debug-gnu-pubnames spinlock.pic.o > /dev/null
error: name lookup table at offset 0x13d9 has a terminator at offset 0x13e7 before the expected end at 0x140a

Heads-up: this patch seems to cause malformed .debug_gnu_pubnames with -fsanitize=cfi -gsplit-dwarf with some internal targets. I will report back when I have done more investigation.

% llvm-dwarfdump -debug-gnu-pubnames spinlock.pic.o > /dev/null
error: name lookup table at offset 0x13d9 has a terminator at offset 0x13e7 before the expected end at 0x140a

Might've been more suitable to rollback the original patch rather than implementing an untested hack without a clear explanation for why the problem arises in the first place. (perhaps this papers over more fundamental issues, etc)

@jmorse - would you mind reverting both this patch and @MaskRay's workaround while a test case is being worked on? (sounds like @MaskRay is working on providing that, yeah?)

This probably also caused two debuginfo-tests to regress when running against LLDB (at least that's the only relevant commit in the change log)

debuginfo-tests.dexter-tests.nrvo-string.cpp
debuginfo-tests.llgdb-tests.nrvo-string.cpp

See http://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/28154/

jmorse added a comment.Feb 8 2021, 4:31 AM

This probably also caused two debuginfo-tests to regress when running against LLDB (at least that's the only relevant commit in the change log)

That sounds a lot like it -- I'll get onto reverting then.

Maskrays reproducer ended up here https://reviews.llvm.org/D95617#2547111 I think, quoting:

Reproduce: https://gist.github.com/MaskRay/4c67b29bb038d5a016260437c9e25a7a

# git revert -n 56fa34ae3570a34fd0f4c2cf1bfaf095da01a959 # revert the workaround
% cat run.sh 
#!/bin/sh -e
llc -split-dwarf-file x.dwo -filetype=obj $1 -o x.o
if llvm-dwarfdump -debug-gnu-pubnames x.o 2> log; then
  exit 1
fi
grep 'name lookup table at' log
% ./run.sh a.ll
error: name lookup table at offset 0x12cc has a terminator at offset 0x12da before the expected end at 0x12fd

Thank you! Let me know if you need help reproducing the debuginfo failures.

jmorse added a comment.Feb 9 2021, 9:09 AM

The immediate cause of the failure in Maskrays reproducer is that a zero DIE offset is emitted to .debug_gnu_pubnames, which is interpreted as the (premature) end of the names list. Putting assert(Entity->getOffset() != 0); in [0] fails.

The reason why we get a zero offset is due to the following arrangements of debug-info:

  • This function [1] has an inlined copy of "LowLevelCallOnce<(lambda at internal/sysinfo.cc:299:28)>"
  • "LowLevelCallOnce" has a template type "Callable", with a type that's node !2140 in the input LLVM-IR,
  • That type is in the scope of a subprogram, "NumCPUs", which is created in a separate CU,
  • "NumCPUs" is in a namespace "base_internal", in another namespace "absl".

So far this is fine. With this patch applied we change the CU that "NumCPUs" is created in to another (from node !2 to !1446). However: this unit only has a skeleton DIE emitted. When the "absl" namespace is created, DwarfUnit adds it directly to the unit DIE (not the skeleton) and list of globals [2]. When emitting .debug_gnu_pubnames, we refer to a DIE that hasn't been (and won't be) emitted in the unit DIE, thus getting a zero offset.

Exactly which piece is at fault here is a bit opaque to me; I'll carry on studying what's going on.

[0] https://github.com/llvm/llvm-project/blob/79b222c39f0e4377b49191b6aba080b1607f3fa7/llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp#L2394
[1] "_ZN4absl13base_internal12CallOnceImplIZNS0_8SpinLock8SpinLoopEvE3$_0JEEEvPNSt3__16atomicIjEENS0_14SchedulingModeEOT_DpOT0_"
[2] https://github.com/llvm/llvm-project/blob/79b222c39f0e4377b49191b6aba080b1607f3fa7/llvm/lib/CodeGen/AsmPrinter/DwarfUnit.cpp#L1066

Could be related to a change I made maybe a couple of years ago to avoid emitting split units when there was nothing of interest in the split unit. I forget where that code is exactly - but this was part of a set of fixes I made for a few related issues due to a ThinLTO debug info case - something was causing excess CU fragments under ThinLTO, ecah of which was being split even though there wasn't anything interesting in the split part.

Might be necessary to tweak the condition for that CU elision - I can go looking for more detail on exactly where/how that works, if you like.

jmorse reopened this revision.Feb 10 2021, 3:25 AM

Might be necessary to tweak the condition for that CU elision - I can go looking for more detail on exactly where/how that works, if you like.

Yes please, it sounds like a likely cause of this. The global pubnames list seems like a side-channel that could affect that elision.

(Re-opening this as it's still not stuck a landing).

This revision is now accepted and ready to land.Feb 10 2021, 3:25 AM

I did some bisecting and I found out the debuginfo tests have incorrect build dependencies and this commit was *not* at fault (someone/something deleted the built test binaries and caused a long overdue rebuild). The commit that enabled the new pass manager actually caused the tests to fail.

kyulee added a subscriber: kyulee.Jul 9 2021, 12:37 AM
jmorse abandoned this revision.Aug 10 2021, 2:23 AM

Abandoning in favour of D107076, which reverts the shared-subprogram behaviour, no cycles to hunt down the which-unit-owns-what issues I ran into. I added the regression test from here in that revert.