Page MenuHomePhabricator

[RFC][DwarfDebug][AsmPrinter] Support emitting function-local declaration for a lexical block
AbandonedPublic

Authored by krisb on Nov 12 2021, 1:21 AM.

Details

Summary

This is another attempt to make function-local declarations
(like static variables, structs/classes, and other) be correctly emitted
within a lexical (bracketed) block.

Some ideas were taken from previous attempts [0] and [1].

Limitations:

Since we skip lexical blocks if we couldn't find non-scope children for them,
we need a way to force the emission of blocks that have static locals or local
type declarations but don't have local variables. As a workaround, I
collect such scopes by collectLocalScopesWithDeclsFromCU() and
store them in a set. This set is used later to check whether we need to emit
a lexical block. It, however, lacks information about used records
and typedefs because there is no access to them from a CU.
In this case, we will put the entity to a first available parent scope
(with subprogram as a bound).

Fixes https://bugs.llvm.org/show_bug.cgi?id=19238.

clang's part is at D113743.

[0] https://reviews.llvm.org/D11180
[1] https://reviews.llvm.org/D15976
[2] https://bugs.llvm.org/show_bug.cgi?id=27579

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
krisb added a comment.EditedDec 23 2021, 10:30 AM

@dblaikie added another attempt to revert this as a review: D116225. Will commit once pre-commit tests + local lldb tests passed.

About the issues.
First of all, thank you for all the details! I was able to reproduce both issues.
(1) The crash doesn't seem related to the patch and reverting it doesn't make things better. It seems there is an inconsistency between when/where an abstract origin gets emitted and when/where it's requested.
(2) The issue with wrong dwo attributes is indeed caused by the imported declaration that gets emitted where it shouldn't (at least comparing with the behavior before this patch).
Before this patch 'includeMinimalInlineScopes()' makes emission of imported entity skipped and the second CU assumed empty, so we did not emit dwo attributes for it.
But I'm not sure I fully understand how DwarfDebug handles split dwarf. Why do we need the 'includeMinimalInlineScopes()' check for correctness here?

krisb reopened this revision.Dec 23 2021, 10:42 AM
This revision is now accepted and ready to land.Dec 23 2021, 10:42 AM

@dblaikie added another attempt to revert this as a review: D116225. Will commit once pre-commit tests + local lldb tests passed.

About the issues.
First of all, thank you for all the details! I was able to reproduce both issues.
(1) The crash doesn't seem related to the patch and reverting it doesn't make things better. It seems there is an inconsistency between when/where an abstract origin gets emitted and when/where it's requested.

Hmm, bother - guess I failed to correctly reduce the issue. I'll try again.

(2) The issue with wrong dwo attributes is indeed caused by the imported declaration that gets emitted where it shouldn't (at least comparing with the behavior before this patch).
Before this patch 'includeMinimalInlineScopes()' makes emission of imported entity skipped and the second CU assumed empty, so we did not emit dwo attributes for it.
But I'm not sure I fully understand how DwarfDebug handles split dwarf. Why do we need the 'includeMinimalInlineScopes()' check for correctness here?

It gets a bit complicated because a DWO file can only contain a single CU, but the .o file can contain multiple CUs - so to preserve as much fidelity as possible, if "split dwarf inlining" is enabled (include inline stack frames in the CU in the .o file, so the program can be symbolized even without the dwo/dwp files) then the .o file will get multiple CUs with -g1-like debug info, and one of those CUs will also have the dwo attributes and be a skeleton for the .dwo CU (but the dwo file, instead of having multiple CUs to describe the cross-CU inlining, it'll degrade the debug info quality by moving the DIEs that should be in other CUs in the dwo into the single CU there).

Turned out we had two different patches causing the same assert (should've bisected them both to be sure - added some unnecessary confusion to this whole situation). One of my reduced test cases actually was for https://reviews.llvm.org/D115325#3209087 - working on another for this patch (D113741) now...

To continue with a more accurate fix, one example for your reference. (The behavior is changed with/without this reversion):

int main() {      // <====== function scope
  int x = 3;
  int num = 6;
  printf("%d\n", x);
  if (num) {   // <======= lexical block 1
    static int x = 45;
    printf("%d\n", x);
    if (x) {   // <======= lexical block 2
      int x = 145;
      printf("%d\n", x);
    }
  }
  return 0;
}

Without this patch:

<1><2a>: Abbrev Number: 2 (DW_TAG_subprogram)
   <2b>   DW_AT_low_pc      : 0x10010680
   <33>   DW_AT_high_pc     : 0xd4
   <37>   DW_AT_frame_base  : 1 byte block: 6f         (DW_OP_reg31 (r31))
   <39>   DW_AT_name        : (indirect string, offset: 0xcb): main
   <3f>   DW_AT_type        : <0x94>
   <43>   DW_AT_external    : 1
<2><43>: Abbrev Number: 3 (DW_TAG_variable)
   <44>   DW_AT_location    : 3 byte block: 91 f0 0    (DW_OP_fbreg: 112)
   <48>   DW_AT_name        : (indirect string, offset: 0xd4): x
   <4e>   DW_AT_type        : <0x94>
<2><52>: Abbrev Number: 3 (DW_TAG_variable)
   <53>   DW_AT_location    : 3 byte block: 91 ec 0    (DW_OP_fbreg: 108)
   <57>   DW_AT_name        : (indirect string, offset: 0xd0): num
   <5d>   DW_AT_type        : <0x94>
<2><61>: Abbrev Number: 4 (DW_TAG_lexical_block)     <<<<<<<<<<this should be for lexical block 2, so its DIE level should be 3
   <62>   DW_AT_low_pc      : 0x1001070c
   <6a>   DW_AT_high_pc     : 0x20
<3><6e>: Abbrev Number: 3 (DW_TAG_variable)             <<<<<<<<<this is for x=145 and nested in lexical block2, so its DIE level should be 4
   <6f>   DW_AT_location    : 3 byte block: 91 e8 0    (DW_OP_fbreg: 104)
   <73>   DW_AT_name        : (indirect string, offset: 0xd4): x
   <79>   DW_AT_type        : <0x94>
<3><7d>: Abbrev Number: 0
<2><7e>: Abbrev Number: 5 (DW_TAG_variable)              <<<<<<<<<<this is for static int x = 45, so there should be another lexical block DIE for lexical block 1, and its DIE level should be 2.
   <7f>   DW_AT_name        : (indirect string, offset: 0xd4): x
   <83>   DW_AT_type        : <0x94>
   <89>   DW_AT_location    : 9 byte block: 3 e8 a 3 10 0 0 0 0        (DW_OP_addr: 10030ae8)
<2><93>: Abbrev Number: 0
<1><94>: Abbrev Number: 6 (DW_TAG_base_type)
   <95>   DW_AT_name        : (indirect string, offset: 0x0): int
   <99>   DW_AT_encoding    : 5        (signed)
   <9a>   DW_AT_byte_size   : 4
<1><9b>: Abbrev Number: 0

And we get wrong result for static number x at line 7. In debugger, it is printed as 3.

With this patch:

0x0000002a:   DW_TAG_subprogram
                DW_AT_low_pc    (0x0000000010010680)
                DW_AT_high_pc   (0x0000000010010754)
                DW_AT_frame_base        (DW_OP_reg31 X31)
                DW_AT_name      ("main")
                DW_AT_type      (0x00000094 "int")
                DW_AT_external  (true)

0x00000043:     DW_TAG_variable
                  DW_AT_name    ("x")          <<<<<<<<<<<<we need a lexical block for this definition
                  DW_AT_type    (0x00000094 "int")
                  DW_AT_location        (DW_OP_addr 0x10030ae8)

0x00000058:     DW_TAG_variable
                  DW_AT_location        (DW_OP_fbreg +112)
                  DW_AT_name    ("x")
                  DW_AT_type    (0x00000094 "int")

0x00000067:     DW_TAG_variable
                  DW_AT_location        (DW_OP_fbreg +108)
                  DW_AT_name    ("num")
                  DW_AT_type    (0x00000094 "int")

0x00000076:     DW_TAG_lexical_block
                  DW_AT_low_pc  (0x000000001001070c)
                  DW_AT_high_pc (0x000000001001072c)

0x00000083:       DW_TAG_variable
                    DW_AT_location      (DW_OP_fbreg +104)
                    DW_AT_name  ("x")
                    DW_AT_type  (0x00000094 "int")

0x00000092:       NULL

Now we get wrong result for local variable x at line 3. In debugger, it is printed as 45. (The DIE for the local variable x is overridden by the static variable x in lexical block 1).

To continue with a more accurate fix, one example for your reference. (The behavior is changed with/without this reversion):

Sorry, I'm not following this comment ^ (& the rest) - could you rephrase if there's a question/concern you have I can help clarify/describe/work with you on?

To continue with a more accurate fix, one example for your reference. (The behavior is changed with/without this reversion):

Sorry, I'm not following this comment ^ (& the rest) - could you rephrase if there's a question/concern you have I can help clarify/describe/work with you on?

I guess there will be a follow-up patch for the function-local declaration for lexical block? This patch seems like a reversion of a previous functionality change?
In our internal test, we also found an issue related to variables in the lexical block, so I pasted it here for reference : )

krisb added a comment.Dec 28 2021, 4:13 PM

To continue with a more accurate fix, one example for your reference. (The behavior is changed with/without this reversion):

Sorry, I'm not following this comment ^ (& the rest) - could you rephrase if there's a question/concern you have I can help clarify/describe/work with you on?

I guess there will be a follow-up patch for the function-local declaration for lexical block? This patch seems like a reversion of a previous functionality change?
In our internal test, we also found an issue related to variables in the lexical block, so I pasted it here for reference : )

This is the functional change itself, reverting patch is at https://reviews.llvm.org/D116225.

Without this patch:

I guess you meant without the revert, and actually _with_ this patch, right?
If so, you'd need a clang side change https://reviews.llvm.org/D113743 (which has been reverted as well) to see static locals placed correctly.
If to apply all the reverted patches (D114705, D113741, and D113743) I believe you should see a correct DWARF compiled for your example.

If to apply all the reverted patches (D114705, D113741, and D113743) I believe you should see a correct DWARF compiled for your example.

Yeah, verify with manually applying all the above patches, it works as expected. Looking forward to seeing these patches merged in again.

krisb updated this revision to Diff 411133.Feb 24 2022, 7:46 AM
  • Rebase on ToT & minor fixes.
  • Fix the issue when an local imported entity caused emission of multiple CUs in a dwo when split dwarf is enabled.
  • Add a test for the aforementioned issue.
krisb added a comment.Feb 24 2022, 8:15 AM

@dblaikie I've updated the patch with a fix for split dwarf issue (the test is included). Could I ask you to check the full build on your side?

The issue was in wrong assumptions about in which CUs we need to create those local entities. Basically, they [CUs] should match the ones where the parent subprogram was emitted.
It's not an easy thing to find the particular list of suited CUs while we are in endModule() (especially, because split-dwarf and split-dwarf inlining introduce a tricky logic around), so I chose to collect this information during subprograms emission.
The solution still looks a bit hacky, but I hope this is a step forward.

What about the second issue, unfortunately (or fortunately? :)), I'm still able to reproduce the crash on ToT of main, so I'm not sure this patch may cause the issue. I'll appreciate if could check it once again.

Thank you and sorry for the long delay.

@dblaikie I've updated the patch with a fix for split dwarf issue (the test is included). Could I ask you to check the full build on your side?

The issue was in wrong assumptions about in which CUs we need to create those local entities. Basically, they [CUs] should match the ones where the parent subprogram was emitted.
It's not an easy thing to find the particular list of suited CUs while we are in endModule() (especially, because split-dwarf and split-dwarf inlining introduce a tricky logic around), so I chose to collect this information during subprograms emission.
The solution still looks a bit hacky, but I hope this is a step forward.

What about the second issue, unfortunately (or fortunately? :)), I'm still able to reproduce the crash on ToT of main, so I'm not sure this patch may cause the issue. I'll appreciate if could check it once again.

Thank you and sorry for the long delay.

All good.

I'm not immediately following why a list of CUs is needed/why we'd create the variable in more than one CU - could you explain that?

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1405–1406

This looks buggy - it's returning a pointer to a local variable (the TheCU pointer) which means it'll be dangling once the function returns, yeah? (I'd expect msan to catch this, for instance)

krisb updated this revision to Diff 411184.Feb 24 2022, 10:51 AM

Minor fix.

@dblaikie I've updated the patch with a fix for split dwarf issue (the test is included). Could I ask you to check the full build on your side?

The issue was in wrong assumptions about in which CUs we need to create those local entities. Basically, they [CUs] should match the ones where the parent subprogram was emitted.
It's not an easy thing to find the particular list of suited CUs while we are in endModule() (especially, because split-dwarf and split-dwarf inlining introduce a tricky logic around), so I chose to collect this information during subprograms emission.
The solution still looks a bit hacky, but I hope this is a step forward.

What about the second issue, unfortunately (or fortunately? :)), I'm still able to reproduce the crash on ToT of main, so I'm not sure this patch may cause the issue. I'll appreciate if could check it once again.

Thank you and sorry for the long delay.

All good.

I'm not immediately following why a list of CUs is needed/why we'd create the variable in more than one CU - could you explain that?

Well, this is because we may have a parent subprogram emitted in more than one CU (in the case of split dwarf), and we do not know in which one.
Here are the situations I see (hope I got them right):
(1) If split dwarf disabled, we have a single list of subprogram's abstract origins and can reference them between CUs. So, we'll emit a concrete definition and its abstract origin (any existing) in the same CU. In this case, we have only one CU.
(2) If we have split dwarf enabled, and shareAcrossDWOCUs() is false (it's false by default), every CU will maintain its own abstract origin's list. In this case, we cannot have references between CUs, and if we have cross-CU inlining, we'll create:

  • concrete def -> in the CU the subprogram is defined,
  • abstract origin -> in every CU the subprogram is inlined. For all local entities we should follow the same logic.

(3) SplitDwarfInlining adds skeleton units to this list. Basically, we do not need to emit any local entities for skeleton units in this case, but I just reused already existing logic around includeMinimalInlineScopes().

The reproducer you provided is about the case (2). We tried to emit the imported entity in a CU, where its parent subprogram was defined, but the subprogram itself was emitted in the CU it was inlined.

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1405–1406

Uuups, bad experiments. Fixed.

dblaikie added inline comments.Feb 24 2022, 10:57 AM
llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1405–1406

This code still looks buggy, it's still returning the address of a local variable (the address of "TheU" pointer) so it'd still be UB/dangling/msan-unclean, I think?

krisb updated this revision to Diff 411198.Feb 24 2022, 11:29 AM

Attempt 2.

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1405–1406

Hurry isn't a good way to do smth good. Sorry. One more try and I'll go to check this with msan.

@dblaikie I've updated the patch with a fix for split dwarf issue (the test is included). Could I ask you to check the full build on your side?

The issue was in wrong assumptions about in which CUs we need to create those local entities. Basically, they [CUs] should match the ones where the parent subprogram was emitted.
It's not an easy thing to find the particular list of suited CUs while we are in endModule() (especially, because split-dwarf and split-dwarf inlining introduce a tricky logic around), so I chose to collect this information during subprograms emission.
The solution still looks a bit hacky, but I hope this is a step forward.

What about the second issue, unfortunately (or fortunately? :)), I'm still able to reproduce the crash on ToT of main, so I'm not sure this patch may cause the issue. I'll appreciate if could check it once again.

Thank you and sorry for the long delay.

All good.

I'm not immediately following why a list of CUs is needed/why we'd create the variable in more than one CU - could you explain that?

Well, this is because we may have a parent subprogram emitted in more than one CU (in the case of split dwarf), and we do not know in which one.
Here are the situations I see (hope I got them right):
(1) If split dwarf disabled, we have a single list of subprogram's abstract origins and can reference them between CUs. So, we'll emit a concrete definition and its abstract origin (any existing) in the same CU. In this case, we have only one CU.
(2) If we have split dwarf enabled, and shareAcrossDWOCUs() is false (it's false by default), every CU will maintain its own abstract origin's list. In this case, we cannot have references between CUs, and if we have cross-CU inlining, we'll create:

  • concrete def -> in the CU the subprogram is defined,
  • abstract origin -> in every CU the subprogram is inlined. For all local entities we should follow the same logic.

Do you have a test case for this (sorry, I realize it might be one of the test cases in the patch - I haven't looked/would appreciate some pointers)? My recollection is that without shareAcrossDWOCUs we should probably be only emitting one CU if using Split DWARF - Split DWARF doesn't really support multiple CUs in a single .dwo file (& we only support emitting a single .dwo for a given .o file) - but I guess maybe we do in full LTO mode, even if we don't do it in ThinLTO mode (ThinLTO's the only thing I've been especially concerned with/the only mode we use at Google)?

(3) SplitDwarfInlining adds skeleton units to this list. Basically, we do not need to emit any local entities for skeleton units in this case, but I just reused already existing logic around includeMinimalInlineScopes().

The reproducer you provided is about the case (2). We tried to emit the imported entity in a CU, where its parent subprogram was defined, but the subprogram itself was emitted in the CU it was inlined.

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1405–1406

That doesn't seem like a valid range either - that range would be from "TheCU" until nullptr - so a huge chunk of the address space?

I think now you need/could have "TheCU + 1" as the end of the range and that'd be correct, now that it's passed in a pointer-to-pointer.

krisb added a comment.Feb 24 2022, 1:18 PM

All good.

I'm not immediately following why a list of CUs is needed/why we'd create the variable in more than one CU - could you explain that?

Well, this is because we may have a parent subprogram emitted in more than one CU (in the case of split dwarf), and we do not know in which one.
Here are the situations I see (hope I got them right):
(1) If split dwarf disabled, we have a single list of subprogram's abstract origins and can reference them between CUs. So, we'll emit a concrete definition and its abstract origin (any existing) in the same CU. In this case, we have only one CU.
(2) If we have split dwarf enabled, and shareAcrossDWOCUs() is false (it's false by default), every CU will maintain its own abstract origin's list. In this case, we cannot have references between CUs, and if we have cross-CU inlining, we'll create:

  • concrete def -> in the CU the subprogram is defined,
  • abstract origin -> in every CU the subprogram is inlined. For all local entities we should follow the same logic.

Do you have a test case for this (sorry, I realize it might be one of the test cases in the patch - I haven't looked/would appreciate some pointers)? My recollection is that without shareAcrossDWOCUs we should probably be only emitting one CU if using Split DWARF - Split DWARF doesn't really support multiple CUs in a single .dwo file (& we only support emitting a single .dwo for a given .o file) - but I guess maybe we do in full LTO mode, even if we don't do it in ThinLTO mode (ThinLTO's the only thing I've been especially concerned with/the only mode we use at Google)?

I was inspired by these two tests:

LLVM :: DebugInfo/X86/split-dwarf-cross-cu-gmlt-g.ll
LLVM :: DebugInfo/X86/split-dwarf-cross-unit-reference.ll

They test the case I described in (2), with 2 CUs in a single .dwo.

krisb updated this revision to Diff 411797.Feb 28 2022, 5:29 AM

Apply David's comment and add 'REQUIRES: object-emission' to tests.

krisb added a comment.Feb 28 2022, 5:29 AM

I'm not immediately following why a list of CUs is needed/why we'd create the variable in more than one CU - could you explain that?

Well, this is because we may have a parent subprogram emitted in more than one CU (in the case of split dwarf), and we do not know in which one.
Here are the situations I see (hope I got them right):
(1) If split dwarf disabled, we have a single list of subprogram's abstract origins and can reference them between CUs. So, we'll emit a concrete definition and its abstract origin (any existing) in the same CU. In this case, we have only one CU.
(2) If we have split dwarf enabled, and shareAcrossDWOCUs() is false (it's false by default), every CU will maintain its own abstract origin's list. In this case, we cannot have references between CUs, and if we have cross-CU inlining, we'll create:

  • concrete def -> in the CU the subprogram is defined,
  • abstract origin -> in every CU the subprogram is inlined. For all local entities we should follow the same logic.

Do you have a test case for this (sorry, I realize it might be one of the test cases in the patch - I haven't looked/would appreciate some pointers)? My recollection is that without shareAcrossDWOCUs we should probably be only emitting one CU if using Split DWARF - Split DWARF doesn't really support multiple CUs in a single .dwo file (& we only support emitting a single .dwo for a given .o file) - but I guess maybe we do in full LTO mode, even if we don't do it in ThinLTO mode (ThinLTO's the only thing I've been especially concerned with/the only mode we use at Google)?

Okay, let's summarize this a bit.

Statement 1. The problem with the example you provided above was because this patch made subprograms and their local "global" entities emitted in different CUs in the case of split DWARF. This is a bug, for sure.
Statement 2. Currently, there is nothing in the code that restricts emitting multiple CUs in a single .dwo (whether for FullLTO or for manually linked modules (as in the tests) we will generate more than one CU per .dwo).

All my assumptions and conclusions are based on the code and existing regression tests. Unfortunately, I'm not aware of the real situation of split DWARF or split DWARF inlining support (and I couldn't find anything related in the docs either).

Having this, I think we cannot assume that a subprogram is guaranteed to be emitted only in one CU, unless we explicitly disallow the case with multiple CUs per .dwo. DWARF standard seems to assume only one CU per .dwo (but I couldn't find any explicit wording about restrictions on the number of CUs in a .dwo). However, it looks like GDB is okay and works fine with more than one CU in a .dwo. It'd be good to have this situation clarified. But it seems to be out of the scope of this patch. Do you have any thoughs on how to proceed with this?

(3) SplitDwarfInlining adds skeleton units to this list. Basically, we do not need to emit any local entities for skeleton units in this case, but I just reused already existing logic around includeMinimalInlineScopes().

The reproducer you provided is about the case (2). We tried to emit the imported entity in a CU, where its parent subprogram was defined, but the subprogram itself was emitted in the CU it was inlined.

Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 2:06 AM

Friendly ping :)

Friendly ping :)

Sorry I haven't got to this yet - it is on my list.

I'm not immediately following why a list of CUs is needed/why we'd create the variable in more than one CU - could you explain that?

Well, this is because we may have a parent subprogram emitted in more than one CU (in the case of split dwarf), and we do not know in which one.
Here are the situations I see (hope I got them right):
(1) If split dwarf disabled, we have a single list of subprogram's abstract origins and can reference them between CUs. So, we'll emit a concrete definition and its abstract origin (any existing) in the same CU. In this case, we have only one CU.
(2) If we have split dwarf enabled, and shareAcrossDWOCUs() is false (it's false by default), every CU will maintain its own abstract origin's list. In this case, we cannot have references between CUs, and if we have cross-CU inlining, we'll create:

  • concrete def -> in the CU the subprogram is defined,
  • abstract origin -> in every CU the subprogram is inlined. For all local entities we should follow the same logic.

Do you have a test case for this (sorry, I realize it might be one of the test cases in the patch - I haven't looked/would appreciate some pointers)? My recollection is that without shareAcrossDWOCUs we should probably be only emitting one CU if using Split DWARF - Split DWARF doesn't really support multiple CUs in a single .dwo file (& we only support emitting a single .dwo for a given .o file) - but I guess maybe we do in full LTO mode, even if we don't do it in ThinLTO mode (ThinLTO's the only thing I've been especially concerned with/the only mode we use at Google)?

Okay, let's summarize this a bit.

Statement 1. The problem with the example you provided above was because this patch made subprograms and their local "global" entities emitted in different CUs in the case of split DWARF. This is a bug, for sure.

Is that bug now addressed in the proposed patch & with tests to demonstrate it's fixed?

Statement 2. Currently, there is nothing in the code that restricts emitting multiple CUs in a single .dwo (whether for FullLTO or for manually linked modules (as in the tests) we will generate more than one CU per .dwo).

I believe DwarfDebug.cpp:SplitDwarfCrossCuReferences is intended to disallow this functionality, so far as I can tell/remember?

All my assumptions and conclusions are based on the code and existing regression tests. Unfortunately, I'm not aware of the real situation of split DWARF or split DWARF inlining support (and I couldn't find anything related in the docs either).

Having this, I think we cannot assume that a subprogram is guaranteed to be emitted only in one CU, unless we explicitly disallow the case with multiple CUs per .dwo. DWARF standard seems to assume only one CU per .dwo (but I couldn't find any explicit wording about restrictions on the number of CUs in a .dwo). However, it looks like GDB is okay and works fine with more than one CU in a .dwo. It'd be good to have this situation clarified. But it seems to be out of the scope of this patch. Do you have any thoughs on how to proceed with this?

One of the issues is that the dwp file format describes an index by CU hash (DWO ID) - it's unclear what it would mean to have multiple entries that overlap (because a single region contains multiple CUs) and where the region doesn't start with the CU for the lookup (because it's some region with multiple units).

krisb added a comment.Mar 30 2022, 2:43 AM

Statement 1. The problem with the example you provided above was because this patch made subprograms and their local "global" entities emitted in different CUs in the case of split DWARF. This is a bug, for sure.

Is that bug now addressed in the proposed patch & with tests to demonstrate it's fixed?

I believe, it is. I created a test from the example you provided above, see llvm/test/DebugInfo/Generic/split-dwarf-local-import.ll.
It checks that the second CU doesn't have DW_AT_GNU_dwo_name and DW_AT_GNU_dwo_id and the imported entity has a proper parent.
But, if you have a chance to check the updated patch in the same conditions this bug originally reproduced, I'll highly appreciate it.

Statement 2. Currently, there is nothing in the code that restricts emitting multiple CUs in a single .dwo (whether for FullLTO or for manually linked modules (as in the tests) we will generate more than one CU per .dwo).

I believe DwarfDebug.cpp:SplitDwarfCrossCuReferences is intended to disallow this functionality, so far as I can tell/remember?

Hmm, I'm not sure fully I understood your point. SplitDwarfCrossCuReferences seems to _allow_ references between CUs, but it isn't enabled by default.
For example, if you enable split-dwarf with FullLTO, clang would emit a single .dwo per complied&linked binary with all the CUs in this .dwo. Neither clang nor backend would emit any errors/warnings.
I'm not insisting this is correct, or this may be useful for someone (I'm simply not aware), just want to mention that this is possible.

All my assumptions and conclusions are based on the code and existing regression tests. Unfortunately, I'm not aware of the real situation of split DWARF or split DWARF inlining support (and I couldn't find anything related in the docs either).

Having this, I think we cannot assume that a subprogram is guaranteed to be emitted only in one CU, unless we explicitly disallow the case with multiple CUs per .dwo. DWARF standard seems to assume only one CU per .dwo (but I couldn't find any explicit wording about restrictions on the number of CUs in a .dwo). However, it looks like GDB is okay and works fine with more than one CU in a .dwo. It'd be good to have this situation clarified. But it seems to be out of the scope of this patch. Do you have any thoughs on how to proceed with this?

One of the issues is that the dwp file format describes an index by CU hash (DWO ID) - it's unclear what it would mean to have multiple entries that overlap (because a single region contains multiple CUs) and where the region doesn't start with the CU for the lookup (because it's some region with multiple units).

So the question is: should we explicitly disallow emitting multiple CUs in .dwo (meaning, for example, split-dwarf with FullLTO w/o split-dwarf-cross-cu-references)? By reporting an error or at least specifying somewhere in docs that this case it not correct thus not supported?

jmorse added a subscriber: jmorse.Apr 1 2022, 6:05 AM

I tried applying this internally & running the same build that tripped over something last time this was committed & I seem to be tripping over the same thing:

(.debug_gnu_pubnames): name lookup table at offset 0x38cd has a terminator at offset 0x38db before the expected end at 0x3a3d

As well as a bunch of unresolved symbols during linking... :/

I'll have to work on reducing a test case for this.

krisb added a comment.Apr 8 2022, 12:28 PM

I tried applying this internally & running the same build that tripped over something last time this was committed & I seem to be tripping over the same thing:

(.debug_gnu_pubnames): name lookup table at offset 0x38cd has a terminator at offset 0x38db before the expected end at 0x3a3d

As well as a bunch of unresolved symbols during linking... :/

I'll have to work on reducing a test case for this.

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

Drive-by comment on the topic of the gnu pubnames error,

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

I've run into that error myself when trying to alter which unit a subprogram definition goes in -- see the comment at D94976#2543257 . That patch never landed in the end because I couldn't figure out exactly what was wrong. IMHO, there's some latent weirdness that triggers that error, exactly how is unclear though. IMO, it's not specific to rG78d15a112cbd5 .

[0] "(.debug_gnu_pubnames): name lookup table at offset (...) has a terminator at offset (...) before the expected end at (...)"

I tried applying this internally & running the same build that tripped over something last time this was committed & I seem to be tripping over the same thing:

(.debug_gnu_pubnames): name lookup table at offset 0x38cd has a terminator at offset 0x38db before the expected end at 0x3a3d

As well as a bunch of unresolved symbols during linking... :/

I'll have to work on reducing a test case for this.

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

That's correct. Sorry I lost context on this again and don't think I have a reduced test case/something to work from right now, but I'll keep in on my radar to work on.

krisb added a comment.May 16 2022, 8:41 AM

There is an alternative implementation that relies on addition field of DISubprogram/DILexicalScope that tracks static locals, local imports and types D125693.


Drive-by comment on the topic of the gnu pubnames error,

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

I've run into that error myself when trying to alter which unit a subprogram definition goes in -- see the comment at D94976#2543257 . That patch never landed in the end because I couldn't figure out exactly what was wrong. IMHO, there's some latent weirdness that triggers that error, exactly how is unclear though. IMO, it's not specific to rG78d15a112cbd5 .

[0] "(.debug_gnu_pubnames): name lookup table at offset (...) has a terminator at offset (...) before the expected end at (...)"

I've tried to reproduce the issue using the example from D94976 (and with this patch), it compiled fine and didn't show the mentioned problem with pubnames.
Another question is about the combination of FullLTO and split-dwarf which has been already discussed in this thread and which isn't (fully) supported by LLVM mainline (yet, at least in a way that doesn't violate DWARF specification).

So, I'm still looking for a reproducer for the issue with pubnames. Any help will be appreciated.

There is an alternative implementation that relies on addition field of DISubprogram/DILexicalScope that tracks static locals, local imports and types D125693.

Ah, OK - so that 5-part sequence is an alternative to this one?

Might be good to figure out the issue here before debating which direction to go in, etc. (hmm, one of the patches in that series I've already approved? So am I reviewing alternatives here? Given the complexity here that's a bit costly for me/end up feeling like I'm drowning under trying to figure out all these moving parts... - a linear sequence is generally OK/good, I can review the easy parts and focus on the difficult ones, but multiple large alternatives can be a bit daunting)


Drive-by comment on the topic of the gnu pubnames error,

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

I've run into that error myself when trying to alter which unit a subprogram definition goes in -- see the comment at D94976#2543257 . That patch never landed in the end because I couldn't figure out exactly what was wrong. IMHO, there's some latent weirdness that triggers that error, exactly how is unclear though. IMO, it's not specific to rG78d15a112cbd5 .

[0] "(.debug_gnu_pubnames): name lookup table at offset (...) has a terminator at offset (...) before the expected end at (...)"

I've tried to reproduce the issue using the example from D94976 (and with this patch), it compiled fine and didn't show the mentioned problem with pubnames.
Another question is about the combination of FullLTO and split-dwarf which has been already discussed in this thread and which isn't (fully) supported by LLVM mainline (yet, at least in a way that doesn't violate DWARF specification).

So, I'm still looking for a reproducer for the issue with pubnames. Any help will be appreciated.

Yeah, I still need to figure out how to do that/get you that. :/

(is this at all/possibly related to https://reviews.llvm.org/D124892 ?)

There is an alternative implementation that relies on addition field of DISubprogram/DILexicalScope that tracks static locals, local imports and types D125693.

Ah, OK - so that 5-part sequence is an alternative to this one?

Might be good to figure out the issue here before debating which direction to go in, etc. (hmm, one of the patches in that series I've already approved? So am I reviewing alternatives here? Given the complexity here that's a bit costly for me/end up feeling like I'm drowning under trying to figure out all these moving parts... - a linear sequence is generally OK/good, I can review the easy parts and focus on the difficult ones, but multiple large alternatives can be a bit daunting)


Drive-by comment on the topic of the gnu pubnames error,

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

I've run into that error myself when trying to alter which unit a subprogram definition goes in -- see the comment at D94976#2543257 . That patch never landed in the end because I couldn't figure out exactly what was wrong. IMHO, there's some latent weirdness that triggers that error, exactly how is unclear though. IMO, it's not specific to rG78d15a112cbd5 .

[0] "(.debug_gnu_pubnames): name lookup table at offset (...) has a terminator at offset (...) before the expected end at (...)"

I've tried to reproduce the issue using the example from D94976 (and with this patch), it compiled fine and didn't show the mentioned problem with pubnames.
Another question is about the combination of FullLTO and split-dwarf which has been already discussed in this thread and which isn't (fully) supported by LLVM mainline (yet, at least in a way that doesn't violate DWARF specification).

So, I'm still looking for a reproducer for the issue with pubnames. Any help will be appreciated.

Yeah, I still need to figure out how to do that/get you that. :/

(is this at all/possibly related to https://reviews.llvm.org/D124892 ?)

Seems like a discussion is more about multiple CUs in DWO with full LTO? Need to read all the history in more details.
Side note BOLT doesn't support multiple CUs in the dwo. There are certain assumptions about abbrev, and how it writes out new dwo files.
Does pubnames actually used for anything? My impression is that debuggers (or at least lldb) ignores it.

There is an alternative implementation that relies on addition field of DISubprogram/DILexicalScope that tracks static locals, local imports and types D125693.

Ah, OK - so that 5-part sequence is an alternative to this one?

Might be good to figure out the issue here before debating which direction to go in, etc. (hmm, one of the patches in that series I've already approved? So am I reviewing alternatives here? Given the complexity here that's a bit costly for me/end up feeling like I'm drowning under trying to figure out all these moving parts... - a linear sequence is generally OK/good, I can review the easy parts and focus on the difficult ones, but multiple large alternatives can be a bit daunting)

Ping on this ^


Drive-by comment on the topic of the gnu pubnames error,

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

I've run into that error myself when trying to alter which unit a subprogram definition goes in -- see the comment at D94976#2543257 . That patch never landed in the end because I couldn't figure out exactly what was wrong. IMHO, there's some latent weirdness that triggers that error, exactly how is unclear though. IMO, it's not specific to rG78d15a112cbd5 .

[0] "(.debug_gnu_pubnames): name lookup table at offset (...) has a terminator at offset (...) before the expected end at (...)"

I've tried to reproduce the issue using the example from D94976 (and with this patch), it compiled fine and didn't show the mentioned problem with pubnames.
Another question is about the combination of FullLTO and split-dwarf which has been already discussed in this thread and which isn't (fully) supported by LLVM mainline (yet, at least in a way that doesn't violate DWARF specification).

So, I'm still looking for a reproducer for the issue with pubnames. Any help will be appreciated.

Yeah, I still need to figure out how to do that/get you that. :/

(is this at all/possibly related to https://reviews.llvm.org/D124892 ?)

Seems like a discussion is more about multiple CUs in DWO with full LTO? Need to read all the history in more details.
Side note BOLT doesn't support multiple CUs in the dwo. There are certain assumptions about abbrev, and how it writes out new dwo files.
Does pubnames actually used for anything? My impression is that debuggers (or at least lldb) ignores it.

Yep, debug_gnu_pubnames are used by gold and lld to build gdb_index which improves gdb startup time.

There is an alternative implementation that relies on addition field of DISubprogram/DILexicalScope that tracks static locals, local imports and types D125693.

Ah, OK - so that 5-part sequence is an alternative to this one?

Might be good to figure out the issue here before debating which direction to go in, etc. (hmm, one of the patches in that series I've already approved? So am I reviewing alternatives here? Given the complexity here that's a bit costly for me/end up feeling like I'm drowning under trying to figure out all these moving parts... - a linear sequence is generally OK/good, I can review the easy parts and focus on the difficult ones, but multiple large alternatives can be a bit daunting)

Ping on this ^


Drive-by comment on the topic of the gnu pubnames error,

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

I've run into that error myself when trying to alter which unit a subprogram definition goes in -- see the comment at D94976#2543257 . That patch never landed in the end because I couldn't figure out exactly what was wrong. IMHO, there's some latent weirdness that triggers that error, exactly how is unclear though. IMO, it's not specific to rG78d15a112cbd5 .

[0] "(.debug_gnu_pubnames): name lookup table at offset (...) has a terminator at offset (...) before the expected end at (...)"

I've tried to reproduce the issue using the example from D94976 (and with this patch), it compiled fine and didn't show the mentioned problem with pubnames.
Another question is about the combination of FullLTO and split-dwarf which has been already discussed in this thread and which isn't (fully) supported by LLVM mainline (yet, at least in a way that doesn't violate DWARF specification).

So, I'm still looking for a reproducer for the issue with pubnames. Any help will be appreciated.

Yeah, I still need to figure out how to do that/get you that. :/

(is this at all/possibly related to https://reviews.llvm.org/D124892 ?)

Seems like a discussion is more about multiple CUs in DWO with full LTO? Need to read all the history in more details.
Side note BOLT doesn't support multiple CUs in the dwo. There are certain assumptions about abbrev, and how it writes out new dwo files.
Does pubnames actually used for anything? My impression is that debuggers (or at least lldb) ignores it.

Yep, debug_gnu_pubnames are used by gold and lld to build gdb_index which improves gdb startup time.

Ah right, forgot about it.

krisb added a comment.May 25 2022, 9:43 AM

There is an alternative implementation that relies on addition field of DISubprogram/DILexicalScope that tracks static locals, local imports and types D125693.

Ah, OK - so that 5-part sequence is an alternative to this one?

Mostly. The alternative implementation reuses large part of work of this patch.
Basically, this first implementation has (had) 3 patches in sequence:

  • D114705 (moving handling of globals, types and imports from beginModule() to endModule() in DwarfDebug.cpp)
  • D113741 (this one, backend's part)
  • D113743 (clang's part)

The new one has 5 patches:

  • D125691 (adding new 'localDecls' field to DILexicalBlock and DISubprogram),
  • D114705 (taken from previous implementation w/o changes),
  • D125693 (backend's part, replacement for D113741)
  • D125694, D125695 (clang's part, the same as D113743, but split on 2 patches and with some minor improvements).

Might be good to figure out the issue here before debating which direction to go in, etc. (hmm, one of the patches in that series I've already approved? So am I reviewing alternatives here? Given the complexity here that's a bit costly for me/end up feeling like I'm drowning under trying to figure out all these moving parts... - a linear sequence is generally OK/good, I can review the easy parts and focus on the difficult ones, but multiple large alternatives can be a bit daunting)

I'm ready to continue working with this patchset to find and fix the issue with pubnames, but I have no ideas how I can reproduce it, so I decided to move forward and try another approach which looks a bit better to me.
If you have some time to help me finding the reproducer or could give some hints on this, I'll be happy to fix the aforementioned problem first.

Drive-by comment on the topic of the gnu pubnames error,

If I remember correctly, this issue with .debug_gnu_pubnames was caused by another patch (https://reviews.llvm.org/rG78d15a112cbd545fbb6e1aa37c221ef5aeffb3f2).
Did I understand you right that the same issue is reproduced with w/o that patch?

I've run into that error myself when trying to alter which unit a subprogram definition goes in -- see the comment at D94976#2543257 . That patch never landed in the end because I couldn't figure out exactly what was wrong. IMHO, there's some latent weirdness that triggers that error, exactly how is unclear though. IMO, it's not specific to rG78d15a112cbd5 .

[0] "(.debug_gnu_pubnames): name lookup table at offset (...) has a terminator at offset (...) before the expected end at (...)"

I've tried to reproduce the issue using the example from D94976 (and with this patch), it compiled fine and didn't show the mentioned problem with pubnames.
Another question is about the combination of FullLTO and split-dwarf which has been already discussed in this thread and which isn't (fully) supported by LLVM mainline (yet, at least in a way that doesn't violate DWARF specification).

So, I'm still looking for a reproducer for the issue with pubnames. Any help will be appreciated.

Yeah, I still need to figure out how to do that/get you that. :/

(is this at all/possibly related to https://reviews.llvm.org/D124892 ?)

One of the issues you mentioned here https://reviews.llvm.org/D113741#3207125 seems to relate to D124892, but as I said, it reproduces w/o this patchset and likely has different root cause.

OK, so if I want to reproduce the issue for you, I need to apply these three patches? D114705 D113741 D113743

krisb added a comment.May 25 2022, 1:53 PM

OK, so if I want to reproduce the issue for you, I need to apply these three patches? D114705 D113741 D113743

You, likely, not need D113743 (it'd been reverted just after being committed, so it seems it couldn't be the guilty one).

OK, so if I want to reproduce the issue for you, I need to apply these three patches? D114705 D113741 D113743

OK, reapplied the patches and reproduced the issue with a simple test program using Control Flow Integrity. I'm not sure entirely how easy this will be to extract into something reproducible outside Google's internal build system (my understanding is that CFI is a whole-program (including the standard library) sort of thing, which might make satisfying those requirements outside difficult, though not impossible - the features do exist outside of Google/are usable/tested there, etc) but I'll get started.

@dblaikie, thank you for the reproducers!

I've tried to compile&examine them, but haven't found any issues. Likely, I did something wrong.
Could you, please, give more details on how to reproduce the issue(s)? I mean, exact compilation command line and/or crash details / wrong dwarf output? Thank you!

Oh, right, sorry:

For the first one, you should observe a crash (in a +Asserts build) when running this command:

$ clang++-tot ab-crash.ll -gsplit-dwarf -c -O1

The backtrace looks something like this:

$ clang++-tot ab-crash.ll -gsplit-dwarf -c -O1 -S 
clang++-tot: /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:660: llvm::DIE *llvm::DwarfCompileUnit::constructInlinedScopeDIE(llvm::LexicalScope *): Assertion `OriginDIE && "Unable to find original DIE for an inlined subprogram."' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: clang++-tot ab-crash.ll -gsplit-dwarf -c -O1 -S
1.      Code generation
2.      Running pass 'Function Pass Manager' on module 'ab-crash.ll'.
3.      Running pass 'X86 Assembly Printer' on function '@main'
 #0 0x0000000009e604ea llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x0000000009e6069b PrintStackTraceSignalHandler(void*) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x0000000009e5ed63 llvm::sys::RunSignalHandlers() /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/Support/Signals.cpp:97:5
 #3 0x0000000009e5fdfe llvm::sys::CleanupOnSignal(unsigned long) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/Support/Unix/Signals.inc:362:1
 #4 0x0000000009d6fa4e (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/Support/CrashRecoveryContext.cpp:0:7
 #5 0x0000000009d6fe03 CrashRecoverySignalHandler(int) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/Support/CrashRecoveryContext.cpp:390:1
 #6 0x00007f60568c08e0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x138e0)
 #7 0x00007f6056389e71 raise ./signal/../sysdeps/unix/sysv/linux/raise.c:50:1
 #8 0x00007f6056373536 abort ./stdlib/abort.c:81:7
 #9 0x00007f605637341f get_sysdep_segment_value ./intl/loadmsgcat.c:509:8
#10 0x00007f605637341f _nl_load_domain ./intl/loadmsgcat.c:970:34
#11 0x00007f60563827f2 (/lib/x86_64-linux-gnu/libc.so.6+0x357f2)
#12 0x000000000b57ce6e llvm::DwarfCompileUnit::constructInlinedScopeDIE(llvm::LexicalScope*) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:0:3
#13 0x000000000b57cb6c llvm::DwarfCompileUnit::constructScopeDIE(llvm::LexicalScope*, llvm::DIE&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:538:10
#14 0x000000000b57d3ae llvm::DwarfCompileUnit::createAndAddScopeChildren(llvm::LexicalScope*, llvm::DIE&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:1072:7
#15 0x000000000b57efa1 llvm::DwarfCompileUnit::constructSubprogramScopeDIE(llvm::DISubprogram const*, llvm::LexicalScope*) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:1023:14
#16 0x000000000b52bc0d llvm::DwarfDebug::endFunctionImpl(llvm::MachineFunction const*) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp:2301:50
#17 0x000000000b505ab2 llvm::DebugHandlerBase::endFunction(llvm::MachineFunction const*) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/DebugHandlerBase.cpp:0:5
#18 0x000000000b4e1f4c llvm::AsmPrinter::emitFunctionBody() /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp:1505:3
#19 0x0000000007ed7113 llvm::X86AsmPrinter::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/Target/X86/X86AsmPrinter.cpp:82:3
#20 0x0000000008a96477 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/CodeGen/MachineFunctionPass.cpp:72:8
#21 0x0000000009165d5e llvm::FPPassManager::runOnFunction(llvm::Function&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/IR/LegacyPassManager.cpp:1439:23
#22 0x000000000916abb2 llvm::FPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/IR/LegacyPassManager.cpp:1485:16
#23 0x0000000009166649 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/IR/LegacyPassManager.cpp:1554:23
#24 0x00000000091661bd llvm::legacy::PassManagerImpl::run(llvm::Module&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/IR/LegacyPassManager.cpp:542:16
#25 0x000000000916ae91 llvm::legacy::PassManager::run(llvm::Module&) /usr/local/google/home/blaikie/dev/llvm/src/llvm/lib/IR/LegacyPassManager.cpp:1681:3

But that's actually not the original situation I was investigating, so maybe it's another bug? (possibly same root cause, though)

Though the original issue I was investigating was from invalid .debug_gnu_pubnames (DIE offsets of 0, which prematurely terminates the .debug_gnu_pubnames list and produces a warning from llvm-dwarfdump), and by adding assert(Offset) in llvm::DIE::getOffset (initially I added it in the pubnames emission code, but generally it should be true that the offset of a DIE should not be queried until it's been computed, and no DIE has offset zero, because the offset is from the start of the CU contribution - so the headers come first and mean the DIE offset is always non-zero). It seems I still don't have a small reproduction for that exact issue, though.

The second test case provided above, that produces multiple DWO CUs can be reproduced as follows:

$ clang++-tot ab-multi-cu.ll -gsplit-dwarf -c -O1 && llvm-dwarfdump-tot ab-multi-cu.o
ab-multi-cu.o:  file format elf64-x86-64

.debug_info contents:
0x00000000: Compile Unit: length = 0x00000053, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x00000057)

0x0000000b: DW_TAG_compile_unit
              DW_AT_stmt_list   (0x00000000)
              DW_AT_comp_dir    ("/usr/local/google/home/blaikie/dev/scratch")
              DW_AT_GNU_pubnames        (true)
              DW_AT_GNU_dwo_name        ("ab-multi-cu.dwo")
              DW_AT_GNU_dwo_id  (0xe9d1660e0395035d)
              DW_AT_low_pc      (0x0000000000000000)
              DW_AT_high_pc     (0x000000000000000d)
              DW_AT_GNU_addr_base       (0x00000000)

0x00000030:   DW_TAG_subprogram
                DW_AT_low_pc    (0x0000000000000000)
                DW_AT_high_pc   (0x000000000000000d)
                DW_AT_name      ("main")

0x00000041:     DW_TAG_inlined_subroutine
                  DW_AT_abstract_origin (0x000000000000007b "f1")
                  DW_AT_low_pc  (0x0000000000000004)
                  DW_AT_high_pc (0x0000000000000009)
                  DW_AT_call_file       ("/usr/local/google/home/blaikie/dev/scratch/b.cpp")
                  DW_AT_call_line       (3)
                  DW_AT_call_column     (0x03)

0x00000055:     NULL

0x00000056:   NULL
0x00000057: Compile Unit: length = 0x00000027, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x00000082)

0x00000062: DW_TAG_compile_unit
              DW_AT_stmt_list   (0x0000004c)
              DW_AT_comp_dir    ("/usr/local/google/home/blaikie/dev/scratch")
              DW_AT_GNU_pubnames        (true)
              DW_AT_GNU_dwo_name        ("ab-multi-cu.dwo")
              DW_AT_GNU_dwo_id  (0xeb227ecffcd242c1)
              DW_AT_GNU_addr_base       (0x00000000)

0x0000007b:   DW_TAG_subprogram
                DW_AT_name      ("f1")
                DW_AT_inline    (DW_INL_inlined)

0x00000081:   NULL

In this case the second CU is OK (for the cross-CU inlining that's occurred), but it shouldn't have any DWO_* attributes, and the .dwo file shouldn't have a second CU in it (which it does after this patch) because Split DWARF doesn't have a good way to describe cross-CU references correctly once the CU is packaged in a dwp file. (this behavior is meant to be controlled by the SplitDwarfCrossCUReferences flag.

This does look like a stack trace for D124892.

Right, so this does only reproduce with Split DWARF + Split DWARF inlining + ThinLTO, I think. The CUs for the imported units Split DWARF inlining are being designated as split units, but then split units aren't actually emitted (because of the single-unit limitation of Split DWARF) and so the pubnames are emitted for DIEs that were never layed out/emitted, by the looks of it, so they get an offset of zero, which looks like/creates a truncated pubnames contribution.

Trying to make a small reproducer.

Right, so this does only reproduce with Split DWARF + Split DWARF inlining + ThinLTO, I think. The CUs for the imported units Split DWARF inlining are being designated as split units, but then split units aren't actually emitted (because of the single-unit limitation of Split DWARF) and so the pubnames are emitted for DIEs that were never layed out/emitted, by the looks of it, so they get an offset of zero, which looks like/creates a truncated pubnames contribution.

Trying to make a small reproducer.

OK, think I've got something in IR at least. Looks like it could be reproduced from source + LTO probably. With an assertions-enabled build of LLVM/Clang:

define void @f1() !dbg !13 {
lbl:
  ret void, !dbg !16
}

define void @f2() !dbg !22 {
lbl:
  ret void, !dbg !23
}

!llvm.dbg.cu = !{!0, !2, !10}
!llvm.module.flags = !{!12}

!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !1, emissionKind: FullDebug)
!1 = !DIFile(filename: "a.cc", directory: "")
!2 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !3, emissionKind: FullDebug, imports: !4)
!3 = !DIFile(filename: "b.cc", directory: "")
!4 = !{!5}
!5 = !DIImportedEntity(tag: DW_TAG_imported_declaration, scope: !6, entity: !7)
!6 = !DISubprogram(scope: null, spFlags: DISPFlagOptimized)
!7 = !DINamespace(scope: !2)
!8 = !DISubroutineType(types: !9)
!9 = !{}
!10 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !11, emissionKind: FullDebug)
!11 = !DIFile(filename: "c.cc", directory: "")
!12 = !{i32 2, !"Debug Info Version", i32 3}
!13 = distinct !DISubprogram(scope: null, type: !8, spFlags: DISPFlagDefinition, unit: !0)
!16 = !DILocation(line: 0, scope: !17, inlinedAt: !18)
!17 = distinct !DISubprogram(scope: null, unit: !10)
!18 = !DILocation(line: 0, scope: !21)
!21 = !DILexicalBlockFile(scope: !13, discriminator: 0)
!22 = distinct !DISubprogram(scope: null, type: !8, spFlags: DISPFlagDefinition, unit: !0)
!23 = !DILocation(line: 0, scope: !24, inlinedAt: !25)
!24 = distinct !DISubprogram(scope: null, unit: !2)
!25 = !DILocation(line: 0, scope: !22)
clang -cc1 -emit-obj  -split-dwarf-file x.dwo x.ll

Does that work for you?

Right, so this does only reproduce with Split DWARF + Split DWARF inlining + ThinLTO, I think. The CUs for the imported units Split DWARF inlining are being designated as split units, but then split units aren't actually emitted (because of the single-unit limitation of Split DWARF) and so the pubnames are emitted for DIEs that were never layed out/emitted, by the looks of it, so they get an offset of zero, which looks like/creates a truncated pubnames contribution.

Trying to make a small reproducer.

OK, think I've got something in IR at least. Looks like it could be reproduced from source + LTO probably. With an assertions-enabled build of LLVM/Clang:

define void @f1() !dbg !13 {
lbl:
  ret void, !dbg !16
}

define void @f2() !dbg !22 {
lbl:
  ret void, !dbg !23
}

!llvm.dbg.cu = !{!0, !2, !10}
!llvm.module.flags = !{!12}

!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !1, emissionKind: FullDebug)
!1 = !DIFile(filename: "a.cc", directory: "")
!2 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !3, emissionKind: FullDebug, imports: !4)
!3 = !DIFile(filename: "b.cc", directory: "")
!4 = !{!5}
!5 = !DIImportedEntity(tag: DW_TAG_imported_declaration, scope: !6, entity: !7)
!6 = !DISubprogram(scope: null, spFlags: DISPFlagOptimized)
!7 = !DINamespace(scope: !2)
!8 = !DISubroutineType(types: !9)
!9 = !{}
!10 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !11, emissionKind: FullDebug)
!11 = !DIFile(filename: "c.cc", directory: "")
!12 = !{i32 2, !"Debug Info Version", i32 3}
!13 = distinct !DISubprogram(scope: null, type: !8, spFlags: DISPFlagDefinition, unit: !0)
!16 = !DILocation(line: 0, scope: !17, inlinedAt: !18)
!17 = distinct !DISubprogram(scope: null, unit: !10)
!18 = !DILocation(line: 0, scope: !21)
!21 = !DILexicalBlockFile(scope: !13, discriminator: 0)
!22 = distinct !DISubprogram(scope: null, type: !8, spFlags: DISPFlagDefinition, unit: !0)
!23 = !DILocation(line: 0, scope: !24, inlinedAt: !25)
!24 = distinct !DISubprogram(scope: null, unit: !2)
!25 = !DILocation(line: 0, scope: !22)
clang -cc1 -emit-obj  -split-dwarf-file x.dwo x.ll

Does that work for you?

Yes, I was able to reproduce the issue. Thank you, @dblaikie!

krisb added a comment.May 31 2022, 6:51 AM

Right, so this does only reproduce with Split DWARF + Split DWARF inlining + ThinLTO, I think. The CUs for the imported units Split DWARF inlining are being designated as split units, but then split units aren't actually emitted (because of the single-unit limitation of Split DWARF) and so the pubnames are emitted for DIEs that were never layed out/emitted, by the looks of it, so they get an offset of zero, which looks like/creates a truncated pubnames contribution.

Trying to make a small reproducer.

OK, think I've got something in IR at least. Looks like it could be reproduced from source + LTO probably. With an assertions-enabled build of LLVM/Clang:

define void @f1() !dbg !13 {
lbl:
  ret void, !dbg !16
}

define void @f2() !dbg !22 {
lbl:
  ret void, !dbg !23
}

!llvm.dbg.cu = !{!0, !2, !10}
!llvm.module.flags = !{!12}

!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !1, emissionKind: FullDebug)
!1 = !DIFile(filename: "a.cc", directory: "")
!2 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !3, emissionKind: FullDebug, imports: !4)
!3 = !DIFile(filename: "b.cc", directory: "")
!4 = !{!5}
!5 = !DIImportedEntity(tag: DW_TAG_imported_declaration, scope: !6, entity: !7)
!6 = !DISubprogram(scope: null, spFlags: DISPFlagOptimized)
!7 = !DINamespace(scope: !2)
!8 = !DISubroutineType(types: !9)
!9 = !{}
!10 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !11, emissionKind: FullDebug)
!11 = !DIFile(filename: "c.cc", directory: "")
!12 = !{i32 2, !"Debug Info Version", i32 3}
!13 = distinct !DISubprogram(scope: null, type: !8, spFlags: DISPFlagDefinition, unit: !0)
!16 = !DILocation(line: 0, scope: !17, inlinedAt: !18)
!17 = distinct !DISubprogram(scope: null, unit: !10)
!18 = !DILocation(line: 0, scope: !21)
!21 = !DILexicalBlockFile(scope: !13, discriminator: 0)
!22 = distinct !DISubprogram(scope: null, type: !8, spFlags: DISPFlagDefinition, unit: !0)
!23 = !DILocation(line: 0, scope: !24, inlinedAt: !25)
!24 = distinct !DISubprogram(scope: null, unit: !2)
!25 = !DILocation(line: 0, scope: !22)
clang -cc1 -emit-obj  -split-dwarf-file x.dwo x.ll

Does that work for you?

Okay, I got what's wrong.

auto getCUs = [this](const DIScope *S, DwarfCompileUnit *const *TheCU) {
  if (auto *LScope = dyn_cast_or_null<DILocalScope>(S)) {
    auto CUIt = SPCUsMap.find(LScope->getSubprogram());
    if (CUIt != SPCUsMap.end()) {
      auto &CUs = CUIt->second;
      return llvm::make_range(CUs.begin(), CUs.end());
    }
  }
  return llvm::make_range(TheCU, TheCU + 1);
};

Doing this I was assuming that if SP is not in SPCUsMap, we should fall back to current CU and emit a local entity anyway.
But actually we should skip emitting any local entities if we haven't emitted their parent subprogram by this time. This corresponds to previous behavior of DwarfDebug/DwarfCompileUnit.

I'll update this patch with a fix and a test soon (at least for history).
D125693 doesn't suffer from the same issue, as we can decide whether we need to create a local entity while emitting its parent subprogram (so when we skip emission of a subprogram, we also skip emission of all its local entities).

krisb updated this revision to Diff 433083.May 31 2022, 7:28 AM
  • Skip emission of local entities if its parent is not in SPCUsMap.
  • Skip emission of local entities for CUs if includeMinimalInlineScopes() == true.
  • Add test llvm/test/DebugInfo/Generic/split-dwarf-local-import2.ll.
  • Fix llvm/test/Instrumentation/InstrProfiling/debug-info-correlate.ll

(add DILocaltion to force emission of DISubprogram as we no longer emit local entities (including static locals)
for empty SPs (i.e. w/o location info).

dblaikie added inline comments.May 31 2022, 9:38 AM
llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1405

Rather than filtering here - could we avoid this work by not putting minimal-inline-scope CUs in the SPCUsMap values in the first place?

1408

Wouldn't this return a dangling reference, since CUs is local to this function?

krisb updated this revision to Diff 433181.May 31 2022, 1:55 PM
krisb marked 5 inline comments as done.
  • Apply review comments.
llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1408

Right. Second time 'make_range' makes me thinking it's creating a copy here. My bad. Fixed (might be not the best way & best naming, but should be ok for a patch that, likely, will be abandoned in favor of D125693).

dblaikie added inline comments.Jun 1 2022, 11:17 AM
llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1408

patch that, likely, will be abandoned in favor of D125693

Not sure, I think this patch might still be a good direction - without the IR changes necessary for D125693. Does mean local types that end up unreferenced (because their uses are optimized awya) could be eliminated, which is good and bad (it's something we favor for non-local types for size reasons, but arguably we try more to keep all the local names for scoping/shadowing reasons - though it's not a terribly consistent principle)

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.h
318

Why does this map from one subprogram to multiple CUs? Shouldn't we be only placing local types/static variables in one CU (on the abstract origin SP, which should be singular)?

krisb updated this revision to Diff 434406.Jun 6 2022, 2:27 AM

Add tests:

  • llvm/test/DebugInfo/Generic/split-dwarf-local-import-fulllto.ll (temporary)
  • llvm/test/DebugInfo/Generic/split-dwarf-local-import3.ll
krisb added inline comments.Jun 6 2022, 2:27 AM
llvm/lib/CodeGen/AsmPrinter/DwarfDebug.cpp
1408

The main problem with this patch is that in some cases function-local declarations may be placed incorrectly, and moreover, their placement may not be deterministic, depending on order of functions, as an example.
And unfortunately, I do not see good options to fix/improve this w/o IR changes.

The problem comes from collectLocalScopesWithDeclsFromCU() and a necessity to predict which lexical blocks are non-empty and require to be emitted (so later we can place local declarations to them).

collectLocalScopesWithDeclsFromCU() fills in a map that is a member of DwarfCompileUnit and it does this iff a DwarfCompileUnit is created. So, LocalScopesWithLocalDecls describes only local scopes belong to this CU.

But if to talk about split-dwarf, a subprogram from one CU (source CU) may be inlined into another CU (destination CU). So, when we emit such a subprogram we will check LocalScopesWithLocalDecls of destination CU which knows nothing about local entities for the subprogram. Moreover, source CU for this subprogram may be never emitted (added split-dwarf-local-import3.ll to demonstrate this).

We'd make LocalScopesWithLocalDecls a member of DwarfDebug and fill it in in beginModule(), but this looks a kind of weird.

It'd be good also to address the issue described in collectLocalScopesWithDeclsFromCU() for //used/ types, but it'll require additional computations (like, iterating over all the instructions, find all the used types and record their scopes to be sure later we can place that type correctly).

I do not like all this workarounds including the necessity to track mapping between emitted subprogram DIE and its parent CU. The approach splits emission of local declarations between 3 steps and it's hard to maintain everything in sync.

D125693 requires changes in IR, but proposes a far more cleaner emission flow.
If IR changes from D125691 is something that should be strongly avoided (but I found multiple places where comments mention function-local import should be tracked by a subprogram, not by a compile unit), I'll fix/improve whatever I can for this patch, but I'd like to consider D125693 as a next (second) step then.

llvm/lib/CodeGen/AsmPrinter/DwarfDebug.h
318

We are returning to the question about multiple CUs in a DWO we've discussed a couple months ago. It still isn't clear whether this case is valid/useful for someone (e.g. split-dwarf+FullLTO).

I've temporarily added 'split-dwarf-local-import-fulllto.ll' test that demonstrates this case and backend's behavior before (see CHECK prefix) and after (CHECK-NEW prefix) this patch. (Basically, llvm/test/DebugInfo/X86/split-dwarf-cross-unit-reference.ll also tests this case).

ellis added a comment.Jul 5 2022, 1:55 PM

Seems fine to me, but I'm not quite familiar enough with this area to give this an accept.

llvm/test/Instrumentation/InstrProfiling/debug-info-correlate.ll
23

Why is this test changed?

krisb added inline comments.Jul 11 2022, 12:45 AM
llvm/test/Instrumentation/InstrProfiling/debug-info-correlate.ll
23