This is an archive of the discontinued LLVM Phabricator instance.

[dsymutil] Fix assertion: (Ref > InputDIE.getOffset()), function cloneDieReferenceAttribute
ClosedPublic

Authored by JDevlieghere on Nov 16 2022, 7:05 PM.

Details

Summary

D125469 changed the assumptions regarding forward references. Before the patch, a reference to an uncloned DIE was always a forward reference. After the change, that assumption no longer holds. It was enforced by the assertion in DWARFLinker::DIECloner::cloneDieReferenceAttribute (DWARFLinker.cpp:937):

assert(Ref > InputDIE.getOffset());

This patch is relatively straightforward: it adds a field to the DIEInfo to remember the DIE is a forward or backward reference. We then treat backward references like a forward references by fixing them up after the fact, when all DIEs have been cloned.

We tripped this assertion for an internal project built with LTO. Unfortunately that means I'm not able to share a test case. I've spent a few hours trying to craft a test case that would hit this case but didn't succeed. Getting a backward reference is easy, but a backward reference to a DIE that hasn't been cloned yet is not something I managed to reproduce. I'd love to have a test case to avoid regressing this in the future, so please let me know if you have an idea on how to trigger this behavior from an artificial test case.

rdar://102148054

Diff Detail

Event Timeline

JDevlieghere created this revision.Nov 16 2022, 7:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 16 2022, 7:05 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
JDevlieghere requested review of this revision.Nov 16 2022, 7:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 16 2022, 7:05 PM
JDevlieghere edited the summary of this revision. (Show Details)Nov 16 2022, 7:05 PM

Remove unrelated changes

friss added inline comments.Nov 16 2022, 7:11 PM
llvm/include/llvm/DWARFLinker/DWARFLinkerCompileUnit.h
81–82

When I read the comment, I was confused as I thought the flag was used to differentiate between forward and backward references. But in fact it's just signaling a reference to a DIE that has not been emitted yet. Maybe tweak the comment.

llvm/lib/DWARFLinker/DWARFLinker.cpp
961–962

Should we change the name of this function now that it's not only about forward references.

JDevlieghere marked 2 inline comments as done.

Address @friss' feedback.

avl added inline comments.Nov 17 2022, 8:46 AM
llvm/include/llvm/DWARFLinker/DWARFLinkerCompileUnit.h
82

Probably name it UnclonedReference ? or RefShouldBePatched?(or something like that)

llvm/lib/DWARFLinker/DWARFLinkerCompileUnit.cpp
79

/// Keep track of a cross-cu reference from this unit

JDevlieghere marked 2 inline comments as done.

Address @avl's feedback

avl accepted this revision.Nov 17 2022, 12:44 PM

LGTM

This revision is now accepted and ready to land.Nov 17 2022, 12:44 PM
avl requested changes to this revision.Nov 18 2022, 8:02 AM

I think we need to create a test case. I tried(still trying) to create one, but it seems this assertions should always be satisfied : "assert(Ref > InputDIE.getOffset());". That might be an error if referenced DIE is not cloned.

This revision now requires changes to proceed.Nov 18 2022, 8:02 AM
avl added a comment.Nov 20 2022, 2:47 AM

@JDevlieghere Is this assertion happened in single thread mode only, or multi-thread mode only, or both?

That might be an error if referenced DIE is not cloned.

@friss and I were starting to wonder the same thing. I was out last week because of Thanksgiving but I'll dig into this a bit more this week.

@JDevlieghere Is this assertion happened in single thread mode only, or multi-thread mode only, or both?

This is happening in dsymutil which runs in "lockstep mode" where analyze and clone happen sequentially per CU (but we start analyzing the next CU while we're cloning the previous one). It definitely sounds like that could have something to do with it.

Last week I debugged this further. Here's what I believe is going on. Consider the following simplified example with two CUs and a type Foo.

0x01: DW_TAG_compile_unit

0x02: DW_TAG_class_type
        DW_AT_containing_type (0x11)
        DW_AT_name            ("Foo")

0x03: DW_TAG_pointer_type
        DW_AT_type            (0x02 "Foo") 

...

0x10: DW_TAG_compile_unit

0x11: DW_TAG_class_type
       DW_AT_containing_type (0x11)
       DW_AT_name            ("Foo")

...

Assume that in the first compile unit, we have a relocation that causes us to keep the pointer to foo.

During the analysis phase, when looking for DIEs to keep, we encounter the definition of Foo. Through an ODR attribute (in this case DW_AT_containing_type) we see the same definition of Foo in the second CU. We decide to keep the definition in CU2 (0x11) and mark the decl context as having a canonical ODR DIE. As a result, we don't keep the definition in CU1.

During the emission phase, we encounter the pointer to foo again (0x03). As part of cloning, we go over its attributes and encounter 0x02 through the DW_AT_type. The canonical DIE in CU2 hasn't been cloned yet, so the RefInfo.Clone is false and so is the DeclContext's canonical DIE offset. As a result, we trip off the assertion because the reference (0x2) is less than the input DIEs offset (0x3) and therefore should already have been emitted. Instead, we should note the (forward) reference and fix it up later, which is what this patch does.

In the patch I called it a backward reference, because the original DIE comes before the DIE being emitted now, but as shown here, it really is a forward reference, so maybe it's better to keep the original naming.

This reproduces for several internal projects. The smallest one is an LTO build with 1.8 million lines of IR. I've been trying to reduce it with llvm-reduce, delta and bugpoint since last Thursday, but all of these tools are struggling with DI metadata and the size of the file. I'll give it another day or so but I'm not holding my breath this will generate anything that can be used as a test.

With the reduction running in the background, I've been trying to hand-craft a test that exposes the behavior I've described here, but with every DIE it becomes exponentially harder to trick dsymutil's algorithm into traversing the DIEs in the order you want.

Revert to old naming

avl added a comment.Dec 15 2022, 4:46 AM

Last week I debugged this further. Here's what I believe is going on. Consider the following simplified example with two CUs and a type Foo.

0x01: DW_TAG_compile_unit

0x02: DW_TAG_class_type
        DW_AT_containing_type (0x11)
        DW_AT_name            ("Foo")

0x03: DW_TAG_pointer_type
        DW_AT_type            (0x02 "Foo") 

...

0x10: DW_TAG_compile_unit

0x11: DW_TAG_class_type
       DW_AT_containing_type (0x11)
       DW_AT_name            ("Foo")

...

Assume that in the first compile unit, we have a relocation that causes us to keep the pointer to foo.

During the analysis phase, when looking for DIEs to keep, we encounter the definition of Foo. Through an ODR attribute (in this case DW_AT_containing_type) we see the same definition of Foo in the second CU. We decide to keep the definition in CU2 (0x11) and mark the decl context as having a canonical ODR DIE. As a result, we don't keep the definition in CU1.

During the emission phase, we encounter the pointer to foo again (0x03). As part of cloning, we go over its attributes and encounter 0x02 through the DW_AT_type. The canonical DIE in CU2 hasn't been cloned yet, so the RefInfo.Clone is false and so is the DeclContext's canonical DIE offset. As a result, we trip off the assertion because the reference (0x2) is less than the input DIEs offset (0x3) and therefore should already have been emitted. Instead, we should note the (forward) reference and fix it up later, which is what this patch does.

I tryed to create a test according to this description. But, in its current state it does not lead to the assertion. So the real case, is probable more complicated.
In this example the type from the first compile unit is always cloned.

In the patch I called it a backward reference, because the original DIE comes before the DIE being emitted now, but as shown here, it really is a forward reference, so maybe it's better to keep the original naming.

This reproduces for several internal projects. The smallest one is an LTO build with 1.8 million lines of IR. I've been trying to reduce it with llvm-reduce, delta and bugpoint since last Thursday, but all of these tools are struggling with DI metadata and the size of the file. I'll give it another day or so but I'm not holding my breath this will generate anything that can be used as a test.

With the reduction running in the background, I've been trying to hand-craft a test that exposes the behavior I've described here, but with every DIE it becomes exponentially harder to trick dsymutil's algorithm into traversing the DIEs in the order you want.

Though I think it would be useful to finally have a test case for this situation, it looks OK to approve this patch. If there would be reference to the uncloned DIE then this assertion should catch it:

void CompileUnit::fixupForwardReferences() {
  for (const auto &Ref : ForwardDIEReferences) {
    DIE *RefDie;
    const CompileUnit *RefUnit;
    PatchLocation Attr;
    DeclContext *Ctxt;
    std::tie(RefDie, RefUnit, Ctxt, Attr) = Ref;
    if (Ctxt && Ctxt->hasCanonicalDIE()) {
      assert(Ctxt->getCanonicalDIEOffset() &&
             "Canonical die offset is not set");   <<<<<<<<<<<<<<<<
      Attr.set(Ctxt->getCanonicalDIEOffset());
    } else
      Attr.set(RefDie->getOffset() + RefUnit->getStartOffset());
  }
}
avl accepted this revision.Dec 15 2022, 4:48 AM

LGTM

This revision is now accepted and ready to land.Dec 15 2022, 4:48 AM

Last week I debugged this further. Here's what I believe is going on. Consider the following simplified example with two CUs and a type Foo.

0x01: DW_TAG_compile_unit

0x02: DW_TAG_class_type
        DW_AT_containing_type (0x11)
        DW_AT_name            ("Foo")

0x03: DW_TAG_pointer_type
        DW_AT_type            (0x02 "Foo") 

...

0x10: DW_TAG_compile_unit

0x11: DW_TAG_class_type
       DW_AT_containing_type (0x11)
       DW_AT_name            ("Foo")

...

Assume that in the first compile unit, we have a relocation that causes us to keep the pointer to foo.

During the analysis phase, when looking for DIEs to keep, we encounter the definition of Foo. Through an ODR attribute (in this case DW_AT_containing_type) we see the same definition of Foo in the second CU. We decide to keep the definition in CU2 (0x11) and mark the decl context as having a canonical ODR DIE. As a result, we don't keep the definition in CU1.

During the emission phase, we encounter the pointer to foo again (0x03). As part of cloning, we go over its attributes and encounter 0x02 through the DW_AT_type. The canonical DIE in CU2 hasn't been cloned yet, so the RefInfo.Clone is false and so is the DeclContext's canonical DIE offset. As a result, we trip off the assertion because the reference (0x2) is less than the input DIEs offset (0x3) and therefore should already have been emitted. Instead, we should note the (forward) reference and fix it up later, which is what this patch does.

I tryed to create a test according to this description. But, in its current state it does not lead to the assertion. So the real case, is probable more complicated.
In this example the type from the first compile unit is always cloned.

Yes, I tried as well and I wasn't able to trick the algorithm into keeping the the type from the second CU.

In the patch I called it a backward reference, because the original DIE comes before the DIE being emitted now, but as shown here, it really is a forward reference, so maybe it's better to keep the original naming.

This reproduces for several internal projects. The smallest one is an LTO build with 1.8 million lines of IR. I've been trying to reduce it with llvm-reduce, delta and bugpoint since last Thursday, but all of these tools are struggling with DI metadata and the size of the file. I'll give it another day or so but I'm not holding my breath this will generate anything that can be used as a test.

With the reduction running in the background, I've been trying to hand-craft a test that exposes the behavior I've described here, but with every DIE it becomes exponentially harder to trick dsymutil's algorithm into traversing the DIEs in the order you want.

Though I think it would be useful to finally have a test case for this situation, it looks OK to approve this patch. If there would be reference to the uncloned DIE then this assertion should catch it:

void CompileUnit::fixupForwardReferences() {
  for (const auto &Ref : ForwardDIEReferences) {
    DIE *RefDie;
    const CompileUnit *RefUnit;
    PatchLocation Attr;
    DeclContext *Ctxt;
    std::tie(RefDie, RefUnit, Ctxt, Attr) = Ref;
    if (Ctxt && Ctxt->hasCanonicalDIE()) {
      assert(Ctxt->getCanonicalDIEOffset() &&
             "Canonical die offset is not set");   <<<<<<<<<<<<<<<<
      Attr.set(Ctxt->getCanonicalDIEOffset());
    } else
      Attr.set(RefDie->getOffset() + RefUnit->getStartOffset());
  }
}

Yup, exactly.

avl added a comment.Dec 16 2022, 9:47 AM

Looks like I have a test case for this problem:

compile units cross reference each other in this test case:

CU1:
  0x10 type 1
  ref to 0x40
CU2:
  0x40 type 1
  ref to 0x10
avl added a comment.Dec 16 2022, 11:15 AM

Since this patch extends number of cases when uncloned dies are created, it is worth to add following check:

void CompileUnit::fixupForwardReferences() {
  for (const auto &Ref : ForwardDIEReferences) {
    DIE *RefDie;
    const CompileUnit *RefUnit;
    PatchLocation Attr;
    DeclContext *Ctxt;
    std::tie(RefDie, RefUnit, Ctxt, Attr) = Ref;
    if (Ctxt && Ctxt->hasCanonicalDIE()) {
      assert(Ctxt->getCanonicalDIEOffset() &&
             "Canonical die offset is not set");
      Attr.set(Ctxt->getCanonicalDIEOffset());
    } else {
      assert(RefDie->getOffset() && "Referenced die offset is not set");    <<<<<<<<<<<<<<<<<<<<
      Attr.set(RefDie->getOffset() + RefUnit->getStartOffset());
    }
  }
}
JDevlieghere added a comment.EditedDec 16 2022, 11:25 AM

Looks like I have a test case for this problem:

That's awesome, thank you for spending time on that!

I'll add the assert you suggested to this patch. Do you want to commit the test yourself or can I include it in my commit (I don't want to take credit for your hard work, but also don't want to give you more work :-)

avl added a comment.Dec 16 2022, 11:32 AM

I'll add the assert you suggested to this patch. Do you want to commit the test yourself or can I include it in my commit (I don't want to take credit for your hard work, but also don't want to give you more work :-)

thank you for asking, include it to your commit please.

This revision was automatically updated to reflect the committed changes.