Page MenuHomePhabricator

Fix a crash in lazy loading of Metadata in ThinLTO
AbandonedPublic

Authored by Sunil_Srivastava on Dec 20 2017, 5:43 PM.

Details

Summary

During the lazy loading pass of ThinLTO, if the load of a function body
creates forward references, resolve them right there.

This fixes PR35472.

Diff Detail

Event Timeline

tejohnson added inline comments.
lib/Bitcode/Reader/BitcodeReader.cpp
4620

I think this is just hiding the issue that the below error is trying to catch. That is - we created a forward ref and didn't resolve it as expected during the function parsing.

What created the forward ref? Do you know why it wasn't resolved in the usual way?

The description of r291027 lists some limitations - is there another one that needs to be handled specially (e.g. like the METADATA_GLOBAL_DECL_ATTACHMENT case)?

mehdi_amini added inline comments.Dec 21 2017, 9:20 AM
test/ThinLTO/X86/pr35472-a.ll
132

Can you reduce the test case? And make it very clear *what* is the exact structure that lead to the issue?

See the kind of test here: llvm/trunk/test/ThinLTO/X86/lazyload_metadata.ll

What created the forward ref? Do you know why it wasn't resolved in the usual way?

What is the usual way of resolving forward refs?

I modeled FinishFunctionMetadata code from ..MetadataLoaderImpl::parseMetadats just below it, which calls lazyLoadModuleMetadataBlcok and then resolveForwardRefsandPlaceholders just after that. So I assumed that will be the usual way.

The unique thing in this test case is that a normal function, with debug info, gets inlined inside a nodebug function. The inlined instructions have attached metadata whereas the whole function is still nodebug.

The semantics of this situation is not clearly defined. We could rule that the debug info from inlined code should be stripped., which will avoid this problem in the first case, but I can see that some users may carefully choose where they want debug info and the inlining will foil their attempt then.

I have trimmed the .ll test somewhat, but I am not sure what is the exact structure that is leading to this error.

The error seems to be related to the limitation (2) mentioned in r291027 for function level, and the fix is along the lines of MetadataLoaderImpl::parseMetadats for the global level, i.e, call resolveForwardRefsAndPlaceholders().

I will be happy to hear any suggestion about where to look for more info on this mechanism and how to proceed.

The thing that I do not understand is that, even though the error happens while reading EchoD2, a change in DeltaD2 avoid the error.

Specifically, removing the getelementptr instr from DeltaD2 avoids the error, even though the error in the failing run occurs before reading DeltaD2.

Under debugger readers of these two differing version follow widely different paths.

Are there any kind of internal documentation of notes describing what lazy loading is doing precisely?

test/ThinLTO/X86/pr35472-a.ll
132

There is the smallest C++ test case that I could come up with, but I have not attempted to reduce the IR. I will try that.

The exact structure leading to the error is inlining of a normal (not nodebug) function inside a nodug function. That leads to an instruction with attached debuginfo inside an otherwise nodebug function.

Any chance this gets fixed soon? I'm hitting the same problem.

vsk added a subscriber: vsk.Oct 15 2018, 1:39 PM
vsk added a comment.Oct 15 2018, 1:50 PM

Hi @tejohnson @Sunil_Srivastava, just a heads-up that with https://reviews.llvm.org/rL344545 applied, I'm hitting "Invalid function metadata: outgoing forward refs" in an internal project with ThinLTO and hot/cold splitting enabled. There's some sort of problem caused by having an outlined function with debug locations in it but no DISubprogram. I'll try to dig a bit deeper.

Can we land this patch

Can we land this patch

Let me take a look today. As noted in my earlier comment, this still seems like a hammer to fix something that was intended to be handled earlier by the MDLoader. I'd like to understand why it didn't work as intended.

Can we land this patch

Let me take a look today. As noted in my earlier comment, this still seems like a hammer to fix something that was intended to be handled earlier by the MDLoader. I'd like to understand why it didn't work as intended.

FWIW I'm concerned about this patch too. I think we need to understand the issue (and as per Mehdi's comments, understand the exact structure leading to the issue) before committing a fix.

Can we land this patch

Let me take a look today. As noted in my earlier comment, this still seems like a hammer to fix something that was intended to be handled earlier by the MDLoader. I'd like to understand why it didn't work as intended.

FWIW I'm concerned about this patch too. I think we need to understand the issue (and as per Mehdi's comments, understand the exact structure leading to the issue) before committing a fix.

Quick note that I have been looking at this and have a rough idea of what is going on and another possible fix. Needs some more investigation tomorrow.

kristina added a subscriber: kristina.EditedOct 22 2018, 10:36 PM

Just a small nitpick, please remove the svn:executable property on test/ThinLTO/X86/pr35472-a.ll if you ever get this to a point where it's approved and committed, some buildbots get upset if it's present and will fail while updating their checkouts.

Ok here's what is going on.

The scope of the DILocation in _ZN4EchoD2Ev has a scope that is a DILexicalBlock. In the test case, because that same DILexicalBlock is used as the scope for DILocation from two different routines (_ZN4EchoD2Ev and _ZN5DeltaD2Ev), it lives in the MODULE's metadata block. Therefore, during importing, we try to load it lazily. We parse the debug location when from BitcodeReader::parseFunction, in the FUNC_CODE_DEBUG_LOC case. This case invokes MDLoader->getMDNodeFwdRefOrNull() for the scope and inlinedAt fields, which creates a forward ref if it hasn't been loaded already, which it hasn't since it lives in the lazily loaded Module level metadata.

Most places in that invoke getMDNodeFwdRefOrNull have a corresponding call to resolveForwardRefsAndPlaceholders which will take care of resolving them. E.g. places that call getMetadataFwdRefOrLoad, or at the end of parsing a function-level metadata block, or at the end of the initial lazy load of module level metadata in order to handle invocations of getMDNodeFwdRefOrNull for named metadata and global object attachments. However, the calls for the scope/inlinedAt of debug locations are not backed by any such call to resolveForwardRefsAndPlaceholders.

The reason the problem goes away as noted by the patch author when the getelementptr in _ZN5DeltaD2Ev is removed (actually, all it takes is removing the !dbg attachment there), is that the DILexicalBlock then lives in the function metadata block for _ZN4EchoD2Ev since it is only referenced by one function. The function metadata block as mentioned above is loaded eagerly.

I'm not sure why we haven't hit this too often historically (I put in an assert where we get the scope/inlinedAt that it isn't a temporary and it isn't hit at all for the current llvm regression suite), but I am guessing that the reason we suddenly started hitting with ThinLTO+hotcold function splitting is that more scopes are being referenced by multiple functions (part in the hot split func and part in the cold split func) and therefore moving to the lazily loaded module level metadata. This is likely also happening because splitting is/was happening in the pre-link compile step before importing (this was fixed for the old PM in D53437 but as I noted there is not yet fixed for the new PM).

The fix I tried that works both for this test case and for the large app where I hit it after enabling hot cold splitting with ThinLTO is to change the scope and inlinedAt handling to instead invoke getMetadataFwdRefOrLoad. If this seems like a reasonable fix (I am not sure why it wouldn't be), I can send a new patch with that fix. Here's the fix that worked for me:

  • a/lib/Bitcode/Reader/BitcodeReader.cpp

+++ b/lib/Bitcode/Reader/BitcodeReader.cpp
@@ -3520,12 +3520,12 @@ Error BitcodeReader::parseFunctionBody(Function *F) {

MDNode *Scope = nullptr, *IA = nullptr;
if (ScopeID) {
  • Scope = MDLoader->getMDNodeFwdRefOrNull(ScopeID - 1);

+ Scope = dyn_cast_or_null<MDNode>(MDLoader->getMetadataFwdRefOrLoad(ScopeID - 1));

  if (!Scope)
    return error("Invalid record");
}
if (IAID) {
  • IA = MDLoader->getMDNodeFwdRefOrNull(IAID - 1);

+ IA = dyn_cast_or_null<MDNode>(MDLoader->getMetadataFwdRefOrLoad(IAID - 1));

  if (!IA)
    return error("Invalid record");
}

I mailed D53596 with the new proposed fix.

Ping - can someone review/approve this? Suggestions on who else to review?

Ping - can someone review/approve this? Suggestions on who else to review?

Nevermind - pinged the wrong one, meant to ping follow on D53596