This is an archive of the discontinued LLVM Phabricator instance.

[ThinLTO] Fix a crash in lazy loading of Metadata
ClosedPublic

Authored by tejohnson on Oct 23 2018, 11:44 AM.

Details

Summary

This is a revised version of D41474.

When the debug location is parsed in BitcodeReader::parseFunction, the
scope and inlinedAt MDNodes are obtained via MDLoader->getMDNodeFwdRefOrNull(),
which will create a forward ref if they were not yet loaded.
Specifically, if one of these MDNodes is in the module level metadata
block, and this is during ThinLTO importing, that metadata block is
lazily loaded.

Most places in that invoke getMDNodeFwdRefOrNull have a corresponding call
to resolveForwardRefsAndPlaceholders which will take care of resolving them.
E.g. places that call getMetadataFwdRefOrLoad, or at the end of parsing a
function-level metadata block, or at the end of the initial lazy load of
module level metadata in order to handle invocations of getMDNodeFwdRefOrNull
for named metadata and global object attachments. However, the calls for
the scope/inlinedAt of debug locations are not backed by any such call to
resolveForwardRefsAndPlaceholders.

To fix this, change the scope and inlinedAt parsing to instead use
getMetadataFwdRefOrLoad, which will ensure the forward refs to lazily
loaded metadata are resolved.

Fixes PR35472.

Diff Detail

Event Timeline

tejohnson created this revision.Oct 23 2018, 11:44 AM

Added part of test case missed in first version of patch.

Harbormaster completed remote builds in B24096: Diff 170714.
vsk added a comment.Oct 23 2018, 11:52 AM

Thanks for the patch! Your explanation makes sense to me, but I'm not familiar enough with this code to give a +1. @steven_wu, any thoughts on this?

This revision was not accepted when it landed; it landed in state Needs Review.Oct 23 2018, 3:59 PM
This revision was automatically updated to reflect the committed changes.

Yikes! Accidental commit! Reverting...

tejohnson reopened this revision.Oct 23 2018, 4:03 PM

Reverted in r345097.

In D53596#1272986, @vsk wrote:

Thanks for the patch! Your explanation makes sense to me, but I'm not familiar enough with this code to give a +1. @steven_wu, any thoughts on this?

Looks fine but I don't know enough about debug info to comment on this.

ping - @dexonsmith , can you take a look?

vsk added a comment.Oct 31 2018, 7:09 AM

Hi Teresa, with this patch applied, I do still see a (possibly unrelated) crash when building an internal framework with ThinLTO + hot/cold splitting. I'm not sure what the best way is to prepare a reproducer (please let me know), but I've tried to collect some output from the debugger:

In D53596#1282167, @vsk wrote:

Hi Teresa, with this patch applied, I do still see a (possibly unrelated) crash when building an internal framework with ThinLTO + hot/cold splitting. I'm not sure what the best way is to prepare a reproducer (please let me know), but I've tried to collect some output from the debugger:

Doesn't look like this one has anything to do with the issue I'm fixing here. The stack trace shows it is coming from the call to materializeMetadata() from FunctionImport.cpp, and that is called before we invoke the IRMover (which is where we encountered the code with the bug, and the function changed by my patch). The code changed in this patch is not reached via materializeMetadata.

Can you file a new bug for the problem you encountered, along with some kind of reproducer? Looks like it might be hitting infinite, or at least very deep, recursion and running out of stack.

But what version are you on? The stack trace says the call to materializeMetadata is on line 765 of FunctionImport.cpp, and it is currently nowhere near that location and hasn't been in awhile.

vsk added a comment.Oct 31 2018, 1:19 PM
In D53596#1282167, @vsk wrote:

Hi Teresa, with this patch applied, I do still see a (possibly unrelated) crash when building an internal framework with ThinLTO + hot/cold splitting. I'm not sure what the best way is to prepare a reproducer (please let me know), but I've tried to collect some output from the debugger:

Doesn't look like this one has anything to do with the issue I'm fixing here. The stack trace shows it is coming from the call to materializeMetadata() from FunctionImport.cpp, and that is called before we invoke the IRMover (which is where we encountered the code with the bug, and the function changed by my patch). The code changed in this patch is not reached via materializeMetadata.

Can you file a new bug for the problem you encountered, along with some kind of reproducer? Looks like it might be hitting infinite, or at least very deep, recursion and running out of stack.

But what version are you on? The stack trace says the call to materializeMetadata is on line 765 of FunctionImport.cpp, and it is currently nowhere near that location and hasn't been in awhile.

Thanks for looking. I'm using AppleClang-1000 (derived from llvm-6.0), as it's the version currently used to build our OSes. I'll work on finding a reproducer against the top-of-tree compiler.

Ping - can someone review/approve this? Suggestions on who else to review?

steven_wu accepted this revision.Nov 5 2018, 11:02 AM

Oh, I see. This is not really related to the debug information quality. It is a pure metadata lazy loading problem. LGTM.

The only suggestion I have (maybe can be done in a separate commit) is that none of the call site of getMDNodeFwdRefOrNull are expecting null as a valid output because they all check for nullptr and error out in some way. Maybe this function should really be:

Error<MDNode*> getMDNodeFwdRef(unsigned);

And it can return a better error message.

This revision is now accepted and ready to land.Nov 5 2018, 11:02 AM

Hmm. I remember writing this code. I have a faint recollection of proving at the time that these could never be forward refs, and that it was important for bitcode reading performance.

Hmm. I remember writing this code. I have a faint recollection of proving at the time that these could never be forward refs, and that it was important for bitcode reading performance.

Sorry, I neglected to say anything actionable.

  • If any bitcode writer in a released Clang has written forward refs for scope/inline-at, then clearly we need to handle this in the reader. But if that's not the case, perhaps we can instead change the writer to make it impossible again. (I think I might have intentionally used dyn_cast instead of dyn_cast_or_null as an assertion in the reader.)
  • Either way, you might look at a bitcode-reading profile and see if this is still something to be careful about for compile-time.
  • Note: I wrote that code way back before we had lazy loading of metadata and I don't remember if/how things changed there.

Hmm. I remember writing this code. I have a faint recollection of proving at the time that these could never be forward refs, and that it was important for bitcode reading performance.

Sorry, I neglected to say anything actionable.

  • If any bitcode writer in a released Clang has written forward refs for scope/inline-at, then clearly we need to handle this in the reader. But if that's not the case, perhaps we can instead change the writer to make it impossible again. (I think I might have intentionally used dyn_cast instead of dyn_cast_or_null as an assertion in the reader.)

It looks like this code was already handling forward refs for the scope/inline-at, since it was calling getMDNodeFwdRefOrNull - so is it possible that changed at some point?

  • Either way, you might look at a bitcode-reading profile and see if this is still something to be careful about for compile-time.
  • Note: I wrote that code way back before we had lazy loading of metadata and I don't remember if/how things changed there.

For lazy loading of metadata, I don't think it matters whether it is a forward reference originally, we won't load it until we need it.

I don't think the patch changes whether it handles forward references, or am I misunderstanding how the code currently works?

Hmm. I remember writing this code. I have a faint recollection of proving at the time that these could never be forward refs, and that it was important for bitcode reading performance.

Sorry, I neglected to say anything actionable.

  • If any bitcode writer in a released Clang has written forward refs for scope/inline-at, then clearly we need to handle this in the reader. But if that's not the case, perhaps we can instead change the writer to make it impossible again. (I think I might have intentionally used dyn_cast instead of dyn_cast_or_null as an assertion in the reader.)

It looks like this code was already handling forward refs for the scope/inline-at, since it was calling getMDNodeFwdRefOrNull - so is it possible that changed at some point?

  • Either way, you might look at a bitcode-reading profile and see if this is still something to be careful about for compile-time.
  • Note: I wrote that code way back before we had lazy loading of metadata and I don't remember if/how things changed there.

For lazy loading of metadata, I don't think it matters whether it is a forward reference originally, we won't load it until we need it.

I don't think the patch changes whether it handles forward references, or am I misunderstanding how the code currently works?

Ping on this comment/question - I don't think I am changing anything here on the handling of forward references in the non-lazy loading case, and I believe the need to handle them as "forward refs" is inherent in the lazy loading.

Sorry, I missed your questions. Thanks for pinging.

I wasn’t concerned about this patch going in. I was just concerned we may have had compile time regressions already. Metadata forward references are quite expensive to track and resolve.

But I think I’d misread the patch entirely. Is see this is just loosening to Metadata .

  • Am I correct that somehow the scope field is not an MDNode? The verifier should fail for that, unless something has changed. @aprantl can you take a quick look?
  • Can we have a positive test for what the debug info should like like here, rather than just a crash test?
dexonsmith resigned from this revision.Nov 14 2018, 11:00 AM

Sorry, I missed your questions. Thanks for pinging.

I wasn’t concerned about this patch going in. I was just concerned we may have had compile time regressions already. Metadata forward references are quite expensive to track and resolve.

But I think I’d misread the patch entirely. Is see this is just loosening to Metadata .

I misread the code in a different way. I don’t have time this week to really understand this so I’m resigning as a reviewer. I don’t want to hold up the fix and Steven already reviewed this.

  • Am I correct that somehow the scope field is not an MDNode? The verifier should fail for that, unless something has changed. @aprantl can you take a quick look?
  • Can we have a positive test for what the debug info should like like here, rather than just a crash test?

I’m still a bit concerned about the test being debug info related and not testing the debug info output at all, but if @aprantl has comments there they can probably be dealt with post-commit.

My thought is that this patch should make bitcode reader more robust when lazy loading, which is always a good thing. If this is a performance regression, the regression is coming from how debug information is generated. If there are no such forward-ref in the metadata, there is no slowdown with this patch. We can investigate performance regression post commit if needed.

My thought is that this patch should make bitcode reader more robust when lazy loading, which is always a good thing. If this is a performance regression, the regression is coming from how debug information is generated. If there are no such forward-ref in the metadata, there is no slowdown with this patch. We can investigate performance regression post commit if needed.

Right and note that if we are not doing lazy metadata loading, there is essentially no change here. Before, the code called MDLoader->getMDNodeFwdRefOrNull(), which boils down to dyn_cast_or_null<MDNode>(getMetadataFwdRef(Idx)). With this change we instead call MDLoader->getMetadataFwdRefOrLoad, and if there is no lazy loading enabled this will end up returning getMetadataFwdRef(ID), to which the caller is now applying the dyn_cast_or_null<MDNode>.

Steven - you had a suggestion around callers to getMDNodeFwdRefOrNull, but after this change there will be no more callers to that. So should I go ahead and submit this one, then remove that interface completely in a follow up? Or do it in this patch?

Steven - you had a suggestion around callers to getMDNodeFwdRefOrNull, but after this change there will be no more callers to that. So should I go ahead and submit this one, then remove that interface completely in a follow up? Or do it in this patch?

Doesn't matter. Follow up is perfectly fine.

Steven - you had a suggestion around callers to getMDNodeFwdRefOrNull, but after this change there will be no more callers to that. So should I go ahead and submit this one, then remove that interface completely in a follow up? Or do it in this patch?

Doesn't matter. Follow up is perfectly fine.

Ok, retesting this one at HEAD, then will remove that interface in a follow on NFC change.

tejohnson closed this revision.Nov 14 2018, 1:09 PM

My commit message lost the phabricator link, so this didn't get auto-closed. Closed by commit r346891.