Page MenuHomePhabricator

[LLD][COFF] When using LLD-as-a-library, always prevent re-entrance on failures
ClosedPublic

Authored by aganea on Sep 25 2020, 5:56 PM.

Details

Summary

This is a follow-up for D70378 (Cover usage of LLD as a library).

I'm increasingly convinced that it is impossible to support re-entrance once a failure occurs in LLD, even if that currently works on the surface.

While debugging an intermittent failure on a bot, I recalled this scenario which causes the issue:

  1. When executing lld/test/ELF/invalid/symtab-sh-info.s L45, we reach lld::elf::ObjFile::ObjFile() which goes straight into its base ELFFileBase(), then ELFFileBase::init().
  2. At that point fatal() is thrown in lld/ELF/InputFiles.cpp L381, leaving a half-initialized ObjFile instance.
  3. We end up in lld::exitLld() and since we are running with LLD_IN_TEST, we hapily restore the control flow to CrashRecoveryContext::RunSafely() then back in lld::safeLldMain().
  4. Before this patch, we called errorHandler().reset() just after, and this attempted to reset the associated SpecificAlloc<ObjFile<ELF64LE>>. That tried to free the half-initialized ObjFile instance, and more precisely its ObjFile::dwarf member.

Sometimes that worked, sometimes it failed and was catched by the CrashRecoveryContext. This scenario was the reason we called errorHandler().reset() through a CrashRecoveryContext.

But in some rare cases, the above repro somehow corrupted the heap, creating a stack overflow. When the CrashRecoveryContext's filter (that is, __except (ExceptionFilter(GetExceptionInformation()))) tried to handle the exception, it crashed again since the stack was exhausted -- and that took the whole application down. That is the issue seen on the bot. Locally it happens about 1 times out of 15.

Now this situation can happen anywhere in LLD. Since catching stack overflows is not a reliable scenario when using CrashRecoveryContext, we're now preventing further re-entrance when such failures occur, by signaling lld::SafeReturn::canRunAgain=false. When running with LLD_IN_TEST=2, only one iteration will be executed in this case.

Diff Detail

Event Timeline

aganea created this revision.Sep 25 2020, 5:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 25 2020, 5:56 PM
aganea requested review of this revision.Sep 25 2020, 5:56 PM
aganea updated this revision to Diff 294466.Sep 25 2020, 6:00 PM
aganea edited the summary of this revision. (Show Details)

Diff with context.

rnk added a comment.Sep 28 2020, 5:19 PM

I think I'm OK with the functional change, but I think I would prefer not to keep the code that buffers stdout/stderr for the lit test. I think the complexity/size cost outweighs the test benefit.

Since catching stack overflows is not a reliable scenario when using CrashRecoveryContext

I have seen this in a language server. It seems that CrashRecoveryContext does not (correctly) use sigaltstack(). With sigaltstack, we can recover after a stack overflow.
For the stack overflow you observed, how did that happen?

In D88348#2299551, @rnk wrote:

I think I'm OK with the functional change, but I think I would prefer not to keep the code that buffers stdout/stderr for the lit test. I think the complexity/size cost outweighs the test benefit.

I share the same feeling. Hope bufferedStdout can be avoided

llvm/lib/Support/CrashRecoveryContext.cpp
447

This deserves a comment

llvm/lib/Support/Process.cpp
24

_Exit does not require unistd.h

I have seen this in a language server. It seems that CrashRecoveryContext does not (correctly) use sigaltstack(). With sigaltstack, we can recover after a stack overflow.

Good to know at least it could work on Linux. I wanted to activate coverage for this at some point in D74229 but then it wasn't working on Debug builds on Windows. I'll need to get back to it.

For the stack overflow you observed, how did that happen?

The stack overflow happens in the first place somewhere in the Windows CRT while releasing heap blocks. This is rare, as it depends on the random pattern that was in memory at the location of ObjFile::dwarf when fatal() was called by ELFFileBase::init().

In D88348#2299551, @rnk wrote:

I think I'm OK with the functional change, but I think I would prefer not to keep the code that buffers stdout/stderr for the lit test. I think the complexity/size cost outweighs the test benefit.

I share the same feeling. Hope bufferedStdout can be avoided

I'll try to find another way. The issue is that we're blocking stdout/stderr on all iterations, but the last one. Which doesn't work if fatal() occurs in the first iteration (as expected, in a test). Perhaps I could simply set LLD_IN_TEST=1 for tests that expect fatal() or a crash.

aganea updated this revision to Diff 295813.EditedOct 2 2020, 6:24 AM
aganea marked 2 inline comments as done.

Address comments. Removed stdout/stderr buffering, and replaced with env LLD_IN_TEST=1 set appropriately in tests, where needed.

aganea added a comment.Oct 8 2020, 3:49 PM

Ping! Any further comments?

rnk added a comment.Oct 13 2020, 12:52 PM

The code changes seem fine, but I haven't figured out why we need to sprinkle LLD_IN_TEST=1 for every LLD invocation we expect to error. What goes wrong exactly? I thought the point of your change is that, we *won't* re-run LLD when LLD_IN_TEST=2 after it reports an error.

aganea added a comment.EditedOct 13 2020, 2:14 PM
In D88348#2328425, @rnk wrote:

The code changes seem fine, but I haven't figured out why we need to sprinkle LLD_IN_TEST=1 for every LLD invocation we expect to error. What goes wrong exactly? I thought the point of your change is that, we *won't* re-run LLD when LLD_IN_TEST=2 after it reports an error.

When LLD_IN_TEST>0 we only emit stdout/stderr on the last iteration. But if we crash, FileCheck requires for stout & stderr to be emitted before the crash. That was the point for buffering the streams in the previous version of this patch. Either that, or LLD_IN_TEST=1 below. If you think of other ideas, I would be willing to try!

Ping.

I'd like to fix the lld.ELF/invalid::symtab-sh-info.s that fails for a while, and this patch has the fix.

If you think this whole LLD-as-a-library should go into a different direction please let me know.

MaskRay accepted this revision.Fri, Oct 30, 11:40 AM

Have the stdout/stderr changes be removed? The updated diff looks good to me. I agree that in the absence of exceptions (proper stack unwinding), we cannot safely recover from fatal(). For library use cases, if they can ensure no fatal() is called, they are probably still safe.

Marking LLD_IN_TEST=1 seems fine. It serves as a hint that abort() is called. LG with some nits, though I still hope @rnk or @amccarth can take a look.

llvm/lib/Support/CrashRecoveryContext.cpp
445

Nit: newer functions should stick with the naming rule. For Exit, you can keep it capitalized to reduce disruption since you just add a default argument which does not require all call sites to change.

llvm/lib/Support/Process.cpp
101

This needs stdlib.h

This revision is now accepted and ready to land.Fri, Oct 30, 11:40 AM
jmorse added a subscriber: jmorse.Mon, Nov 2, 5:37 AM

Will this land soon? We are eagerly anticipating its arrival :)

Will this land soon? We are eagerly anticipating its arrival :)

Yes, today! :) Let me finish a few things, I'll land it around noon EST.

This revision was landed with ongoing or failed builds.Thu, Nov 12, 5:14 AM
This revision was automatically updated to reflect the committed changes.
aganea marked 2 inline comments as done.