Page MenuHomePhabricator

[Support] make report_fatal_error `abort` instead of `exit`
ClosedPublic

Authored by ychen on Sep 20 2019, 10:59 AM.

Details

Summary

This patch could be treated as a rebase of D33960. It also fixes PR35547.
A fix for llvm/test/Other/close-stderr.ll is proposed in D68164. Seems
the consensus is that the test is passing by chance and I'm not
sure how important it is for us. So it is removed like in D33960 for now.
The rest of the test fixes are just adding --crash flag to not tool.

  • The reason it fixes PR35547 is

exit does cleanup including calling class destructor whereas abort
does not do any cleanup. In multithreading environment such as ThinLTO or JIT,
threads may share states which mostly are ManagedStatic<>. If faulting thread
tearing down a class when another thread is using it, there are chances of
memory corruption. This is bad 1. It will stop error reporting like pretty
stack printer; 2. The memory corruption is distracting and nondeterministic in
terms of error message, and corruption type (depending one the timing, it
could be double free, heap free after use, etc.).

Diff Detail

Event Timeline

ychen created this revision.Sep 20 2019, 10:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 20 2019, 10:59 AM
rnk added a subscriber: MatzeB.Sep 20 2019, 11:48 AM

This has been proposed before by @MatzeB:
https://reviews.llvm.org/D33960

At the time, there was an issue that some people had been "fixing" fuzzer bugs by replacing asserts with if (cond) report_fatal_error(...);. Moving to abort will reopen all of those fuzzer bugs. However, I think we have consensus that that is the desired behavior: every call to report_fatal_error is essentially a bug,. Code should be using the error handlers on MCContext and LLVMContext instead and recovering.

ychen added a comment.Sep 20 2019, 5:17 PM
In D67847#1677251, @rnk wrote:

This has been proposed before by @MatzeB:
https://reviews.llvm.org/D33960

At the time, there was an issue that some people had been "fixing" fuzzer bugs by replacing asserts with if (cond) report_fatal_error(...);. Moving to abort will reopen all of those fuzzer bugs. However, I think we have consensus that that is the desired behavior: every call to report_fatal_error is essentially a bug,. Code should be using the error handlers on MCContext and LLVMContext instead and recovering.

Thanks for the pointer, Reid. It feels to me that both versions of report_fatal_error have their practical value in the codebase. If the only known value of the exit version is to make the fuzzer happy, should we just make fuzzer not report abort version of report_fatal_error? This way we could encourage the correct error handling and not miss anything from the fuzzer side?

PS: I think ThinLTO is another case of should not use report_fatal_error in the first place.

The main alternative worth mentioning is _exit, which LLD already uses for the same reason you mention here.

Honestly, I think there is consensus that calling abort here is the right thing and we should just do it. It's not like LLVM is already fuzzer clean, and if this does reopen fuzzer bugs, it's good to fix them properly anyway. :)

I mostly wanted to link back to the past attempt to do this to get the reviewers involved in that to +1 this new attempt. I guess I'll @ some more: @davide, @filcab, @mehdi_amini

LGTM: report_fatal_error() should not be considered like a "normal way" of aborting the compilation flow, crashing is never a good thing: in general I'm in favor of better error propagation (we claim that LLVM is a library framework right?)

This sounds fine to me. We should probably replace most of report_fatal_error with llvm::Error and propagate them correctly. In the meantime, having better crash is an improvement and does not conflict with what we want to do eventually.

MaskRay accepted this revision.EditedSep 30 2019, 10:36 PM
MaskRay added a subscriber: MaskRay.

This also fixes a C-compliance problem. report_fatal_error() may be called by a destructor, registered by atexit(). C11 says "If a program calls the exit function more than once, or calls the quick_exit function in addition to the exit function, the behavior is undefined."

As noted at https://reviews.llvm.org/D68164#1689155, close-stderr.ll should be updated to use not --crash opt --reject-this-option 2>&-. Don't rely on the exit code.

This revision is now accepted and ready to land.Sep 30 2019, 10:36 PM
ychen updated this revision to Diff 222734.Oct 1 2019, 5:33 PM

Fix tests

Herald added a project: Restricted Project. · View Herald Transcript
ychen edited the summary of this revision. (Show Details)Oct 2 2019, 11:11 AM

The abort() function raises SIGABRT, for which the default behavior is to trigger a coredump. Do we actually want that behavior?

Either _exit() (long available extension, which lld already uses) or quick_exit() (the new C standard way) seem possibly preferable?

ychen added a comment.Oct 2 2019, 2:04 PM

The abort() function raises SIGABRT, for which the default behavior is to trigger a coredump. Do we actually want that behavior?

Either _exit() (long available extension, which lld already uses) or quick_exit() (the new C standard way) seem possibly preferable?

I think this patch is more for the purpose of making error handling better than for fixing PR35547. Fixing PR35547 is probably just a side effect. Apologies that I should have updated the summary to reflect that.

ychen edited the summary of this revision. (Show Details)Oct 2 2019, 4:35 PM
ychen edited the summary of this revision. (Show Details)
rnk accepted this revision.Oct 3 2019, 2:51 PM

The abort() function raises SIGABRT, for which the default behavior is to trigger a coredump. Do we actually want that behavior?

Either _exit() (long available extension, which lld already uses) or quick_exit() (the new C standard way) seem possibly preferable?

It's easy to crash LLVM even without this change, so anyone running LLVM better have core dumps configured the way they like. Failed asserts raise SIGABRT already, for example, and we have tons of those. The only difference is that now end users, who may have never configured this stuff, may see more crashes. If it's a problem, I'd consider it QoI: we should fix the report_fatal_error to use proper diagnostics anyway so end users don't see them as often, just as we would treat any other user-visible crash.

ychen updated this revision to Diff 238314.Jan 15 2020, 10:19 AM
  • rebase

Unit tests: fail. 61892 tests passed, 2 failed and 782 were skipped.

failed: Clang.CXX/temp/temp_arg/temp_arg_template/p3-2a.cpp
failed: LLVM.CodeGen/SystemZ/mverify-optypes.mir

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

rnk added a comment.Jan 15 2020, 11:50 AM

I'm still in favor of this change.

MaskRay accepted this revision.Jan 15 2020, 12:00 PM

Apologies that it takes so long. I'll fix the test and land it today.

In D67847#1693694, @rnk wrote:

The abort() function raises SIGABRT, for which the default behavior is to trigger a coredump. Do we actually want that behavior?

Either _exit() (long available extension, which lld already uses) or quick_exit() (the new C standard way) seem possibly preferable?

It's easy to crash LLVM even without this change, so anyone running LLVM better have core dumps configured the way they like. Failed asserts raise SIGABRT already, for example, and we have tons of those. The only difference is that now end users, who may have never configured this stuff, may see more crashes. If it's a problem, I'd consider it QoI: we should fix the report_fatal_error to use proper diagnostics anyway so end users don't see them as often, just as we would treat any other user-visible crash.

That may be, but right now we do report a bunch of errors via report_fatal_error, even if we should not. Coredumps seem unlikely to be of use in diagnosing the issue involved (unlike, perhaps, with a segfault), so I don't see why we'd _want_ to call abort vs quick_exit/_exit.

I express this as my opinion, but do not veto acceptance given by others.

ychen updated this revision to Diff 238378.Jan 15 2020, 3:15 PM
  • fix tests

Unit tests: pass. 61894 tests passed, 0 failed and 782 were skipped.

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

This revision was automatically updated to reflect the committed changes.
ychen added a comment.Jan 15 2020, 7:21 PM

There are some interesting failures that need to be investigated.
https://reviews.llvm.org/rG647c3f4e47de8a850ffcaa897db68702d8d2459a

rnk added a comment.Jan 29 2020, 11:43 AM

If we can reland this patch, it should fix this new clang bug:
https://bugs.llvm.org/show_bug.cgi?id=44705

This was the list of failures you noted on the commit:

The ppc64be stuff seems like only someone with hardware will be able to debug it.
For clang/test/CXX/temp/temp.arg/temp.arg.template/p3-2a.cpp, you are saying you can reproduce that locally?
The timeout seems like it could be flakiness or something unrelated.
The thinlto failures... I'm not sure either, but again it's PPC64, so it could be flakiness.

It seems like it would be best to alert the owners of those bots and reland and get new results, assuming things pass on less exotic bot configurations.

ychen added a comment.Jan 29 2020, 2:03 PM
In D67847#1847757, @rnk wrote:

If we can reland this patch, it should fix this new clang bug:
https://bugs.llvm.org/show_bug.cgi?id=44705

This was the list of failures you noted on the commit:

The ppc64be stuff seems like only someone with hardware will be able to debug it.
For clang/test/CXX/temp/temp.arg/temp.arg.template/p3-2a.cpp, you are saying you can reproduce that locally?
The timeout seems like it could be flakiness or something unrelated.
The thinlto failures... I'm not sure either, but again it's PPC64, so it could be flakiness.

It seems like it would be best to alert the owners of those bots and reland and get new results, assuming things pass on less exotic bot configurations.

p3-2a.cpp random failure is fixed.
Thinlto / PPC issues are OOM related. PPC admin suggested recommit.

I've confirmed locally that failures on sanitier_windows are real (Patches are below), but I have no clue about how they are related.
D73329
D73327