This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Print current request context along with the stack trace
ClosedPublic

Authored by 0x1eaf on Sep 9 2021, 6:52 AM.

Details

Summary

Motivation:

At the moment it is hard to attribute a clangd crash to a specific request out of all in-flight requests that might be processed concurrently. So before we can act on production clangd crashes, we have to do quite some digging through the log tables populated by our in-house VSCode extension or sometimes even directly reach out to the affected developer. Having all the details needed to reproduce a crash printed alongside its stack trace has a potential to save us quite some time, that could better be spent on fixing the actual problems.

Implementation approach:

  • introduce ThreadCrashReporter class that allows to set a temporary signal handler for the current thread
  • follow RAII pattern to simplify printing context for crashes occurring within a particular scope
  • hold std::function as a handler to allow capturing context to print
  • set local ThreadCrashReporter within JSONTransport::loop() to print request JSON for main thread crashes, and in ASTWorker::run() to print the file paths, arguments and contents for worker thread crashes

ThreadCrashReporter currently allows only one active handler per thread, but the approach can be extended to support stacked handlers printing context incrementally.

Example output for main thread crashes:

...
#15 0x00007f7ddc819493 __libc_start_main (/lib64/libc.so.6+0x23493)
#16 0x000000000249775e _start (/home/emmablink/local/llvm-project/build/bin/clangd+0x249775e)
Signalled while processing message:
{"jsonrpc": "2.0", "method": "textDocument/didOpen", "params": {"textDocument": {"uri": "file:///home/emmablink/test.cpp", "languageId": "cpp", "version": 1, "text": "template <typename>\nclass Bar {\n  Bar<int> *variables_to_modify;\n  foo() {\n    for (auto *c : *variables_to_modify)\n      delete c;\n  }\n};\n"}}}

Example output for AST worker crashes:

...
#41 0x00007fb18304c14a start_thread pthread_create.c:0:0
#42 0x00007fb181bfcdc3 clone (/lib64/libc.so.6+0xfcdc3)
Signalled during AST action:
Filename: test.cpp
Directory: /home/emmablink
Command Line: /usr/bin/clang -resource-dir=/data/users/emmablink/llvm-project/build/lib/clang/14.0.0 -- /home/emmablink/test.cpp
Version: 1
Contents:
template <typename>
class Bar {
  Bar<int> *variables_to_modify;
  foo() {
    for (auto *c : *variables_to_modify)
      delete c;
  }
};

Testing:

The unit test covers the thread-localitity and nesting aspects of ThreadCrashReporter. There might be way to set up a lit-based integration test that would spawn clangd, send a message to it, signal it immediately and check the standard output, but this might be prone to raceconditions.

Diff Detail

Event Timeline

0x1eaf created this revision.Sep 9 2021, 6:52 AM
0x1eaf requested review of this revision.Sep 9 2021, 6:52 AM

Thanks for doing this, I think this is incredibly useful for debugging.
But also subtle, so please forgive a bunch of comments!

I don't think we're strictly in defined-behavior territory in much of what these signal handlers are doing, but neither is clang's existing crash handlers and they seem to work well (main difference is the thread-local access)

Also discussed this a bunch with @kadircet offline.

clang-tools-extra/clangd/JSONTransport.cpp
114

this is capturing by value, I don't think it needs to

117

why \r?

clang-tools-extra/clangd/TUScheduler.cpp
601

naming: "context" is a bit vague (and overloaded in clangd), and we often use "dump" for things in a debugging context.

Maybe dumpCurrentRequest()?

1428

i'd probably pass this in as a param, just to move the "we're printing to stderr" and "we're running in a signal handler" decisions closer together.

1429

the name of the action is available as CurrentRequest->Action, which I think is safe to either pass into this function or read directly here.

1431

nit: please indent the lines after the first to set this apart from other crash output a bit

1440

hmm, this is potentially really useful (could gather full reproducers) but also is going to dominate the output and make it annoying to see the stack traces/other details. I'm worried this may make this feature net-negative for a majority of users, who won't find the right parts of the output to report...

How critical do you think this is for debugging crashes in practice? (I suspect fairly important!)
Since we're taking liberties with async-signal-safety anyway, it might be possible to get this sent to a tempfile, which might be more useful. In any case, we might want full file dumps to be off by default outside of environments where anyone is likely to actually try to get the reproducers.

I'd probably suggest leaving this out of the initial patch entirely so we can focus on getting the rest of the mechanism right.

1440

another thought along the lines of reproducers (but nothing for this patch)...

If we're working on a file and have a preamble, then the headers that make up the preamble are relevant. In practice it's hard to reproduce bug reports we get because we need all their headers. If we're using a preamble then just dumping all the filenames it includes would make it possible for some tool to take the crash report and zip up a reproducer.

(This probably needs the ability to run all of the handlers for the current thread, not just the top of the stack, since we acquire the relevant preamble later)

clang-tools-extra/clangd/support/ThreadSignalHandler.cpp
13 ↗(On Diff #371582)

style nit: we tend do wrap the code in namespace clang { namespace clangd { instead, at least within clangd

15 ↗(On Diff #371582)

we assume thread_local in clangd (see support/Context.ccp)

It's possible that llvm::sys::ThreadLocal happens to play nicer with signal handlers in practice (none of these are completely safe!), but unless we've seen some concrete evidence I'd rather just have static thread_local ThreadSignalHandlerCallback* here

15 ↗(On Diff #371582)

we can quite easily support a stack of handlers, such that we dump *all* living handlers on the crashing thread.

just give each ThreadSignalHandler a ThreadSignalHandler* Next = nullptr member, and both the constructor and destructor swap(Next, CurrentCallback). Then CurrentCallback points to the head of a linked list of handlers.

Not sure if you want to have that in this patch, but it's pretty simple so I wouldn't object.

17 ↗(On Diff #371582)

A library may create a ThreadSignalHandler to dump context which may in turn install a global signal handler, which can be a surprising side-effect.
In particular, we do have users that embed ClangdServer in different environments (not via ClangdMain) and don't use the LLVM crash-handlers.

I think the cleanest thing would be to have ThreadSignalHandler.cpp expose a function RunThreadCrashHandlers() or so, which is basically PrintInputs, and have ClangdMain.cpp call llvm::sys::AddSignalHandler near where it calls PrintStackTraceOnErrorSignal today

22 ↗(On Diff #371582)

I think we should do this already, probably just starting with an allowlist of SIGBUS, SIGFPE, SIGILL, SIGSEGV (these are signals that are delivered to the thread that triggered them per the linux docs).

I'm also wondering how portable this assumption/mechanism is, e.g. whether we should enable it for certain platforms only. I guess it's probably true for unixy things, and apparently mostly true for windows too[1]? I'm a little concerned *something* may go wrong on windows, though.

[1] https://stackoverflow.com/questions/15879342/which-thread-are-signal-handlers-e-g-signalsigint-crtl-c-called-on

36 ↗(On Diff #371582)

we could consider atomic_signal_fence around these mutations, or making CurrentCallback atomic.

It doesn't get us out of all possible write-reordering problems (we could see inconsistent states of the data referred to by the handler) but may avoid some catastrophic failures

clang-tools-extra/clangd/support/ThreadSignalHandler.h
7 ↗(On Diff #371582)

Either here in the file comment or on the ThreadSignalHandler class we should definitely have an overview of what this is for/how it's used.

16 ↗(On Diff #371582)

I understand framing this as general signal handler machinery, but I think it might be easier to reason about if we give it a name more specific to its purpose. Like maybe ThreadCrashReporter?

The main reason is that in the general case signal handling calls to mind a bunch of concerns like async safety that we're able to be a bit cavalier about in the specific case where the signal is (probably) handled synchronously, we're crashing anyway, etc. We also don't really want to bother exposing info like *which* signal is being handled, but a "signal handler" interface looks strange without it.

21 ↗(On Diff #371582)

I'm not quite clear on why we need to support move-construction (and the move-constructor's interaction with static state!)

Oops, forgot one thing: you probably want to instrument the preamble-building in TUScheduler (or in Preamble.cpp?), the background indexing, and code completion. Those all run the clang parser and should be a rich source of clang bugs.

Testing idea: yes '[' | head -n 5000 | clangd --input-style=delimited crashes the main thread, because the JSON parser is recursive...

clang-tools-extra/clangd/support/ThreadSignalHandler.cpp
15 ↗(On Diff #371582)

Sorry, just realized you already suggested the stacking in the patch description!

Thank you for such a prompt and thorough review!

Please consider all the comments I haven't replied to acknowledged and I'm going to resolve the issues with the next update.

clang-tools-extra/clangd/JSONTransport.cpp
117

the message itself is terminated with \r\n in the protocol and I just mimicked it for no good reason — will correct

clang-tools-extra/clangd/TUScheduler.cpp
1429

I've noticed the the action name is always hardcoded as “Build AST ({version})” and just logged the version directly below, but I guess it's not forward compatible, so I'm going to log both…

1440

How critical do you think this is for debugging crashes in practice?

For us it would likely be of utmost importance as our VSCode extension routes requests to Glean until local clangd finishes the compilation, so clangd ends up mostly being used for modified files, and the crashes might be unreproducible on the original source…

I'd probably suggest leaving this out of the initial patch entirely so we can focus on getting the rest of the mechanism right.

Alright, will remove the file contents with the next update.

In any case, we might want full file dumps to be off by default outside of environments where anyone is likely to actually try to get the reproducers.

What would be the best way to re-add it in a subsequent revision: make it configurable via a command line option, or via an environment variable? And if it's just an env var, maybe we could include it conditionally in this revision?

If we're using a preamble then just dumping all the filenames it includes would make it possible for some tool to take the crash report and zip up a reproducer.

We had discussions within the team about potentially implementing a similar tool, but the rough idea was to rely on VCS to collect the changes as the delta would normally be smaller than bundling all the headers that the TU includes. But if the VCS-agnostic approach would be simpler and more portable — I'm all for it :)

clang-tools-extra/clangd/support/ThreadSignalHandler.cpp
15 ↗(On Diff #371582)

Having Next pointer chains makes me worried about out of order destruction, but if we are removing the move constructor and making the crash handler object scope-bound always, that could actually work. But if we want to keep the possibility of storing the handler in an object to preserve it across method calls, I guess we could just make this a doubly-linked list and cut out the link gracefully.

36 ↗(On Diff #371582)

sure, will add a seq_cst one at the end, same as in setCrashLogMessage()

clang-tools-extra/clangd/support/ThreadSignalHandler.h
16 ↗(On Diff #371582)

makes total sense, will rename it

21 ↗(On Diff #371582)

I just parroted the scope_exit implementation — can remove it

0x1eaf added inline comments.Sep 10 2021, 9:46 AM
clang-tools-extra/clangd/support/ThreadSignalHandler.cpp
22 ↗(On Diff #371582)

I've just looked into adding filtering by signal number and there doesn't seem to be a way to do this portably in LLVM. But it looks like it might be safe enough to skip the filtering at all:

  • From a cursory look at Windows Signals.inc it looks like the “signal” handling is implemented via SetUnhandledExceptionFilter() which is thread-directed, and the SetConsoleCtrlHandler() callback does not call the registered handlers.
  • And on Unix platforms the AddSignalHandler() callbacks are called only for KillSigs out of which only SIGQUIT seem to be process-directed and would be delivered to any thread that is not blocking it, but if the thread gets delivered to has a ThreadCrashReporter installed during the interrupt — it seems to make sense to run the handler and let it print the thread-local context information (at least there seems to be no harm in doing so).

What do you think?

Well, thanks for such a well-thought-out patch :-) I have to admit my first reaction was terror, but this seems solid to me.

clang-tools-extra/clangd/TUScheduler.cpp
1429

the action name is always hardcoded as “Build AST ({version})”

This shouldn't *always* be the case, there should be entries like "Hover" etc too, corresponding to WorkScheduler->runWithAST("Hover", ...) in ClangdServer.cpp.

However the "build AST" actions are the ones that run the clang parser, so they're the slowest and crashiest for sure!

1440

What would be the best way to re-add it in a subsequent revision: make it configurable via a command line option, or via an environment variable? And if it's just an env var, maybe we could include it conditionally in this revision?

Yeah an env var seems tempting to me. I can't put my finger on *why* it feels like an appropriate way to configure this vs a flag, but at least not having to plumb it around is nice.
I'd be happy with if (getenv("CLANGD_CRASH_DUMP_SOURCE")) or something like that in this patch.

the rough idea was to rely on VCS to collect the changes

Right, this is probably why we don't have such a thing yet - at work we can just dump the VCS info and be done. Being able to get self-contained bug reports from more diverse environments would be nice to have, but hasn't been urgent.

clang-tools-extra/clangd/support/ThreadSignalHandler.cpp
15 ↗(On Diff #371582)

I think simplicity is probably more important than being totally flexible here, so I'd just forbid moving and out-of-order destruction.
If we end up wanting to move these crash handlers across scopes I'd suggest we just wrap them in unique_ptr in those cases. (And make it the creator's job to ensure their lifetimes nest properly).

22 ↗(On Diff #371582)

This sounds fine to me.

In a perfect world I'd rather SIGQUIT didn't dump the state of an essentially-random thread (we always have lots). But SIGQUIT isn't the common case, it's confusing but not otherwise harmful, and it seems like it'd be hard to fix.

0x1eaf updated this revision to Diff 372510.Sep 14 2021, 10:25 AM

addressed review comments

0x1eaf marked 7 inline comments as done.Sep 14 2021, 10:27 AM
0x1eaf marked an inline comment as done.
0x1eaf updated this revision to Diff 372515.Sep 14 2021, 10:41 AM
0x1eaf retitled this revision from [RFC] Print current request context along with the stack trance in clangd to [clangd] Print current request context along with the stack trace.
0x1eaf edited the summary of this revision. (Show Details)

updated revision description for final review after renaming the class

0x1eaf marked 2 inline comments as done.Sep 14 2021, 10:42 AM

I've tried making an integration test in addition to the unit test, but I couldn't find a way to make lit ignore the crashed process exit status:

FAIL: Clangd :: crash.test (1049 of 1049)
******************** TEST 'Clangd :: crash.test' FAILED ********************
Script:
--
: 'RUN: at line 2';   yes '[' | head -n 50000 | sh -c "clangd --input-style=delimited 2>&1 || true"
--
Exit Code: 141

I've also tried using not to the same effect:

# RUN: yes '[' | head -n 50000 | not clangd --input-style=delimited

Is there some trick that I'm missing?

This is mostly LG with nits, but I think the code-completion handler is wrong, and the tests as written won't build on windows.

Oops, forgot one thing: you probably want to instrument the preamble-building in TUScheduler (or in Preamble.cpp?), the background indexing, and code completion.
Those all run the clang parser and should be a rich source of clang bugs.

You've added code completion (i.e. runWithPreamble), but not preamble-building or background-indexing.
This is fine, we can add the others later - just wanted to make sure we're on the same page.

I've tried making an integration test in addition to the unit test, but I couldn't find a way to make lit ignore the crashed process exit status:
: 'RUN: at line 2'; yes '[' | head -n 50000 | sh -c "clangd --input-style=delimited 2>&1 || true"
Exit Code: 141

I think this is related to the fact that tests run with set -o pipefail, and yes fails when head closes the pipe.
not yes doesn't seem to help because SIGPIPE is actually caught by the shell rather than yes, thus the 141 code. (Hey, signals again!)
This seems to work: (yes '[' || :) | head -n 50000 | clangd --input-style=delimited

(It is in principle possbile to disable pipefail in lit tests but awkward on a per-test level and nobody does it.)

clang-tools-extra/clangd/TUScheduler.cpp
560

Probably worth commenting "May only call from worker thread" or so.

(Unfortunately we have unusual concurrency constraints on the members - some such as FileInputs are locked when written on the worker thread, locked when read from other threads, unlocked when read from the worker thread, and never written from other threads. Because it's confusing it's probably worth being explicit).

1428

nit: I'd prefer an explicit type over auto& here. Probably StringRef?

(I'm not actually sure what type we're getting and whether the reference is dangling).

1663

This doesn't seem right.
This code runs on the PreambleThreads threadpool, not on the worker thread. So a) the state of the worker thread isn't really relevant and b) we're accessing it in a non-threadsafe way.

Instead we need to dump Name, Command, Contents etc separately again.
Maybe dumpCurrentRequest should be a free helper function that takes the individual parameters instead.

clang-tools-extra/clangd/support/ThreadCrashReporter.cpp
29

I'm not sure the FIFO behavior is much (or at all) better than LIFO[1], and it seems like it adds a bit of complexity (we'd only need the Next pointer, and there'd be less confusion direction of next vs previous).

[1] (insert joke about python having tracebacks instead of backtraces, because they're backwards...)

clang-tools-extra/clangd/support/ThreadCrashReporter.h
24

Some mention of constraints on the passed function?

e.g. "The callback is likely to be invoked in a signal handler. Most LLVM signal handling is not strictly async-signal-safe. However reporters should avoid accessing data structures likely to be in a bad state on crash."

26

doesn't seem to be any need for a copy here, pass by value and move instead?
(And probably use llvm::unique_function to avoid the copyable requirement)

36

This should first have a comment saying what it actually does!
"Calls all currently-active ThreadCrashReporters for the current thread".

Probably also something like "This function is intended to be called from signal handlers, but is not strictly async-signal-safe - see constructor comment."

clang-tools-extra/clangd/tool/ClangdMain.cpp
683

nit: seems a little cleaner to handle dropping the cookie arg here instead of in runCrashHandlers
pass +[](void*){ ThreadCrashReporter::runCrashHandlers(); }?

clang-tools-extra/clangd/unittests/ThreadCrashReporterTests.cpp
27

unfortunately SIGUSR1 is not required to be supported (or exist) and appears not to be on MSVC per the failing pre-merge checks. And in fact SetInfoSignalFunction is a no-op on windows.

We could #ifdef the test, but it seems like a real shame.
EXPECT_DEATH is a candidate but it doesn't interact well with threads.

Best I can think of is some combination/subset of:

  • a test like this with #ifdef SIGUSR1
  • a simple EXPECT_DEATH using SIGINT that doesn't spawn any extra threads
  • a threaded test that calls runCrashHandlers()directly rather than via a signal handler

(I think most of the cases tested here: scope, FIFO etc could be moved to the third option as they really have nothing to do with signals per se. It's nice to have *some* test that it does work with llvm's signal handler though)

0x1eaf updated this revision to Diff 378257.Oct 8 2021, 8:52 AM

Addressed feedback from the last round of comments. Sorry for the delay!

0x1eaf marked 9 inline comments as done.Oct 8 2021, 8:58 AM

I've opted for simulating signals in the unit test on Windows by manually calling the signal handler, and added a list test to test the crash handling integration end-to-end.

sammccall accepted this revision.Oct 11 2021, 1:50 AM

Looks great, thanks!
Happy to commit this for you if you don't have access.

clang-tools-extra/clangd/TUScheduler.cpp
1334

nit: this copies CurrentRequest->Name, which seems like a very avoidable allocation in a crash handler... just use if?

This revision is now accepted and ready to land.Oct 11 2021, 1:50 AM
0x1eaf updated this revision to Diff 379124.Oct 12 2021, 11:25 AM

Updated the AST worker crash handler to avoid std::string copy: used a char pointer instead.

0x1eaf marked an inline comment as done.Oct 12 2021, 11:36 AM

I don't have commit access, so would appreciate if you commit it, thanks!

Forgot to mention on Friday:

I've instrumented preamble building, but haven't instrumented background indexing which is outside of TUScheduler.cpp and would require exposing crashDump* helpers and preferably moving them out of TUScheduler.cpp. And I think it might be easier to review it separately. Speaking of the helpers, where would you prefer those to be: a new file+header under clangd/support, or in some existing place?

Are there any blockers to landing this?

I've instrumented preamble building, but haven't instrumented background indexing which is outside of TUScheduler.cpp and would require exposing crashDump* helpers and preferably moving them out of TUScheduler.cpp. And I think it might be easier to review it separately.

Sounds good!

Speaking of the helpers, where would you prefer those to be: a new file+header under clangd/support, or in some existing place?

I think it's too concrete/specific to live in support.
There's no perfect place, maybe Compiler.h along with the definition of ParseInputs isn't so bad? That file is kind of a random assortment of low-level stuff anyway.

Are there any blockers to landing this?

Nope, just my kids have been sick and I've been losing track of things left and right.
Sorry about that, will land this now!

thakis added a subscriber: thakis.Oct 25 2021, 7:42 PM

This doesn't build on windows: http://45.33.8.238/win/47615/step_4.txt

Please take a look and revert for now if it takes a while to fix.

This doesn't build on windows: http://45.33.8.238/win/47615/step_4.txt

Please take a look and revert for now if it takes a while to fix.

I pushed rGba94b8bdffb4 as a speculative fix. I'll keep an eye on that bot. Thanks for the heads up.

Hi,
It seems like this patch may cause timeouts. I've seen it myself and also in this bot:
http://lab.llvm.org:8011/#/builders/52/builds/11791

When I got the timeout in our downstream CI I think it was the Clangd :: crash.test that had problems.

This doesn't build on windows: http://45.33.8.238/win/47615/step_4.txt

Please take a look and revert for now if it takes a while to fix.

I pushed rGba94b8bdffb4 as a speculative fix. I'll keep an eye on that bot. Thanks for the heads up.

Thanks, now it builds. But the test added in this change fails: http://45.33.8.238/win/47645/step_9.txt

Here's another test failure: https://green.lab.llvm.org/green/job/clang-stage1-RA/24998
Can you revert if the fix is not trivial?

I think this should work-around end-to-end test failure on Windows:

diff --git a/clang-tools-extra/clangd/test/crash.test b/clang-tools-extra/clangd/test/crash.test
index 1197e1ab07c3..68ef54808f09 100644
--- a/clang-tools-extra/clangd/test/crash.test
+++ b/clang-tools-extra/clangd/test/crash.test
@@ -1,3 +1,4 @@
+# REQUIRES: shell
 # Overflow the recursive json parser, prevent `yes` error due to broken pipe and `clangd` SIGSEGV from being treated as a failure.
 # RUN: (yes '[' || :) | head -n 50000 | (clangd --input-style=delimited 2>&1 || :) | FileCheck %s
 #      CHECK: Signalled while processing message:

Alternatively, the test can be completely disabled for the time being.

I don't have commit access, should I create a new revision for review with the test change?

I think this should work-around end-to-end test failure on Windows:

diff --git a/clang-tools-extra/clangd/test/crash.test b/clang-tools-extra/clangd/test/crash.test
index 1197e1ab07c3..68ef54808f09 100644
--- a/clang-tools-extra/clangd/test/crash.test
+++ b/clang-tools-extra/clangd/test/crash.test
@@ -1,3 +1,4 @@
+# REQUIRES: shell
 # Overflow the recursive json parser, prevent `yes` error due to broken pipe and `clangd` SIGSEGV from being treated as a failure.
 # RUN: (yes '[' || :) | head -n 50000 | (clangd --input-style=delimited 2>&1 || :) | FileCheck %s
 #      CHECK: Signalled while processing message:

Alternatively, the test can be completely disabled for the time being.

I don't have commit access, should I create a new revision for review with the test change?

I just removed this test for now, we're also seeing problems on mac (I believe because stack overflow can do something other than signal), and flakiness on linux+tsan.
This functionality is covered by unittests that are not failing.

We should probably work on an integration test that is a bit more principled, probably using #pragma clang __debug llvm_fatal_error which calls abort(). It's annoying because we disable those pragmas for obvious reasons...

@smeenai thanks!

Thanks for the fixes! My bots are happy now.