This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Fix race condition between lldb-vscode and stop hooks executor
ClosedPublic

Authored by ilya-nozhkin on Feb 11 2022, 7:49 AM.

Details

Summary

The race is between these two pieces of code that are executed in two separate
lldb-vscode threads (the first is in the main thread and another is in the
event-handling thread):

// lldb-vscode.cpp
g_vsc.debugger.SetAsync(false);
g_vsc.target.Launch(launch_info, error);
g_vsc.debugger.SetAsync(true);
// Target.cpp
bool old_async = debugger.GetAsyncExecution();
debugger.SetAsyncExecution(true);
debugger.GetCommandInterpreter().HandleCommands(GetCommands(), exc_ctx,
                                                options, result);
debugger.SetAsyncExecution(old_async);

The sequence that leads to the bug is this one:

  1. Main thread enables synchronous mode and launches the process.
  2. When the process is launched, it generates the first stop event.
  3. This stop event is catched by the event-handling thread and DoOnRemoval is invoked.
  4. Inside DoOnRemoval, this thread runs stop hooks. And before running stop hooks, the current synchronization mode is stored into old_async (and right now it is equal to "false").
  5. The main thread finishes the launch and returns to lldb-vscode, the synchronization mode is restored to asynchronous by lldb-vscode.
  6. Event-handling thread finishes stop hooks processing and restores the synchronization mode according to old_async (i.e. makes the mode synchronous)
  7. And now the mode is synchronous while lldb-vscode expects it to be asynchronous. Synchronous mode forbids the process to broadcast public stop events, so, VS Code just hangs because lldb-vscode doesn't notify it about stops.

So, this diff makes the target intercept the first stop event if the process is
launched in the synchronous mode, thus preventing stop hooks execution.

The bug is only present on Windows because other platforms already
intercept this event using their own hijacking listeners.

So, this diff also fixes some problems with lldb-vscode tests on Windows to make
it possible to run the related test. Other tests still can't be enabled because
the debugged program prints something into stdout and LLDB can't intercept this
output and redirect it to lldb-vscode properly.

Diff Detail

Event Timeline

ilya-nozhkin requested review of this revision.Feb 11 2022, 7:49 AM
ilya-nozhkin created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 11 2022, 7:49 AM
labath added a subscriber: labath.Feb 14 2022, 4:40 AM

I'm not sure this is fixing the problem at the right level. I mean, even if you fix the race on the synchronicity flag, it still seems like there is a zillion of other things that the two processes (event handler thread and stop hook executor) race on.

Isn't the right fix to make sure that the event gets broadcast only after the stop hooks finish their work? How does that work on other platforms?

it still seems like there is a zillion of other things that the two processes (event handler thread and stop hook executor) race on.

Stop hooks are executed in the event handler thread, so, these two particular things don't have any conflicts. Event handler thread (and stop hooks execution inside it) races with the main lldb-vscode thread (the one that handles VS Code requests).

Isn't the right fix to make sure that the event gets broadcast only after the stop hooks finish their work?

It would be cool because of some other reasons but it won't solve this particular race. I.e. we need some synchronization between stop hooks execution and the main lldb-vscode thread. So, even if the event is not broadcasted until stop hooks are executed (which means that they are executed by some other thread and it should not be the process' private state thread) then the main lldb-vscode thread will race with this thread that executes stop hooks.

How does that work on other platforms?

I've only tested it on Linux and Windows, so, I don't know whether this bug is present on other platforms. On Linux, the first stop event is pushed to the event queue of lldb.PlatformLinux.DebugProcess.hijack listener but then this listener gets unsubscribed from the process events and deleted, so, the first stop event is not handled at all, DoOnRemoval method is not invoked thus stop hooks are not executed.
In the command line mode, as I understand, the synchronicity flag is not switched as often as lldb-vscode switches it, so, it seems that the race is avoided by chance.

Ah, I'm sorry. It seems I misunderstood the nature of the problem. Let me try again.

In your description of the problem, I think the problematic part is the sequencing of steps 5 and 6. You're saying that the main thread returns from a "synchronous" launch operation even though the stop event is still being processed by the event handler thread. That doesn't seem right to me because the whole point of switching to synchronous mode was to ensure that, by the time the operation returns, the process is stopped and ready to accept additional commands. That won't work if the stop hooks are still running in the background.

So, I'd say that the bug is caused by the lack of proper ordering between these two events, and instead of making the code cope with it, the right fix is to add the synchronization. I think that could be achieved by "hijacking" the process events in the case of a synchronous launch, similar to how we do it for synchronous resumes (compare Process::Resume and Process:ResumeSynchronous) -- except that this may be slightly trickier to implement because some of our (platform-specific) launch code already hijacks these events.

Moved the body of this giant lambda to a separate method.
Also, renamed the locking method because "do with locked synchronization" looks like the synchronized mode will be locked. "Do with locked synchronicity" looks better IMO.

I think that could be achieved by "hijacking" the process events in the case of a synchronous launch, similar to how we do it for synchronous resumes (compare Process::Resume and Process:ResumeSynchronous) -- except that this may be slightly trickier to implement because some of our (platform-specific) launch code already hijacks these events.

Oh, I didn't notice your answer when I was updating the diff. Yeah, I think, it will work perfectly even if platform-specific code hijacks events (it is done in the same thread, so, no race should happen, we only need to catch any events that weren't catched by platform-specific code). I'll try to implement this approach.

jingham added a comment.EditedFeb 14 2022, 4:15 PM

I'm a little confused by the analysis so far. SBTarget.Launch calls Target::Launch. That sets up a hijacker for the "stop at the first instruction" event regardless of the Sync mode. Once that succeeds, if the Debugger is set for Synchronous execution Target::Launch does:

if (synchronous_execution) {
  // Now we have handled the stop-from-attach, and we are just
  // switching to a synchronous resume.  So we should switch to the
  // SyncResume hijacker.
  m_process_sp->RestoreProcessEvents();
  m_process_sp->ResumeSynchronous(stream);

and ResumeSynchronous sets up a Hijacker to wait for and consume the stop event for the first "user stop" after the launch:

ListenerSP listener_sp(
    Listener::MakeListener(g_resume_sync_name));
HijackProcessEvents(listener_sp);

Status error = PrivateResume();
if (error.Success()) {
  StateType state = WaitForProcessToStop(llvm::None, nullptr, true,
                                         listener_sp, stream);

This is the same routine used by all the "target running" commands in sync mode. WaitForProcessToStop won't return till the process has stopped, and the ResumeSync hijacker will consume the event.

So I don't see how your event listening thread could have heard about the second stop event as the events should not have been sent to it.

jingham requested changes to this revision.Feb 14 2022, 4:23 PM

Note, however, that if you are doing a Synchronous launch, you won't ever get an event telling you the process has stopped. Maybe that's what is causing the actual confusion? If you are waiting on a stop event after a synchronous Launch, you won't get one. But that's pretty much what you asked for: "run this command w/o my having to spin an event loop to handle it..." The way you know that a synchronous Lauch has stopped is because the command returned successfully.

I don't think this patch is solving the right problem.

This revision now requires changes to proceed.Feb 14 2022, 4:23 PM

FYI: I have a patch coming up that will add the ability to not have to play with the async mode when launching or attaching. I will try and post that diff here, it should be up today.

We were running into a race condition when we used "launchCommands" or "attachCommands" as they are not executed in sync mode, so we ended up getting events before the IDE was ready. I am adding code to wait for the process to stop (we always stop at entry when launching via lldb-vscode) after the "launchCommands" or "attachCommands", and we might be able to get rid of the setting of async mode before and after launch.

SBTarget.Launch calls Target::Launch. That sets up a hijacker for the "stop at the first instruction" event regardless of the Sync mode.

The problem that it sets up this hijacker too late. See this part of the log (lines starting with // are my comments):

// The process has been launched successfully, its internal hijacker is removed.
 0x000002875FCB1AE0 Broadcaster("lldb.process")::RestoreBroadcaster (about to pop listener("LaunchEventHijack")=0x000002875FFCF890)
 Process::SetPublicState (state = stopped, restarted = 0)
 Process::SetPublicState (stopped) -- unlocking run lock
 Process::lldb_private::Process::StartPrivateStateThread() starting private state thread

// Some internal magic happens inside the private state thread.
 Process::lldb_private::Process::ControlPrivateStateThread (signal = 4)
 Sending control event of type: 4.
 0000028760372F90 Broadcaster("lldb.process.internal_state_control_broadcaster")::BroadcastEvent (event_sp = {0000028760571460 Event: broadcaster = 000002876047C4C0 (lldb.process.internal_state_control_broadcaster), type = 0x00000004 (control-resume), data = {Generic Event Data}}, unique =0) hijack = 0000000000000000
 000002875FFD00D0 Listener('lldb.process.internal_state_listener')::AddEvent (event_sp = {0000028760571460})
 Process::lldb_private::Process::RunPrivateStateThread (arg = 000002876047C350, pid = 9944) thread starting...
 timeout = <infinite>, event_sp)...
 this = 0x000002875FFD00D0, timeout = <infinite> for lldb.process.internal_state_listener
 000002875FFD00D0 'lldb.process.internal_state_listener' Listener::FindNextEventInternal(broadcaster=000002876047C4C0, broadcaster_names=0000000000000000[0], event_type_mask=0x00000000, remove=1) event 0000028760571460
 Process::lldb_private::Process::RunPrivateStateThread (arg = 000002876047C350, pid = 9944) got a control event: 4
 000002875FFCF5D0 Listener::StartListeningForEvents (broadcaster = 000002876047C9C8, mask = 0x00000020) acquired_mask = 0x00000020 for Communication::SyncronizeWithReadThread
 timeout = <infinite>, event_sp)...
 this = 0x000002875FFD00D0, timeout = <infinite> for lldb.process.internal_state_listener
 Process::ShouldBroadcastEvent (00000287604D4CB0) => new state: stopped, last broadcast state: stopped - YES
 Process::lldb_private::Process::HandlePrivateEvent (pid = 9944) broadcasting new state stopped (old state stopped) to public

// Process broadcasts the stop event.
 000002875FCB1AE0 Broadcaster("lldb.process")::BroadcastEvent (event_sp = {00000287604D4CB0 Event: broadcaster = 000002876047C378 (lldb.process), type = 0x00000001 (state-changed), data = { process = 000002876047C350 (pid = 9944), state = stopped}}, unique =0) hijack = 0000000000000000

// It is catched by "lldb.Debugger"!
 000002875DB0BD40 Listener('lldb.Debugger')::AddEvent (event_sp = {00000287604D4CB0})

// And only now the target adds its hijacking listener!
 0x000002875FCB1AE0 Broadcaster("lldb.process")::HijackBroadcaster (listener("lldb.Target.Launch.hijack")=0x000002875FFCFE10)

FYI: I have a patch coming up that will add the ability to not have to play with the async mode when launching or attaching. I will try and post that diff here, it should be up today.

We were running into a race condition when we used "launchCommands" or "attachCommands" as they are not executed in sync mode, so we ended up getting events before the IDE was ready. I am adding code to wait for the process to stop (we always stop at entry when launching via lldb-vscode) after the "launchCommands" or "attachCommands", and we might be able to get rid of the setting of async mode before and after launch.

Oh, that's great to hear! It will probably solve this race condition on stop hooks as well.
Then, I think, I'd rather wait for your changes and run my test case against them.

Can anyone try out the following patch and see if this stops the race condition

https://reviews.llvm.org/D119797

It might solve the problem.

SBTarget.Launch calls Target::Launch. That sets up a hijacker for the "stop at the first instruction" event regardless of the Sync mode.

The problem that it sets up this hijacker too late. See this part of the log (lines starting with // are my comments):

// The process has been launched successfully, its internal hijacker is removed.
 0x000002875FCB1AE0 Broadcaster("lldb.process")::RestoreBroadcaster (about to pop listener("LaunchEventHijack")=0x000002875FFCF890)
 Process::SetPublicState (state = stopped, restarted = 0)
 Process::SetPublicState (stopped) -- unlocking run lock
 Process::lldb_private::Process::StartPrivateStateThread() starting private state thread

// Some internal magic happens inside the private state thread.
 Process::lldb_private::Process::ControlPrivateStateThread (signal = 4)
 Sending control event of type: 4.
 0000028760372F90 Broadcaster("lldb.process.internal_state_control_broadcaster")::BroadcastEvent (event_sp = {0000028760571460 Event: broadcaster = 000002876047C4C0 (lldb.process.internal_state_control_broadcaster), type = 0x00000004 (control-resume), data = {Generic Event Data}}, unique =0) hijack = 0000000000000000
 000002875FFD00D0 Listener('lldb.process.internal_state_listener')::AddEvent (event_sp = {0000028760571460})
 Process::lldb_private::Process::RunPrivateStateThread (arg = 000002876047C350, pid = 9944) thread starting...
 timeout = <infinite>, event_sp)...
 this = 0x000002875FFD00D0, timeout = <infinite> for lldb.process.internal_state_listener
 000002875FFD00D0 'lldb.process.internal_state_listener' Listener::FindNextEventInternal(broadcaster=000002876047C4C0, broadcaster_names=0000000000000000[0], event_type_mask=0x00000000, remove=1) event 0000028760571460
 Process::lldb_private::Process::RunPrivateStateThread (arg = 000002876047C350, pid = 9944) got a control event: 4
 000002875FFCF5D0 Listener::StartListeningForEvents (broadcaster = 000002876047C9C8, mask = 0x00000020) acquired_mask = 0x00000020 for Communication::SyncronizeWithReadThread
 timeout = <infinite>, event_sp)...
 this = 0x000002875FFD00D0, timeout = <infinite> for lldb.process.internal_state_listener
 Process::ShouldBroadcastEvent (00000287604D4CB0) => new state: stopped, last broadcast state: stopped - YES
 Process::lldb_private::Process::HandlePrivateEvent (pid = 9944) broadcasting new state stopped (old state stopped) to public

// Process broadcasts the stop event.
 000002875FCB1AE0 Broadcaster("lldb.process")::BroadcastEvent (event_sp = {00000287604D4CB0 Event: broadcaster = 000002876047C378 (lldb.process), type = 0x00000001 (state-changed), data = { process = 000002876047C350 (pid = 9944), state = stopped}}, unique =0) hijack = 0000000000000000

// It is catched by "lldb.Debugger"!
 000002875DB0BD40 Listener('lldb.Debugger')::AddEvent (event_sp = {00000287604D4CB0})

// And only now the target adds its hijacking listener!
 0x000002875FCB1AE0 Broadcaster("lldb.process")::HijackBroadcaster (listener("lldb.Target.Launch.hijack")=0x000002875FFCFE10)

That can't be it. Target::Launch consumes the "stop at entry" event when it calls WaitForProcessToStop. And then we set the Resume hijacker before we resume the process, so there weren't any other stop events around at that point.

It's also odd that we choose to broadcast a stop event when the state is stopped. We don't usually broadcast duplicate events. Somewhere we must be being helpful and sending out a priming event when the private state thread gets resumed. But I couldn't find where we were doing that on a quick pass.

Can anyone try out the following patch and see if this stops the race condition

https://reviews.llvm.org/D119797

It might solve the problem.

Yeah, now the test passes and VS Code doesn't fall into infinite waiting loop.
I'm still concerned about that weird event dispatching happening during the launch, but lldb-vscode now works, so, I can close this review if you want to investigate this bug sometime later.

Target::Launch consumes the "stop at entry" event when it calls WaitForProcessToStop.

Actually, there is a gap between returning from Process::Launch and setting up the target's hijacking listener. I think, the stop event sneaks past the target right during this time interval. So, Target::Launch can't consume it because there is nothing to consume (i.e. the event has been added to another listener's queue). But I agree that it's odd that this event is broadcasted at all.

SBTarget.Launch calls Target::Launch. That sets up a hijacker for the "stop at the first instruction" event regardless of the Sync mode.

The problem that it sets up this hijacker too late.

Yes, so maybe the fix is to set the hijacker early (?)

That is what happens in the linux case -- hijacking is set up in PlatformPOSIX::DebugProcess, before the process is launched. If windows does not do that, then that would explain the bug and the differences between the platforms.

jingham added a comment.EditedFeb 15 2022, 9:52 AM

SBTarget.Launch calls Target::Launch. That sets up a hijacker for the "stop at the first instruction" event regardless of the Sync mode.

The problem that it sets up this hijacker too late.

Yes, so maybe the fix is to set the hijacker early (?)

That is what happens in the linux case -- hijacking is set up in PlatformPOSIX::DebugProcess, before the process is launched. If windows does not do that, then that would explain the bug and the differences between the platforms.

Ah, I see. I was focusing on the second event, not the original stop one.

The current code is relying on getting the ProcessSP & calling HijackProcessEvents on it, but as you say, that's too late. Instead, we should put a Listener in the ProcessLaunchInfo before we do any of the CreateProcess stuff. All the calls to actually make the process in Target::Launch take a ProcessLaunchInfo, and should obey its listener. So that seems the right way to ensure that the process events get captured for this initial stop before they get a chance to leak out. And then we wouldn't have to rely on the individual platforms to set this up themselves.

ilya-nozhkin edited the summary of this revision. (Show Details)

Implemented the approach suggested by @labath. I.e. now the target's hijacking listener is activated before Process::Launch.

There are actually two ways of implementing that. The first of them (which is used in this diff) is to make the target activate the hijacking listener right after the process creation. The second one is to specify the hijacking listener in the launch info and then pass it to the platform, but it requires to modify all platforms to force them to activate the hijacking listener passed via launch info (currently, no platform actually does that). I've chosen the first way because it seems to be reliable and safe. I.e. the target won't depend on platforms, so, it'll be harder to break event dispatching again by some non-related modifications of platform's code.

I've also decided not to remove platform-specific hijacking listeners yet because some platforms define a hijacking listener even in asynchronous mode and then use it for waiting for the process to stop (for example, PlatformQemuUser does that). I'll try to test tomorrow whether it is possible to remove hijacking listeners at least from some subset of platforms.

Doing it in the common code is a great idea, and it will make the code more robust. However, instead of "sneaking" the listener through a member variable, it would be better to pass it through function arguments instead (normally in the LaunchInfo struct, and as a probably as separate argument for the last mile). I believe this is what Jim had in mind as well.

See inline comments for details.

lldb/source/Target/Process.cpp
2538–2539 ↗(On Diff #409091)

It would be better if this could be achieved by restructuring the code so that the hijacking is really scope-based. E.g., moving the "file doesn't exist" check up front and making it an early exit, moving some stuff into helper functions, or whatever.

lldb/source/Target/Target.cpp
202

Add an extra argument here.

3038–3039

Remove this. Make sure the listener in the launch_info is plumbed to the launch function.

3043

Remove hijack listener creation from DebugProcess implementations (you can replace with an assert if you want).

3080–3081

I guess we should just delete this now, as it is never correct to hijack events this late in the game.

Doing it in the common code is a great idea, and it will make the code more robust. However, instead of "sneaking" the listener through a member variable, it would be better to pass it through function arguments instead (normally in the LaunchInfo struct, and as a probably as separate argument for the last mile). I believe this is what Jim had in mind as well.

Yup!

See inline comments for details.

Applied requested changes.

I don't think CreateProcess needs a HijackListener, does it? It doesn't generate events, it just make the connection to the server. It is always followed by Launch or DebugProcess, which are the ones that do the real work.

lldb/include/lldb/Target/Target.h
587 ↗(On Diff #409404)

Probably worth saying what the hijack_listener is for as well.

lldb/packages/Python/lldbsuite/test/tools/lldb-vscode/lldbvscode_testcase.py
14

I don't know if it's worth working harder for this, but it would be confusing if lldb-vscode existed but wasn't executable, and we gave an error of "lldb-vscode must exist". Maybe just say "must exist and be executable"...

lldb/source/Target/Target.cpp
3033

In this context sync vrs. async is really about whether the caller expects to get a stop event for the first "real user stop", or just have Launch return when we are at the first user event w/o the caller having to spin the event loop. A real user stop is generally "process hit a breakpoint" or "process crashed", etc.

That means we should hijack the events up to the stop at the entry point in both sync and async modes. The odd case is when "stop at entry" is set. In that case the "launch success" stop is also the first "real user stop". So in the "stop at entry && async" mode we need to arrange for there to be a stop event that the caller can wait on.

The cleanest way to do this would be to always hijack the events, and then in the case where stop at entry is set in async mode, rebroadcast the stop event to the regular listener. You could have the DebugProcess & Launch return the last event SP for this purpose.

The way the old code did it was to just undo the hijack, and return w/o calling WaitForProcessToStop (line 3084 & following) under the assumption that this would leave the last stop event unfetched and available for the caller. But I'm not sure that's guaranteed to work.

I don't think CreateProcess needs a HijackListener, does it? It doesn't generate events, it just make the connection to the server. It is always followed by Launch or DebugProcess, which are the ones that do the real work.

CreateProcess is called from within DebugProcess, but DebugProcess is virtual/platform-specific. That makes CreateProcess a handy centralized place to do this, particularly if we're *always* going to be hijacking events for the initial creation (I'm not sure what's our attach story).

Made requested changes.

I agree with @jingham that if the listener is passed via launch info then we don't need to activate it in CreateProcess. I mean, activating it in CreateProcess was a hack that allowed me to hijack events regardless of platform implementation. But if the listener is passed via launch info then platforms should redirect it to CreateProcess anyway. So, why not activate the listener in the platform's code? On the one hand, it creates unnecessary code duplication but on the other hand, it makes responsibilities of CreateProcess clearer. A possible compromise is to create a new method Target::PrepareProcessForLaunch and set up the hijacker there.

Try out the https://reviews.llvm.org/D119797 patch and see if this fixes this issue? I have just updated it after responding to comments.

Try out the https://reviews.llvm.org/D119797 patch and see if this fixes this issue? I have just updated it after responding to comments.

Yes, it fixes this particular race. But I think, that the issue is not only with stop hooks and lldb-vscode. Some other tools can do something similar to lldb-vscode and get the same issue. So, I think, it's still worth fixing the event dispatching.

lldb/source/Target/Process.cpp
2458 ↗(On Diff #409655)

Btw, this seems to be the line that sends the public stop event. So, adding synchronicity check to the condition should fix the issue as well. But I'd keep the target's hijacking listener too because we still want to intercept all event during the launch, not to remove them.

Try out the https://reviews.llvm.org/D119797 patch and see if this fixes this issue? I have just updated it after responding to comments.

Yes, it fixes this particular race. But I think, that the issue is not only with stop hooks and lldb-vscode. Some other tools can do something similar to lldb-vscode and get the same issue. So, I think, it's still worth fixing the event dispatching.

Definitely worth fixing this too.

Made requested changes.

I agree with @jingham that if the listener is passed via launch info then we don't need to activate it in CreateProcess.

Sounds good. I'm happy with this if Jim is.

Yes, that looks great. Thanks for persisting!

jingham accepted this revision.Feb 18 2022, 9:06 AM
This revision is now accepted and ready to land.Feb 18 2022, 9:06 AM
ilya-nozhkin set the repository for this revision to rG LLVM Github Monorepo.Feb 18 2022, 3:34 PM

Do you need someone to commit this for you? (I can probably do that tomorrow)

Do you need someone to commit this for you? (I can probably do that tomorrow)

Yes, that would be nice. I don't have commit access.

Committed. Thanks for the great patch.