This is an archive of the discontinued LLVM Phabricator instance.

[lldb/Core] Fix a race in the Communication class
ClosedPublic

Authored by labath on Apr 2 2020, 5:13 AM.

Details

Summary

Communication::SynchronizeWithReadThread is called whenever a process
stops to ensure that we process all of its stdout before we report the
stop. If the process exits, we first call this method, and then close
the connection.

However, when the child process exits, the thread reading its stdout
will usually (but not always) read an EOF because the other end of the
pty has been closed. In response to an EOF, the Communication read
thread closes it's end of the connection too.

This can result in a race where the read thread is closing the
connection while the synchronizing thread is attempting to get its
attention via Connection::InterruptRead.

The fix is to hold the synchronization mutex while closing the
connection.

I've found this issue while tracking down a rare flake in some of the
vscode tests. I am not sure this is the cause of those failures (as I
would have expected this issue to manifest itself differently), but it
is an issue nonetheless.

The attached test demonstrates the steps needed to reproduce the race.
It will fail under tsan without this patch.

Diff Detail

Event Timeline

labath created this revision.Apr 2 2020, 5:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 2 2020, 5:13 AM
Herald added a subscriber: mgorny. · View Herald Transcript
clayborg added inline comments.Apr 2 2020, 3:37 PM
lldb/source/Core/Communication.cpp
318

Can we just init "disconnect" right here?:

const bool disconnect = comm->GetCloseOnEOF();
322–324

Is there a reason we call AppendBytesToCache with zero bytes in the else? If we do need to call it (it might listen to the "status"?) then this:

if (bytes_read > 0)
  comm->AppendBytesToCache(buf, bytes_read, true, status);
else if ((bytes_read == 0) && status == eConnectionStatusEndOfFile) {
  disconnect = comm->GetCloseOnEOF();
  comm->AppendBytesToCache(buf, bytes_read, true, status);
}

can just be:

comm->AppendBytesToCache(buf, bytes_read, true, status);

bytes_read is a size_t so it can't be less than zero. Then we can move the "disconnect = comm->GetCloseOnEOF();" into the eConnectionStatusEndOfFile case in the switch below.

labath marked 3 inline comments as done.Apr 6 2020, 6:27 AM
labath added inline comments.
lldb/source/Core/Communication.cpp
318

That won't work because we there are paths out of this function which should not disconnect (even in the close-on-eof case). For example, the read thread can be disabled by setting comm->m_read_thread_enabled = false in which case we should exit even though we have not reached EOF.

322–324

AppendBytesToCache calls a callback and I believe the idea is to call it with a 0-byte buffer on EOF. But the refactor looks good. I've made the AppendBytesToCache call conditional on bytes_read > 0 || status == EOF to preserve the existing behavior of not calling the callback in case of errors.

labath updated this revision to Diff 255310.Apr 6 2020, 6:27 AM
labath marked an inline comment as done.

refactor EOF handling

clayborg accepted this revision.Apr 6 2020, 11:40 AM
This revision is now accepted and ready to land.Apr 6 2020, 11:40 AM
This revision was automatically updated to reflect the committed changes.

Hi Pavel,

After this landed I started seeing a bunch of reproducers fail with SIGPIPE. I ignored the signal and added a check for EPIPE to the ::write call in PipePosix.cpp. Below is the thread state:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGPIPE
  * frame #0: 0x00007fff6876c55e libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff6882f5c2 libsystem_pthread.dylib`_pthread_join + 347
    frame #2: 0x000000010115ef62 liblldb.11.0.0git.dylib`lldb_private::HostThreadPosix::Join(this=0x000000011e7ca490, result=0x00007ffeefbfe290) at HostThreadPosix.cpp:28:15
    frame #3: 0x000000010110b64f liblldb.11.0.0git.dylib`lldb_private::HostThread::Join(this=0x000000011e9f09a0, result=0x00007ffeefbfe290) at HostThread.cpp:21:27
    frame #4: 0x00000001012ea5ba liblldb.11.0.0git.dylib`lldb_private::Process::ControlPrivateStateThread(this=0x000000011e9f0818, signal=1) at Process.cpp:3642:30
    frame #5: 0x00000001012dbc81 liblldb.11.0.0git.dylib`lldb_private::Process::StopPrivateStateThread(this=0x000000011e9f0818) at Process.cpp:3593:5
    frame #6: 0x00000001012dc514 liblldb.11.0.0git.dylib`lldb_private::Process::Destroy(this=0x000000011e9f0818, force_kill=true) at Process.cpp:3317:7
    frame #7: 0x0000000100981cc1 liblldb.11.0.0git.dylib`lldb::SBProcess::Kill(this=0x000000011e767de0) at SBProcess.cpp:662:35

* thread #2
  * frame #0: 0x00007fff6876c4ce libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff6882aaa1 libsystem_pthread.dylib`_pthread_wqthread + 390
    frame #2: 0x00007fff68829b77 libsystem_pthread.dylib`start_wqthread + 15

* thread #3, name = '<lldb.process.internal-state(pid=47347)>'
  * frame #0: 0x00007fff6876c55e libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff6882f5c2 libsystem_pthread.dylib`_pthread_join + 347
    frame #2: 0x000000010115ef62 liblldb.11.0.0git.dylib`lldb_private::HostThreadPosix::Join(this=0x000000011e7ca510, result=0x0000000000000000) at HostThreadPosix.cpp:28:15
    frame #3: 0x000000010110b64f liblldb.11.0.0git.dylib`lldb_private::HostThread::Join(this=0x000000011e9f0e28, result=0x0000000000000000) at HostThread.cpp:21:27
    frame #4: 0x0000000100f46f4e liblldb.11.0.0git.dylib`lldb_private::Communication::StopReadThread(this=0x000000011e9f0de8, error_ptr=0x0000000000000000) at Communication.cpp:239:32
    frame #5: 0x00000001012e9d95 liblldb.11.0.0git.dylib`lldb_private::Process::ShouldBroadcastEvent(this=0x000000011e9f0818, event_ptr=0x000000011e76aa40) at Process.cpp:3390:27
    frame #6: 0x00000001012e5e4f liblldb.11.0.0git.dylib`lldb_private::Process::HandlePrivateEvent(this=0x000000011e9f0818, event_sp=std::__1::shared_ptr<lldb_private::Event>::element_type @ 0x000000011e76aa40 strong=2 weak=1) at Process.cpp:3697:33
    frame #7: 0x00000001012eb147 liblldb.11.0.0git.dylib`lldb_private::Process::RunPrivateStateThread(this=0x000000011e9f0818, is_secondary_thread=false) at Process.cpp:3891:7
    frame #8: 0x00000001012ea2fa liblldb.11.0.0git.dylib`lldb_private::Process::PrivateStateThread(arg=0x000000011e7c9a60) at Process.cpp:3789:25
    frame #9: 0x000000010110b45b liblldb.11.0.0git.dylib`lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostNativeThreadBase.cpp:68:10
    frame #10: 0x000000010726ab83 liblldb.11.0.0git.dylib`lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostThreadMacOSX.mm:68:10

* thread #4, name = '<lldb.comm.process.stdio>', stop reason = breakpoint 1.1
  * frame #0: 0x000000010116071a liblldb.11.0.0git.dylib`lldb_private::PipePosix::Write(this=0x000000011e7ca2b0, buf=43698118819080247, size=1, bytes_written=0x00007000076908b8) at PipePosix.cpp:312:7
    frame #1: 0x0000000101156d15 liblldb.11.0.0git.dylib`lldb_private::ConnectionFileDescriptor::Disconnect(this=0x000000011e7ca210, error_ptr=0x0000000000000000) at ConnectionFileDescriptorPosix.cpp:321:30
    frame #2: 0x0000000100f454e6 liblldb.11.0.0git.dylib`lldb_private::Communication::Disconnect(this=0x000000011e9f0de8, error_ptr=0x0000000000000000) at Communication.cpp:98:46
    frame #3: 0x0000000100f46d4e liblldb.11.0.0git.dylib`lldb_private::Communication::ReadThread(p=0x000000011e9f0de8) at Communication.cpp:377:13
    frame #4: 0x000000010110b45b liblldb.11.0.0git.dylib`lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostNativeThreadBase.cpp:68:10
    frame #5: 0x000000010726ab83 liblldb.11.0.0git.dylib`lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostThreadMacOSX.mm:68:10
    frame #6: 0x00007fff6882e109 libsystem_pthread.dylib`_pthread_start + 148
    frame #7: 0x00007fff68829b8b libsystem_pthread.dylib`thread_start + 15

Any chance you might know what's going on here?

Hi Pavel,

After this landed I started seeing a bunch of reproducers fail with SIGPIPE. I ignored the signal and added a check for EPIPE to the ::write call in PipePosix.cpp. Below is the thread state:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGPIPE
  * frame #0: 0x00007fff6876c55e libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff6882f5c2 libsystem_pthread.dylib`_pthread_join + 347
    frame #2: 0x000000010115ef62 liblldb.11.0.0git.dylib`lldb_private::HostThreadPosix::Join(this=0x000000011e7ca490, result=0x00007ffeefbfe290) at HostThreadPosix.cpp:28:15
    frame #3: 0x000000010110b64f liblldb.11.0.0git.dylib`lldb_private::HostThread::Join(this=0x000000011e9f09a0, result=0x00007ffeefbfe290) at HostThread.cpp:21:27
    frame #4: 0x00000001012ea5ba liblldb.11.0.0git.dylib`lldb_private::Process::ControlPrivateStateThread(this=0x000000011e9f0818, signal=1) at Process.cpp:3642:30
    frame #5: 0x00000001012dbc81 liblldb.11.0.0git.dylib`lldb_private::Process::StopPrivateStateThread(this=0x000000011e9f0818) at Process.cpp:3593:5
    frame #6: 0x00000001012dc514 liblldb.11.0.0git.dylib`lldb_private::Process::Destroy(this=0x000000011e9f0818, force_kill=true) at Process.cpp:3317:7
    frame #7: 0x0000000100981cc1 liblldb.11.0.0git.dylib`lldb::SBProcess::Kill(this=0x000000011e767de0) at SBProcess.cpp:662:35

* thread #2
  * frame #0: 0x00007fff6876c4ce libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff6882aaa1 libsystem_pthread.dylib`_pthread_wqthread + 390
    frame #2: 0x00007fff68829b77 libsystem_pthread.dylib`start_wqthread + 15

* thread #3, name = '<lldb.process.internal-state(pid=47347)>'
  * frame #0: 0x00007fff6876c55e libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff6882f5c2 libsystem_pthread.dylib`_pthread_join + 347
    frame #2: 0x000000010115ef62 liblldb.11.0.0git.dylib`lldb_private::HostThreadPosix::Join(this=0x000000011e7ca510, result=0x0000000000000000) at HostThreadPosix.cpp:28:15
    frame #3: 0x000000010110b64f liblldb.11.0.0git.dylib`lldb_private::HostThread::Join(this=0x000000011e9f0e28, result=0x0000000000000000) at HostThread.cpp:21:27
    frame #4: 0x0000000100f46f4e liblldb.11.0.0git.dylib`lldb_private::Communication::StopReadThread(this=0x000000011e9f0de8, error_ptr=0x0000000000000000) at Communication.cpp:239:32
    frame #5: 0x00000001012e9d95 liblldb.11.0.0git.dylib`lldb_private::Process::ShouldBroadcastEvent(this=0x000000011e9f0818, event_ptr=0x000000011e76aa40) at Process.cpp:3390:27
    frame #6: 0x00000001012e5e4f liblldb.11.0.0git.dylib`lldb_private::Process::HandlePrivateEvent(this=0x000000011e9f0818, event_sp=std::__1::shared_ptr<lldb_private::Event>::element_type @ 0x000000011e76aa40 strong=2 weak=1) at Process.cpp:3697:33
    frame #7: 0x00000001012eb147 liblldb.11.0.0git.dylib`lldb_private::Process::RunPrivateStateThread(this=0x000000011e9f0818, is_secondary_thread=false) at Process.cpp:3891:7
    frame #8: 0x00000001012ea2fa liblldb.11.0.0git.dylib`lldb_private::Process::PrivateStateThread(arg=0x000000011e7c9a60) at Process.cpp:3789:25
    frame #9: 0x000000010110b45b liblldb.11.0.0git.dylib`lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostNativeThreadBase.cpp:68:10
    frame #10: 0x000000010726ab83 liblldb.11.0.0git.dylib`lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostThreadMacOSX.mm:68:10

* thread #4, name = '<lldb.comm.process.stdio>', stop reason = breakpoint 1.1
  * frame #0: 0x000000010116071a liblldb.11.0.0git.dylib`lldb_private::PipePosix::Write(this=0x000000011e7ca2b0, buf=43698118819080247, size=1, bytes_written=0x00007000076908b8) at PipePosix.cpp:312:7
    frame #1: 0x0000000101156d15 liblldb.11.0.0git.dylib`lldb_private::ConnectionFileDescriptor::Disconnect(this=0x000000011e7ca210, error_ptr=0x0000000000000000) at ConnectionFileDescriptorPosix.cpp:321:30
    frame #2: 0x0000000100f454e6 liblldb.11.0.0git.dylib`lldb_private::Communication::Disconnect(this=0x000000011e9f0de8, error_ptr=0x0000000000000000) at Communication.cpp:98:46
    frame #3: 0x0000000100f46d4e liblldb.11.0.0git.dylib`lldb_private::Communication::ReadThread(p=0x000000011e9f0de8) at Communication.cpp:377:13
    frame #4: 0x000000010110b45b liblldb.11.0.0git.dylib`lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostNativeThreadBase.cpp:68:10
    frame #5: 0x000000010726ab83 liblldb.11.0.0git.dylib`lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostThreadMacOSX.mm:68:10
    frame #6: 0x00007fff6882e109 libsystem_pthread.dylib`_pthread_start + 148
    frame #7: 0x00007fff68829b8b libsystem_pthread.dylib`thread_start + 15

Any chance you might know what's going on here?

Hm... this probably didn't introduce the bug, but I can certainly see how it could cause an existing issue to start to appear more often.

I think I have an idea for what could be problem here -- a race between the Disconnect call in Communication::ReadThread and the one in Process::ShouldBroadcastEvent. Can you check if swapping the order of these two lines in Process::ShouldBroadcastEvent:

m_stdio_communication.Disconnect();
m_stdio_communication.StopReadThread();

makes the problem go away?

Hi Pavel,

After this landed I started seeing a bunch of reproducers fail with SIGPIPE. I ignored the signal and added a check for EPIPE to the ::write call in PipePosix.cpp. Below is the thread state:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGPIPE
  * frame #0: 0x00007fff6876c55e libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff6882f5c2 libsystem_pthread.dylib`_pthread_join + 347
    frame #2: 0x000000010115ef62 liblldb.11.0.0git.dylib`lldb_private::HostThreadPosix::Join(this=0x000000011e7ca490, result=0x00007ffeefbfe290) at HostThreadPosix.cpp:28:15
    frame #3: 0x000000010110b64f liblldb.11.0.0git.dylib`lldb_private::HostThread::Join(this=0x000000011e9f09a0, result=0x00007ffeefbfe290) at HostThread.cpp:21:27
    frame #4: 0x00000001012ea5ba liblldb.11.0.0git.dylib`lldb_private::Process::ControlPrivateStateThread(this=0x000000011e9f0818, signal=1) at Process.cpp:3642:30
    frame #5: 0x00000001012dbc81 liblldb.11.0.0git.dylib`lldb_private::Process::StopPrivateStateThread(this=0x000000011e9f0818) at Process.cpp:3593:5
    frame #6: 0x00000001012dc514 liblldb.11.0.0git.dylib`lldb_private::Process::Destroy(this=0x000000011e9f0818, force_kill=true) at Process.cpp:3317:7
    frame #7: 0x0000000100981cc1 liblldb.11.0.0git.dylib`lldb::SBProcess::Kill(this=0x000000011e767de0) at SBProcess.cpp:662:35

* thread #2
  * frame #0: 0x00007fff6876c4ce libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff6882aaa1 libsystem_pthread.dylib`_pthread_wqthread + 390
    frame #2: 0x00007fff68829b77 libsystem_pthread.dylib`start_wqthread + 15

* thread #3, name = '<lldb.process.internal-state(pid=47347)>'
  * frame #0: 0x00007fff6876c55e libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff6882f5c2 libsystem_pthread.dylib`_pthread_join + 347
    frame #2: 0x000000010115ef62 liblldb.11.0.0git.dylib`lldb_private::HostThreadPosix::Join(this=0x000000011e7ca510, result=0x0000000000000000) at HostThreadPosix.cpp:28:15
    frame #3: 0x000000010110b64f liblldb.11.0.0git.dylib`lldb_private::HostThread::Join(this=0x000000011e9f0e28, result=0x0000000000000000) at HostThread.cpp:21:27
    frame #4: 0x0000000100f46f4e liblldb.11.0.0git.dylib`lldb_private::Communication::StopReadThread(this=0x000000011e9f0de8, error_ptr=0x0000000000000000) at Communication.cpp:239:32
    frame #5: 0x00000001012e9d95 liblldb.11.0.0git.dylib`lldb_private::Process::ShouldBroadcastEvent(this=0x000000011e9f0818, event_ptr=0x000000011e76aa40) at Process.cpp:3390:27
    frame #6: 0x00000001012e5e4f liblldb.11.0.0git.dylib`lldb_private::Process::HandlePrivateEvent(this=0x000000011e9f0818, event_sp=std::__1::shared_ptr<lldb_private::Event>::element_type @ 0x000000011e76aa40 strong=2 weak=1) at Process.cpp:3697:33
    frame #7: 0x00000001012eb147 liblldb.11.0.0git.dylib`lldb_private::Process::RunPrivateStateThread(this=0x000000011e9f0818, is_secondary_thread=false) at Process.cpp:3891:7
    frame #8: 0x00000001012ea2fa liblldb.11.0.0git.dylib`lldb_private::Process::PrivateStateThread(arg=0x000000011e7c9a60) at Process.cpp:3789:25
    frame #9: 0x000000010110b45b liblldb.11.0.0git.dylib`lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostNativeThreadBase.cpp:68:10
    frame #10: 0x000000010726ab83 liblldb.11.0.0git.dylib`lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostThreadMacOSX.mm:68:10

* thread #4, name = '<lldb.comm.process.stdio>', stop reason = breakpoint 1.1
  * frame #0: 0x000000010116071a liblldb.11.0.0git.dylib`lldb_private::PipePosix::Write(this=0x000000011e7ca2b0, buf=43698118819080247, size=1, bytes_written=0x00007000076908b8) at PipePosix.cpp:312:7
    frame #1: 0x0000000101156d15 liblldb.11.0.0git.dylib`lldb_private::ConnectionFileDescriptor::Disconnect(this=0x000000011e7ca210, error_ptr=0x0000000000000000) at ConnectionFileDescriptorPosix.cpp:321:30
    frame #2: 0x0000000100f454e6 liblldb.11.0.0git.dylib`lldb_private::Communication::Disconnect(this=0x000000011e9f0de8, error_ptr=0x0000000000000000) at Communication.cpp:98:46
    frame #3: 0x0000000100f46d4e liblldb.11.0.0git.dylib`lldb_private::Communication::ReadThread(p=0x000000011e9f0de8) at Communication.cpp:377:13
    frame #4: 0x000000010110b45b liblldb.11.0.0git.dylib`lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostNativeThreadBase.cpp:68:10
    frame #5: 0x000000010726ab83 liblldb.11.0.0git.dylib`lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(arg=0x000000011e7ca4e0) at HostThreadMacOSX.mm:68:10
    frame #6: 0x00007fff6882e109 libsystem_pthread.dylib`_pthread_start + 148
    frame #7: 0x00007fff68829b8b libsystem_pthread.dylib`thread_start + 15

Any chance you might know what's going on here?

Hm... this probably didn't introduce the bug, but I can certainly see how it could cause an existing issue to start to appear more often.

I think I have an idea for what could be problem here -- a race between the Disconnect call in Communication::ReadThread and the one in Process::ShouldBroadcastEvent. Can you check if swapping the order of these two lines in Process::ShouldBroadcastEvent:

m_stdio_communication.Disconnect();
m_stdio_communication.StopReadThread();

makes the problem go away?

Thanks, that did the trick! (b424b0bf731d1fb338180553a86f912746f640ad)

Cool, I've also pushed 9321255b that fixes up other occurrences of this pattern, and adds an assertion to prevent it from happening again.