This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Avoid data race in IOHandlerProcessSTDIO
ClosedPublic

Authored by JDevlieghere on Mar 1 2022, 12:37 PM.

Details

Summary

This patch fixes a data race in IOHandlerProcessSTDIO. The race happens between the main thread and the event handling thread. The main thread is running the IOHandler (IOHandlerProcessSTDIO::Run()) when an event comes in that makes us pop the process IO handler, which involves cancelling the IOHandler (IOHandlerProcessSTDIO::Cancel). The latter calls SetIsDone(true), which modifies m_is_done. At the same time, the main thread is reading the variable through GetIsDone().

This patch avoids the race by using a mutex to synchronize the two threads. On the event thread, in the IOHandlerProcessSTDIO::Cancel method, we obtain the lock before changing the value of m_is_done. On the main thread, in IOHandlerProcessSTDIO::Run(), we obtain the lock before reading the value of m_is_done. Additionally, we delay calling SetIsDone until after the loop exits, to avoid a potential race between the two writes.
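In outline, the synchronization described above looks something like this (a sketch only, not the verbatim patch; the loop body and exact locking granularity are assumptions):

  // Sketch of the fix described in the summary (assumed details marked).
  class IOHandlerProcessSTDIO : public IOHandler {
  public:
    void Run() override {
      while (true) {
        {
          std::lock_guard<std::mutex> guard(m_mutex);
          if (GetIsDone()) // main thread: read m_is_done under the lock
            break;
        }
        // ... read from stdin and forward the input to the process ...
      }
      // Delayed until after the loop exits, to avoid racing Cancel()'s write.
      std::lock_guard<std::mutex> guard(m_mutex);
      SetIsDone(true);
    }

    void Cancel() override {
      std::lock_guard<std::mutex> guard(m_mutex);
      SetIsDone(true); // event thread: write m_is_done under the lock
    }

  private:
    std::mutex m_mutex; // guards m_is_done against the cross-thread race
  };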

  Write of size 1 at 0x00010b66bb68 by thread T7 (mutexes: write M2862, write M718324145051843688):
    #0 lldb_private::IOHandler::SetIsDone(bool) IOHandler.h:90 (liblldb.15.0.0git.dylib:arm64+0x971d84)
    #1 IOHandlerProcessSTDIO::Cancel() Process.cpp:4382 (liblldb.15.0.0git.dylib:arm64+0x5ddfec)
    #2 lldb_private::Debugger::PopIOHandler(std::__1::shared_ptr<lldb_private::IOHandler> const&) Debugger.cpp:1156 (liblldb.15.0.0git.dylib:arm64+0x3cb2a8)
    #3 lldb_private::Debugger::RemoveIOHandler(std::__1::shared_ptr<lldb_private::IOHandler> const&) Debugger.cpp:1063 (liblldb.15.0.0git.dylib:arm64+0x3cbd2c)
    #4 lldb_private::Process::PopProcessIOHandler() Process.cpp:4487 (liblldb.15.0.0git.dylib:arm64+0x5c583c)
    #5 lldb_private::Debugger::HandleProcessEvent(std::__1::shared_ptr<lldb_private::Event> const&) Debugger.cpp:1549 (liblldb.15.0.0git.dylib:arm64+0x3ceabc)
    #6 lldb_private::Debugger::DefaultEventHandler() Debugger.cpp:1622 (liblldb.15.0.0git.dylib:arm64+0x3cf2c0)
    #7 std::__1::__function::__func<lldb_private::Debugger::StartEventHandlerThread()::$_2, std::__1::allocator<lldb_private::Debugger::StartEventHandlerThread()::$_2>, void* ()>::operator()() function.h:352 (liblldb.15.0.0git.dylib:arm64+0x3d1bd8)
    #8 lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(void*) HostNativeThreadBase.cpp:62 (liblldb.15.0.0git.dylib:arm64+0x4c71ac)
    #9 lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(void*) HostThreadMacOSX.mm:18 (liblldb.15.0.0git.dylib:arm64+0x29ef544)

  Previous read of size 1 at 0x00010b66bb68 by main thread:
    #0 lldb_private::IOHandler::GetIsDone() IOHandler.h:92 (liblldb.15.0.0git.dylib:arm64+0x971db8)
    #1 IOHandlerProcessSTDIO::Run() Process.cpp:4339 (liblldb.15.0.0git.dylib:arm64+0x5ddc7c)
    #2 lldb_private::Debugger::RunIOHandlers() Debugger.cpp:982 (liblldb.15.0.0git.dylib:arm64+0x3cb48c)
    #3 lldb_private::CommandInterpreter::RunCommandInterpreter(lldb_private::CommandInterpreterRunOptions&) CommandInterpreter.cpp:3298 (liblldb.15.0.0git.dylib:arm64+0x506478)
    #4 lldb::SBDebugger::RunCommandInterpreter(bool, bool) SBDebugger.cpp:1166 (liblldb.15.0.0git.dylib:arm64+0x53604)
    #5 Driver::MainLoop() Driver.cpp:634 (lldb:arm64+0x100006294)
    #6 main Driver.cpp:853 (lldb:arm64+0x100007344)

Diff Detail

Event Timeline

JDevlieghere created this revision. Mar 1 2022, 12:37 PM
Herald added a project: Restricted Project. Mar 1 2022, 12:37 PM
JDevlieghere requested review of this revision. Mar 1 2022, 12:37 PM
JDevlieghere added a reviewer: mib.
labath added a comment. Mar 1 2022, 1:07 PM

These kinds of changes rarely fix a bug -- usually they just change a (detectable) data race into some nondeterministic runtime behavior.

That's because, even though e.g. IsActive can compute its result in a race-free manner, there's nothing preventing someone from invalidating it one nanosecond after the function returns (and before anyone manages to act on it). And if it isn't possible for someone to concurrently change the values that IsActive depends on, then we wouldn't have a data race in the first place.

Most of the time, that means the result of the function is useless. I haven't tried (yet) to figure out how that problem could manifest itself in this case (the iohandler logic is fairly complex), but I'm pretty sure that this should be fixed at a higher level.
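The check-then-act problem described here, in miniature (hypothetical names, not LLDB code):

  #include <mutex>

  std::mutex m;       // hypothetical
  bool active = true; // protected by m

  bool IsActive() {
    std::lock_guard<std::mutex> guard(m);
    return active; // the read itself is race-free...
  }

  void Caller() {
    if (IsActive()) {
      // ...but another thread can flip `active` right here, before we
      // act on the result, so the logical race remains.
    }
  }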

lldb/source/Target/Process.cpp
4404

Locking a mutex in a signal handler is not usually safe.
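The hazard in miniature (a hypothetical, self-contained example, not the patch code): if the signal is delivered while the interrupted thread already holds the lock, the handler deadlocks against its own thread.

  #include <csignal>
  #include <mutex>

  std::mutex g_mutex; // hypothetical

  void Handler(int) {
    // std::mutex::lock is not async-signal-safe; if the interrupted
    // thread already holds g_mutex, this blocks on a lock owned by the
    // very frame it interrupted and never returns.
    std::lock_guard<std::mutex> guard(g_mutex);
  }

  void Work() {
    std::lock_guard<std::mutex> guard(g_mutex);
    // A SIGINT delivered here would make Handler() deadlock.
  }

  int main() {
    std::signal(SIGINT, Handler);
    Work();
  }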

JDevlieghere retitled this revision from "[lldb] Protect control variables in IOHandler with a mutex to avoid a data race" to "[lldb] Avoid data race in IOHandlerProcessSTDIO".
JDevlieghere edited the summary of this revision.

These kinds of changes rarely fix a bug -- usually they just change a (detectable) data race into some nondeterministic runtime behavior.

That's because, even though e.g. IsActive can compute its result in a race-free manner, there's nothing preventing someone from invalidating it one nanosecond after the function returns (and before anyone manages to act on it). And if it isn't possible for someone to concurrently change the values that IsActive depends on, then we wouldn't have a data race in the first place.

Most of the time, that means the result of the function is useless. I haven't tried (yet) to figure out how that problem could manifest itself in this case (the iohandler logic is fairly complex), but I'm pretty sure that this should be fixed at a higher level.

I think in this particular case the race is fairly "harmless". In the read/write scenario, at worst we execute the run-loop in IOHandlerProcessSTDIO once more than we should. In case of a write/write race we are concurrently writing the same value. Anyway, that's not an excuse not to fix the underlying issue. I think the updated patch should take care of that.

This seems like a better approach, but I'm not sure how it actually works (see inline comment). Are you sure that we're still sending input to the process (I'm not sure how much test coverage we have for this)?

lldb/source/Target/Process.cpp
4342

I'm confused. How come this does not immediately terminate due to GetIsDone (set through SetIsDone(true) via SetIsRunning(true) on line 4339) returning true?

JDevlieghere marked an inline comment as done. Mar 2 2022, 12:40 AM
JDevlieghere added inline comments.
lldb/source/Target/Process.cpp
4342

Yeah, this is bogus: that line should read SetIsDone(!running);. I originally had these 3 lines inline, made a typo when extracting the function, and didn't rerun the tests.

JDevlieghere marked an inline comment as done.

Fix typo in SetIsRunning.

Are you sure that we're still sending input to the process (I'm not sure how much test coverage we have for this)?

I'll rerun the tests tomorrow with my typo and see if anything catches this. If not, I can add a shell/pexpect test with the little test program I was using, which echoes whatever it reads from stdin.

labath added a comment. Mar 2 2022, 1:36 AM

Are you sure that we're still sending input to the process (I'm not sure how much test coverage we have for this)?

I'll rerun the tests tomorrow with my typo and see if anything catches this. If not, I can add a shell/pexpect test with the little test program I was using, which echoes whatever it reads from stdin.

Cool.

Unsurprisingly, no tests failed with the typo. I added a test case to cover reading from stdin through IOHandlerProcessSTDIO.
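A minimal stand-in for such an echo program (a sketch; the inferior actually used by the test may differ):

  #include <iostream>
  #include <string>

  // Echo every line read from stdin back to stdout, so a pexpect test
  // can verify that input sent to the process comes back out.
  int main() {
    std::string line;
    while (std::getline(std::cin, line))
      std::cout << line << std::endl; // flush so pexpect sees it promptly
    return 0;
  }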

labath accepted this revision. Mar 2 2022, 11:21 AM

Looks good. Thanks for the test.

lldb/test/API/iohandler/stdio/TestIOHandlerProcessSTDIO.py
17

The default should be fine here.

This revision is now accepted and ready to land. Mar 2 2022, 11:21 AM

How is using m_mutex better than just using the std::atomic<bool>? Just protecting the modification of the value with a mutex doesn't seem like it would do much more than the atomic already did, unless we are using the lock to protect the value for longer than just the modification.
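The distinction the question is drawing, illustrated (hypothetical names, not the patch code):

  #include <atomic>
  #include <mutex>

  std::atomic<bool> atomic_done{false};

  std::mutex m;
  bool guarded_done = false; // protected by m

  void AtomicVersion() {
    if (!atomic_done.load()) {
      // The load itself is race-free, but the gap between the check and
      // any action taken here is unprotected: another thread may set
      // atomic_done at any moment.
    }
  }

  void MutexVersion() {
    std::lock_guard<std::mutex> guard(m);
    if (!guarded_done) {
      // guarded_done cannot change while the lock is held, so the check
      // and the dependent work form one consistent critical section.
    }
  }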

lldb/source/Target/Process.cpp
4383

Won't this cause a deadlock? You aren't using a recursive mutex here, and you are locking m_mutex and then calling SetIsDone(true), which will try to lock the mutex as well?

4440–4441

Does this need to be a recursive mutex? It seems from the code that we have the same thread locking this mutex multiple times?

I think you might be looking at a combination of the old and the new patch. The new mutex protects the whole of Cancel and SetIsRunning. I don't think this needs to be a recursive mutex because these functions do not call each other. They both indirectly protect access to SetIsDone.
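Sketched out, that structure looks roughly like this (function bodies are inferred from the discussion, not quoted from the patch):

  // Each entry point takes m_mutex exactly once at the top; SetIsDone()
  // itself does no locking, so a plain (non-recursive) mutex suffices.
  void IOHandlerProcessSTDIO::SetIsRunning(bool running) {
    std::lock_guard<std::mutex> guard(m_mutex);
    SetIsDone(!running);     // the corrected line from the earlier comment
    m_is_running = running;  // assumed member name
  }

  // Cancel() follows the same pattern: lock m_mutex once at the top,
  // then call the unlocked SetIsDone(true).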

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Mar 2 2022, 3:56 PM