This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Fix data race in ThreadList
ClosedPublic

Authored by augusto2112 on Aug 15 2023, 4:12 PM.

Details

Summary

ThreadSanitizer reports the following issue:

Write of size 8 at 0x00010a70abb0 by thread T3 (mutexes: write M0):
  #0 lldb_private::ThreadList::Update(lldb_private::ThreadList&) ThreadList.cpp:741 (liblldb.18.0.0git.dylib:arm64+0x5dedf4) (BuildId: 9bced2aafa373580ae9d750d9cf79a8f32000000200000000100000000000e00)
  #1 lldb_private::Process::UpdateThreadListIfNeeded() Process.cpp:1212 (liblldb.18.0.0git.dylib:arm64+0x53bbec) (BuildId: 9bced2aafa373580ae9d750d9cf79a8f32000000200000000100000000000e00)

Previous read of size 8 at 0x00010a70abb0 by main thread (mutexes: write M1):
  #0 lldb_private::ThreadList::GetMutex() const ThreadList.cpp:785 (liblldb.18.0.0git.dylib:arm64+0x5df138) (BuildId: 9bced2aafa373580ae9d750d9cf79a8f32000000200000000100000000000e00)
  #1 lldb_private::ThreadList::DidResume() ThreadList.cpp:656 (liblldb.18.0.0git.dylib:arm64+0x5de5c0) (BuildId: 9bced2aafa373580ae9d750d9cf79a8f32000000200000000100000000000e00)
  #2 lldb_private::Process::PrivateResume() Process.cpp:3130 (liblldb.18.0.0git.dylib:arm64+0x53cd7c) (BuildId: 9bced2aafa373580ae9d750d9cf79a8f32000000200000000100000000000e00)

Fix this by only using the mutex in ThreadList and removing the one in
process entirely.

Diff Detail

Event Timeline

augusto2112 created this revision.Aug 15 2023, 4:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 15 2023, 4:12 PM
augusto2112 requested review of this revision.Aug 15 2023, 4:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 15 2023, 4:12 PM
This revision is now accepted and ready to land.Aug 15 2023, 4:26 PM
This revision was automatically updated to reflect the committed changes.
labath added a subscriber: labath.Aug 22 2023, 2:52 AM

After this patch, python_api/run_locker/TestRunLocker.py becomes flaky. https://lab.llvm.org/buildbot/#/builders/68/builds/58456 is the first such failure, but there have been about a dozen failures since then. The backtraces on the buildbot page are fairly useless, but I was able to capture this backtrace locally:

include/c++/v1/vector:1537: assertion __n < size() failed: vector[] index out of bounds
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  0x000055dd180146be llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 46
1  0x000055dd18014d6c
2  0x00007fea9b3ce1c0
3  0x00007fea9b236347 raise + 167
4  0x00007fea9b237797 abort + 247
5  0x000055dd18862597
6  0x000055dd124c4563 lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadPc(std::__u::shared_ptr<lldb_private::Thread> const&, unsigned long) + 275
7  0x000055dd124c42fc lldb_private::process_gdb_remote::ProcessGDBRemote::DoUpdateThreadList(lldb_private::ThreadList&, lldb_private::ThreadList&) + 748
8  0x000055dd129c8687 lldb_private::Process::UpdateThreadListIfNeeded() + 247
9  0x000055dd12a39797 lldb_private::ThreadList::FindThreadByID(unsigned long, bool) + 55
10 0x000055dd12a396db lldb_private::ThreadList::GetSelectedThread() + 43
11 0x000055dd129a8dea lldb_private::ExecutionContext::ExecutionContext(lldb_private::Target*, bool) + 234
12 0x00007fea99840bf9 lldb::SBTarget::EvaluateExpression(char const*, lldb::SBExpressionOptions const&) + 281
13 0x00007fea99840a7b lldb::SBTarget::EvaluateExpression(char const*) + 187
14 0x00007fea9995eb77
15 0x000055dd185907cd
16 0x000055dd18545814 _PyObject_Call + 292
17 0x000055dd1865fecb _PyEval_EvalFrameDefault + 11515
18 0x000055dd1865cfa8 _PyEval_Vector + 184
19 0x000055dd186666c3
20 0x000055dd18666906
21 0x000055dd1865e56f _PyEval_EvalFrameDefault + 5023
22 0x000055dd1865cfa8 _PyEval_Vector + 184
23 0x000055dd186666c3
24 0x000055dd18666906
25 0x000055dd1865e56f _PyEval_EvalFrameDefault + 5023
26 0x000055dd1865cfa8 _PyEval_Vector + 184
27 0x000055dd1865fecb _PyEval_EvalFrameDefault + 11515
28 0x000055dd1865cfa8 _PyEval_Vector + 184
<a lot of additional python frames>

Given that the failure is in the DoUpdateThreadList function, I'm guessing that this patch removed/reduced some existing synchronization which prevented this code from executing concurrently with something else (I don't know what), although it's possible that the right fix (since updating the thread list while the process is running doesn't make sense) is to bail out of EvaluateExpression sooner (@jingham ?).

If you want to reproduce this, note that the bug reproduces relatively infrequently (~1% for me, though it seems to be a bit higher on the buildbot), so you may need to run it many times before you can catch it in action. The failing part is not always the same (e.g. sometimes it just hangs), but I expect the root cause to be the same. I've also confirmed that the failure rate of the test goes down to zero after reverting this patch.

Can you investigate (and possibly revert this patch in the mean time)?