This is an archive of the discontinued LLVM Phabricator instance.

[process] fix exec support on Linux
ClosedPublic

Authored by wallace on Dec 28 2020, 2:09 PM.

Details

Reviewers
clayborg
jingham
Summary

On Linux, when an exec happens and there are thread plans alive, a crash occurs because the base thread plan invokes GetPrivateStopInfo, which ends up invoking GetThread().GetPrivateStopInfo(), and as the thread is already dead, that call crashes.

The method ProcessGDBRemote::SetLastStopPacket is the one that clears the threads upon an exec, before the base thread plan resumes its work.

This doesn't happen on Darwin though. For some context, on Linux, after a process performs exec, the thread id doesn't change. On Darwin it does change, which helps LLDB clear the thread automatically.

A simple fix is to clear all thread plans in the place where threads are already being cleared.

For some final context, the existing test was not failing because there were no thread plans active before getting to the exec.
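
A rough sketch of that idea (illustrative only, not the exact diff; the name of the thread-plan container is an assumption) is to extend the exec branch that already clears the thread lists:

// Sketch only: ProcessGDBRemote::SetLastStopPacket already clears both thread
// lists when did_exec is true, so the plans would be cleared in the same place.
if (did_exec) {
  m_thread_list_real.Clear();
  m_thread_list.Clear();
  m_thread_plans.Clear(); // assumed container of per-thread plan stacks
}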

Diff Detail

Event Timeline

wallace requested review of this revision.Dec 28 2020, 2:09 PM
wallace created this revision.
Herald added a project: Restricted Project.Dec 28 2020, 2:09 PM
wallace edited the summary of this revision.Dec 28 2020, 2:13 PM
wallace updated this revision to Diff 313904.Dec 28 2020, 2:45 PM

improve test

labath added a subscriber: labath.Dec 29 2020, 12:16 AM

It seems Thread plans already have logic to handle "exec" events. It would be better to figure out why that logic is not sufficient instead of adding another exec handling hook.

I am also surprised by the usage of the word "race". It seems to me that the problem (what exactly is the problem? How do the thread plans "fail"?) should be reproducible deterministically. The plans run when the process is stopped, and they should be deterministic. And we can control what the inferior does. Can you describe the problem that you're fixing in more detail?

wallace planned changes to this revision.Dec 29 2020, 11:27 AM
wallace updated this revision to Diff 313997.Dec 29 2020, 11:59 AM

Updated the description of the diff. It was actually very easy to find a deterministic reproduction of the failure. And the diff is actually minimal now.

wallace edited the summary of this revision.Dec 29 2020, 12:07 PM

The new version seems ok-ish to me, but I'm leaving it open as I'd like Jim to weigh in here. The immediate cause of the crash is related to the (dangling?) m_thread pointer inside the completed "step instruction" plan. Normally, this pointer clears itself when the thread disappears but I guess this does not happen in the exec case. Jim should know what's the expected behavior here.

lldb/test/API/functionalities/exec/TestExec.py
109–111

A separate test case would be better, as (even with the comment) a random future modification of the test could negate this.

wallace added inline comments.Jan 4 2021, 11:04 AM
lldb/test/API/functionalities/exec/TestExec.py
109–111

Sure! I'll wait for @jingham to chime in first

Walter and I identified this at work, definitely want Jim to chime in on this.

lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
2613–2615

These things seem like they could go into a new function on the lldb_private::Process base class like:

void Process::ProcessDidExec() {
  m_thread_list_real.Clear();
  m_thread_list.Clear();
  m_thread_plans.Clear();
}

I know we are encouraging people to use the lldb-server debugging route, but leaving this code only in ProcessGDBRemote means each Process subclass will need to figure out how to do it correctly on its own; a common method on the base class seems like an easy fix.
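
For illustration, a hypothetical call site for that method (placement assumed, not taken from the patch) would be the existing exec branch in ProcessGDBRemote::SetLastStopPacket:

// Sketch only: the plugin detects the exec and delegates the cleanup to the
// shared base-class method proposed above.
if (did_exec)
  ProcessDidExec();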

lldb/test/API/functionalities/exec/TestExec.py
110

Do we need this continue if we did the step above? How does this test still work if the step was supposed to step over the exec?

wallace added inline comments.Jan 4 2021, 2:40 PM
lldb/test/API/functionalities/exec/TestExec.py
110

It seems to me that the earlier StepInstruction happens before the exec is reached, and after this process.Continue() the exec will be reached

@jingham, friendly ping :)

I still think we should add a Process::ProcessDidExec() as mentioned in the inline comments.

jingham added inline comments.Jan 11 2021, 3:27 PM
lldb/test/API/functionalities/exec/TestExec.py
109–111

Sure! I'll wait for @jingham to chime in first

I'm a little confused about this crash.

Before lldb lets the target proceed, it copies the pre-proceed thread list into a temporary list. Whenever you stop in the debugger, the first thing that happens is ProcessGDBRemote gets the ThreadIDs from the stub and figures out which of those thread IDs still exist after the stop. If the process plugin sees a ThreadID that it also saw before the stop, then it should copy the ThreadSP of the thread over from the "seen at previous stop" list to the current thread list. At that point the Thread should be bona fide.

Then all the ShouldStop type questions iterate over the current thread list, so if somebody asks the plans of the exec Thread from above which it found in the current thread list, ThreadPlan::GetThread will look it up by ID in the current thread list, and find and return that valid thread.

OTOH, if the thread ID didn't still exist, the process plugin should discard it as part of processing the stop, and nobody should be asking its thread plans questions since again you should go through the current thread list and it won't be on that list.
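
A simplified sketch of the update being described (illustrative only; the real ProcessGDBRemote::UpdateThreadList does more bookkeeping and its signatures may differ):

// Sketch only: reuse ThreadSPs whose tid survived the stop, create Thread
// objects for new tids, and let every other thread drop off the list.
for (lldb::tid_t tid : tids_reported_by_stub) { // stand-in name for the stub's tid list
  ThreadSP thread_sp = old_thread_list.FindThreadByID(tid, /*can_update=*/false);
  if (!thread_sp)
    thread_sp = std::make_shared<ThreadGDBRemote>(*this, tid);
  new_thread_list.AddThreadSortedByIndexID(thread_sp);
}
// Threads whose tid was not reported never make it into new_thread_list.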

I don't see under what circumstances you are getting a query to the thread plan for a bogus thread in this instance.

So I'm a little worried that just throwing away the threads when we know we've exec'ed is papering over some flaw lower down in the stack. Can you give some more details on how exactly this is crashing? I don't have a Linux machine handy to try stuff out on.

Note, I'm also not at all clear what the StepInstruction in the test is for. At that point in the test, the target will have stopped at your first breakpoint in main.cpp, which is a fair ways (more than one instruction) before the exec in main.cpp.

When you stopped before the step, the thread you stopped on will have one ThreadPlan on its thread plan stack: the ThreadPlanBase. That's the plan which handles all the "unexpected" stops - ones that aren't governed by a thread plan that is actually driving the thread. For instance, it handles hitting an unrelated breakpoint while doing a step-over, etc. It is always present and can't be popped.

Doing a thread.StepInstruction at that point pushes the "StepOverInstruction" thread plan, runs the process till the StepOverInstruction thread plan says it is done, and then pops that thread plan. So when you stop after the instruction step, the thread plan stack for this thread should be the same as it was before the instruction step.

If this step-i is necessary to trigger the bug, then again I'm worried we're working around some logic goof rather than finding and fixing the actual problem.

clayborg added inline comments.Jan 11 2021, 3:37 PM
lldb/test/API/functionalities/exec/TestExec.py
111

On Darwin threads have unique identifiers and the thread ID before exec != thread ID after exec. On Linux, we get the same thread ID for the thread that does exec: thread ID before exec == thread ID after exec.

So on Darwin when we stop, we will remove all thread plans that are not part of the current thread list, and since the globally unique thread ID has changed, that will toss out any thread plans for the old thread. On Linux the thread plan stays around since the ID is the same, and since the thread lists get cleared on exec, we then have a case where the thread plan outlives the thread shared pointers. The thread lists (real and user visible) are cleared in ProcessGDBRemote::SetLastStopPacket() when did_exec == true.

jingham added inline comments.Jan 11 2021, 3:52 PM
lldb/test/API/functionalities/exec/TestExec.py
111

On Darwin threads have unique identifiers and the thread ID before exec != thread ID after exec. On Linux, we get the same thread ID for the thread that does exec: thread ID before exec == thread ID after exec.

So on Darwin when we stop, we will remove all thread plans that are not part of the current thread list, and since the globally unique thread ID has changed, that will toss out any thread plans for the old thread. On Linux the thread plan stays around since the ID is the same, and since the thread lists get cleared on exec, we then have a case where the thread plan outlives the thread shared pointers. The thread lists (real and user visible) are cleared in ProcessGDBRemote::SetLastStopPacket() when did_exec == true.

The ThreadPlans don't rely on thread shared pointers except as a cache. They work off the ThreadID as the primary bit of data. They have to do this to handle the case where a thread might not be reported on one stop, and then will be reported later on, as happens with OS Plugin threads in the xnu kernel.

The thread shared pointer cache variable gets cleared on "WillResume" and then when you stop, the first time it needs to get its Thread, it looks up its owner thread by ID from the Process ThreadList. So provided there's SOME thread with that ID after the stop, the thread plan really won't know any difference.

But more importantly, there should be no way to ask a ThreadPlan questions directly. You are always asking the Thread something (stop reason or ShouldStop or whatever), which then forwards the request to that Thread's ThreadPlan stack. So it shouldn't be possible to get to a ThreadPlan whose thread no longer exists. I'd like to know how that's happening, because that's going to be a wrong thing to do in any circumstances...

I've done a lightweight test and it seems that the BaseThreadPlan is being asked for the stop reason when the exec happens, but it holds a reference to the thread whose destructor has been called, which causes the crash. On Darwin, as Greg said, the BaseThreadPlan is deleted when the thread changes, so this doesn't happen.
Later this week I'll spend more time gathering logs and I'll share them here in a nice format.

Thanks for looking into this further! The thing to figure out is who still has a reference to either the Thread * or to the ThreadPlanStack over the destruction of the thread. That shouldn't be allowed to happen.

Jim

Sorry for returning late to this diff, but I have some additional information. This is what's happening:

  • Before the exec, the base thread plan holds a reference to the thread pointer
  • During the exec, the original thread is destroyed in ProcessGDBRemote::SetLastStopPacket, and later a new Thread pointer is created for the same tid. Notice that the base thread plan is still holding a reference to the previous pointer even though there's a new Thread pointer for the same tid. On Linux, exec preserves the tid. This step is where I think there's a wrong invariant.
  • Then, at some point, the base thread plan is asked its stop reason, and it'll try to use the destroyed pointer, thus crashing.

I've found three possible fixes for this (I'm pretty sure that you can come up with more):

  • The first one is what this diff does, which is clearing all the thread plans once the gdbremote client knows there's an exec.
  • The second one is modifying the ThreadPlan::GetThread method, which currently looks like this:

  if (m_thread)
    return *m_thread;

  ThreadSP thread_sp = m_process.GetThreadList().FindThreadByID(m_tid);
  m_thread = thread_sp.get();
  return *m_thread;

Changing it into:

  if (m_thread && m_thread->IsValid()) // IsValid will return false for a deleted thread
    return *m_thread;

  ThreadSP thread_sp = m_process.GetThreadList().FindThreadByID(m_tid);
  m_thread = thread_sp.get();
  return *m_thread;

also works, as it will make the ThreadPlan grab the new thread pointer.

  • Another solution is to create a method ThreadPlan::PrepareForExec() which just clears the m_thread pointer in the thread plan (a rough sketch follows below), and call it from ProcessGDBRemote::SetLastStopPacket, where we handle the exec.
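
A minimal sketch of that third option (the method does not exist today; its name is taken from the sentence above):

// Sketch only: drop the cached pointer so the next GetThread() has to look
// the thread up again by m_tid in the process's current thread list.
void ThreadPlan::PrepareForExec() { m_thread = nullptr; }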

I prefer the first solution, as the thread plans are all cleared in preparation for the new process.

wallace added a subscriber: jingham.EditedJan 21 2021, 4:28 PM

I've tried to find a way to move the calls the way you mentioned, but it
doesn't seem trivial.

Some more information:

  • The invocation to the thread plan is done by Thread::ShouldStop, where it does:

    // We're starting from the base plan, so just let it decide;
    if (current_plan->IsBasePlan()) {
      should_stop = current_plan->ShouldStop(event_ptr);

The interesting part of this call-stack starts
at Process::HandlePrivateEvent, which seems to be handling the Stop Event.
Where I think there's some inconsistent code is in
ThreadPlanBase::ShouldStop, because it tries to get the stop reason, which
fails for Linux, and then it processes eStopReasonExec a few lines below.
And see what it does:

case eStopReasonExec:
    // If we crashed, discard thread plans and stop.  Don't force the
    // discard, however, since on rerun the target may clean up this
    // exception and continue normally from there.
    LLDB_LOGF(
        log,
        "Base plan discarding thread plans for thread tid = 0x%4.4" PRIx64
        " (exec.)",
        m_tid);
    GetThread().DiscardThreadPlans(false);
    return true;

It does discard the thread plans for that thread! This makes me believe
that it should be fine to delete the thread plans in the first place. I
wasn't able to figure out more, but I can dig deeper if you give me a few
pointers. In any case, this last block of code makes me believe that
deleting the thread plans or resetting the thread pointer in the base thread
plan might be fine.

Btw, this is a bunch of logs I got, which start at a breakpoint, then
there's a "continue", which triggers what I'm saying. The backtrace is below

(lldb) bt
* thread #1, name = 'runner', stop reason = instruction step into
  * frame #0: 0x00007f1fe674838d libc.so.6`raise + 61
    frame #1: 0x0000000000400a3c runner`main(argc=2,
argv=0x00007fff4b78fc08) at runner.cpp:20
    frame #2: 0x00007f1fe6734555 libc.so.6`__libc_start_main + 245
    frame #3: 0x0000000000400919 runner`_start + 41
(lldb) thread plan list /////////////////////////////////////// See that
only the base plan is there
thread #1: tid = 0x2b72f1:
  Active plan stack:
    Element 0: Base thread plan.
  Completed plan stack:
    Element 0: Stepping one instruction past 0x00007f1fe6748387 stepping
into calls
    Element 1: Stepping one instruction past 0x00007f1fe6748387 stepping
into calls
(lldb) c
lldb             Process::Resume -- locking run lock
lldb             Process::PrivateResume() m_stop_id = 3, public state:
stopped private state: stopped
lldb             WillResume Thread #1 (0x0x7fcd30000da0): tid = 0x2b72f1,
pc = 0x7f1fe674838d, sp = 0x7fff4b78fad8, fp = 0x7fff4b78fb20, plan = 'base
plan', state = running, stop others = 0
lldb             Resuming thread: 2b72f1 with state: running.
lldb             0x4e7020
Listener::Listener('gdb-remote.resume-packet-sent')
lldb             0x4e7020 Listener::StartListeningForEvents (broadcaster =
0x4787a8, mask = 0x00010000) acquired_mask = 0x00010000 for
gdb-remote.resume-packet-sent
lldb             0x4e7020 Listener::StartListeningForEvents (broadcaster =
0x478ff8, mask = 0x00000004) acquired_mask = 0x00000004 for
gdb-remote.resume-packet-sent
lldb             0x494ab0
Broadcaster("lldb.process.gdb-remote.async-broadcaster")::BroadcastEvent
(event_sp = {0x7fcd40007cd0 Event: broadcaster = 0x478ff8
(lldb.process.gdb-remote.async-broadcaster), type = 0x00000001 (async
thread continue), data = {"c"}}, unique =0) hijack = (nil)
lldb             0x494bf0
Listener('lldb.process.gdb-remote.async-listener')::AddEvent (event_sp =
{0x7fcd40007cd0})
lldb             this = 0x00000000004E7020, timeout = 5000000 us for
gdb-remote.resume-packet-sent
b-remote.async>  0x494bf0 'lldb.process.gdb-remote.async-listener'
Listener::FindNextEventInternal(broadcaster=(nil),
broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event
0x7fcd40007cd0
b-remote.async>  Process::SetPrivateState (running)
b-remote.async>  0x483f30
Broadcaster("lldb.process.internal_state_broadcaster")::BroadcastEvent
(event_sp = {0x7fcd40005380 Event: broadcaster = 0x477dd0
(lldb.process.internal_state_broadcaster), type = 0x00000001, data = {
process = 0x477ce0 (pid = 2847473), state = running}}, unique =0) hijack =
(nil)
b-remote.async>  0x4841b0
Listener('lldb.process.internal_state_listener')::AddEvent (event_sp =
{0x7fcd40005380})
b-remote.async>  0x485cf0 Broadcaster("gdb-remote.client")::BroadcastEvent
(event_sp = {0x7fcd40005500 Event: broadcaster = 0x4787a8
(gdb-remote.client), type = 0x00010000, data = <NULL>}, unique =0) hijack =
(nil)
b-remote.async>  0x4e7020
Listener('gdb-remote.resume-packet-sent')::AddEvent (event_sp =
{0x7fcd40005500})
intern-state     0x4841b0 'lldb.process.internal_state_listener'
Listener::FindNextEventInternal(broadcaster=(nil),
broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event
0x7fcd40005380
lldb             0x4e7020 'gdb-remote.resume-packet-sent'
Listener::FindNextEventInternal(broadcaster=(nil),
broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event
0x7fcd40005500
intern-state     Current Plan for thread 1(0x7fcd30000da0) (0x2b72f1,
running): base plan being asked whether we should report run.
lldb             0x4e7020 Listener::Clear('gdb-remote.resume-packet-sent')
intern-state     Process::ShouldBroadcastEvent (0x7fcd40005380) => new
state: running, last broadcast state: running - YES
lldb             0x4e7020
Listener::~Listener('gdb-remote.resume-packet-sent')
intern-state     Process::HandlePrivateEvent (pid = 2847473) broadcasting
new state running (old state stopped) to public
lldb             Process thinks the process has resumed.
intern-state     Process::HandlePrivateEvent updated m_iohandler_sync to 3
Process 2847473 resuming
intern-state     0x483df0 Broadcaster("lldb.process")::BroadcastEvent
(event_sp = {0x7fcd40005380 Event: broadcaster = 0x477d18 (lldb.process),
type = 0x00000001 (state-changed), data = { process = 0x477ce0 (pid =
2847473), state = running}}, unique =0) hijack = (nil)
(lldb) intern-state     0x358b00 Listener('lldb.Debugger')::AddEvent
(event_sp = {0x7fcd40005380})
intern-state     timeout = <infinite>, event_sp)...
intern-state     this = 0x00000000004841B0, timeout = <infinite> for
lldb.process.internal_state_listener
dbg.evt-handler  0x358b00 'lldb.Debugger'
Listener::FindNextEventInternal(broadcaster=(nil),
broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event
0x7fcd40005380
dbg.evt-handler  Process::SetPublicState (state = running, restarted = 0)
dbg.evt-handler  this = 0x0000000000358B00, timeout = <infinite> for
lldb.Debugger


/////////////////////////////////////////////////////////////// This is
where we start handling the stop, see the ~Thread call below
b-remote.async>  this = 0x00007FCD30000DA0, pid = 2847473, tid = 2847473
b-remote.async>  0x7fcd30000da0 Thread::~Thread(tid = 0x2b72f1)
b-remote.async>  0x00007FCD30000DD8 Broadcaster::~Broadcaster("lldb.thread")
b-remote.async>  Process::SetPrivateState (stopped)
b-remote.async>  Process::SetPrivateState (stopped) stop_id = 4
b-remote.async>  0x483f30
Broadcaster("lldb.process.internal_state_broadcaster")::BroadcastEvent
(event_sp = {0x7fcd3c0736f0 Event: broadcaster = 0x477dd0
(lldb.process.internal_state_broadcaster), type = 0x00000001, data = {
process = 0x477ce0 (pid = 2847473), state = stopped}}, unique =0) hijack =
(nil)
b-remote.async>  0x4841b0
Listener('lldb.process.internal_state_listener')::AddEvent (event_sp =
{0x7fcd3c0736f0})
b-remote.async>  this = 0x0000000000494BF0, timeout = <infinite> for
lldb.process.gdb-remote.async-listener
intern-state     0x4841b0 'lldb.process.internal_state_listener'
Listener::FindNextEventInternal(broadcaster=(nil),
broadcaster_names=(nil)[0], event_type_mask=0x00000000, remove=1) event
0x7fcd3c0736f0
intern-state     0x7fcd302ac330
Listener::Listener('Communication::SyncronizeWithReadThread')
intern-state     0x7fcd302ac330 Listener::StartListeningForEvents
(broadcaster = 0x478228, mask = 0x00000020) acquired_mask = 0x00000020 for
Communication::SyncronizeWithReadThread
intern-state     0x7fcd302ac330
Listener::Clear('Communication::SyncronizeWithReadThread')
intern-state     0x7fcd302ac330
Listener::~Listener('Communication::SyncronizeWithReadThread')
intern-state     0x00007FCD302BC738 Broadcaster::Broadcaster("lldb.thread")

///////////////////////////////////////////////////// This is where the new
thread is created, but it points to the same tid
intern-state     0x7fcd302bc700 Thread::Thread(tid = 0x2b72f1)
intern-state     0x358b00 Listener::StartListeningForEvents (broadcaster =
0x7fcd302bc738, mask = 0x00000011) acquired_mask = 0x00000011 for
lldb.Debugger
intern-state     this = 0x00007FCD302BC700, pid = 2847473, tid = 2847473
intern-state     0x7fcd302bc700: tid = 0x2b72f1: stop info = <NULL>
(stop_id = 4)
intern-state     0x7fcd302bc700: tid = 0x2b72f1: stop info = exec (stop_id
= 4)
intern-state
intern-state     ThreadList::ShouldStop: 1 threads, 1 unsuspended threads
intern-state     Thread::ShouldStop(0x7fcd302bc700) for tid = 0x2b72f1
0x2b72f1, pc = 0x00007f3b751ce140
intern-state     ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
intern-state     Plan stack initial state:
  thread #1: tid = 0x2b72f1:
    Active plan stack:
      Element 0: Base thread plan.

intern-state     Plan base plan explains stop, auto-continue 0.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash
backtrace.
Stack dump without symbol names (ensure you have llvm-symbolizer in your
PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
./bin/lldb[0x298aca]
./bin/lldb[0x298ca4]
./bin/lldb[0x296ca0]
./bin/lldb[0x29a606]
/usr/local/fbcode/platform007/lib/libpthread.so.0(+0x12b80)[0x7fcd53167b80]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x47e5d96)[0x7fcd4c955d96]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x47e595a)[0x7fcd4c95595a]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x47cf23b)[0x7fcd4c93f23b]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x47e006c)[0x7fcd4c95006c]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x47253f5)[0x7fcd4c8953f5]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x4720f01)[0x7fcd4c890f01]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x472676e)[0x7fcd4c89676e]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x47257ed)[0x7fcd4c8957ed]
/home/wallace/llvm-sand/build/Debug/Linux-x86_64/llvm/bin/../lib/liblldb.so.12(+0x45856ec)[0x7fcd4c6f56ec]
/usr/local/fbcode/platform007/lib/libpthread.so.0(+0x76ce)[0x7fcd5315c6ce]
/usr/local/fbcode/platform007/lib/libc.so.6(clone+0x3f)[0x7fcd46d6135f]
Segmentation fault (core dumped)
//// This is the callstack I got from another run, but they are the same
* thread #4, name = 'intern-state', stop reason = signal SIGSEGV: invalid
address (fault address: 0x218)
  * frame #0: 0x00007f9983cccd96
liblldb.so.12`lldb_private::ThreadPlan::GetPrivateStopInfo(this=0x00007f99680014d0)
at ThreadPlan.h:521:24
    frame #1: 0x00007f9983ccc95a
liblldb.so.12`lldb_private::ThreadPlanBase::ShouldStop(this=0x00007f99680014d0,
event_ptr=0x00007f997404c310) at ThreadPlanBase.cpp:78:29
    frame #2: 0x00007f9983cb623b
liblldb.so.12`lldb_private::Thread::ShouldStop(this=0x00007f99682be2b0,
event_ptr=0x00007f997404c310) at Thread.cpp:871:35
    frame #3: 0x00007f9983cc706c
liblldb.so.12`lldb_private::ThreadList::ShouldStop(this=0x0000000000477fa8,
event_ptr=0x00007f997404c310) at ThreadList.cpp:330:48
    frame #4: 0x00007f9983c0c3f5
liblldb.so.12`lldb_private::Process::ShouldBroadcastEvent(this=0x0000000000477ce0,
event_ptr=0x00007f997404c310) at Process.cpp:3368:40
    frame #5: 0x00007f9983c07f01
liblldb.so.12`lldb_private::Process::HandlePrivateEvent(this=0x0000000000477ce0,
event_sp=std::__shared_ptr<lldb_private::Event,
__gnu_cxx::_S_atomic>::element_type @ 0x00007f997404c310) at
Process.cpp:3593:33
    frame #6: 0x00007f9983c0d76e
liblldb.so.12`lldb_private::Process::RunPrivateStateThread(this=0x0000000000477ce0,
is_secondary_thread=false) at Process.cpp:3787:7
    frame #7: 0x00007f9983c0c7ed
liblldb.so.12`lldb_private::Process::PrivateStateThread(arg=0x0000000000494d80)
at Process.cpp:3685:25
    frame #8: 0x00007f9983a6c6ec
liblldb.so.12`lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(arg=0x00000000004982e0)
at HostNativeThreadBase.cpp:68:10
    frame #9: 0x00007f998a4d36ce
libpthread.so.0`start_thread(arg=0x00007f996f7fe700) at
pthread_create.c:465:7
    frame #10: 0x00007f997e0d835f libc.so.6`__clone at clone.S:95

The ThreadPlanStack always has one base thread plan. You can't get rid of that one; the system relies on its always being there. Despite its name, DiscardThreadPlans doesn't actually discard all the thread plans, just all but the base thread plan... So I think that's neither here nor there...

The thing I want to see is the sequence of two events: "Make the new thread list after stopping for exec" and "Call ThreadPlan::GetThread() for the first time after the stop for the exec", where we go from no cached thread to caching the thread from the old thread list.

The former is ProcessGDBRemote::UpdateThreadList.

You can figure out the latter by running the debugger on an lldb that is debugging a program that exec's. Have the inferior lldb drive its inferior to the state just before the exec. In the superior lldb, set a breakpoint on ThreadPlan::GetThread(). Keep continuing in the inferior lldb if you have to till you see it get the stop packet that corresponds to the exec. At this point all the thread plans should have no cached threads, so they are going to have to look them up. Keep going from there to the first time that ThreadPlan::GetThread is called and has to look up the thread with FindThreadByID call. If I understand what you are describing, that first call should pull up the old thread pointer and cache it.

If that happens before the call to ProcessGDBRemote::UpdateThreadList finishes that would be bad, and we have to stop it doing that. If what's going on is that we set up the new thread list, broadcast the event, and then change the thread list again in response to detecting that this is an exec, that would also be bad. We really should have a finalized thread list for the stop before we start trying to figure out what to do with the stop.

Jim

Jim, thanks for the pointers! I think we are getting close to the issue. After doing what you asked, I found out the following:

  • I set up the state of the lldb debugging the program that will exec right before it execs
  • Then I do "continue"
  • ThreadPlan::WillResume is invoked, which effectively clears out the m_thread cache. After this, I would expect no calls to ThreadPlan::GetThread until the new thread list is created
  • Then suddenly ThreadList::ShouldReportRun is invoked, which is handling the resume event. It eventually asks the current ThreadPlan's ShouldReportRun, which invokes GetPreviousPlan() if m_run_vote is eVoteNoOpinion:
Vote ThreadPlan::ShouldReportRun(Event *event_ptr) {
  if (m_run_vote == eVoteNoOpinion) {
    ThreadPlan *prev_plan = GetPreviousPlan();
    if (prev_plan)
      return prev_plan->ShouldReportRun(event_ptr);
  }
  return m_run_vote;
}

This ends up invoking GetThread():

ThreadPlan *GetPreviousPlan() { return GetThread().GetPreviousPlan(this); }

Which causes the incorrect value to be cached.

  • After this event is processed, the exec stop happens, but the ThreadPlanBase's m_thread cache is not null, which breaks the invariant that you mentioned:

Keep continuing in the inferior lldb if you have to till you see it get the stop packet that corresponds to the exec. At this point all the thread plans should have no cached threads, so they are going to have to look them up.

I think that the ShouldReportRun is unavoidable at that step, so I thought of a couple of solutions.

  • One is to modify GetPreviousPlan() to something like this (a sketch of the matching GetThread change follows after this list):
ThreadPlan *GetPreviousPlan() { return GetThread(/*cache_thread*/ false).GetPreviousPlan(this); }

This allows traversing the thread plan stack without caching the thread. This might be fine because ShouldReportRun is merely looking for any thread plan in the stack with an m_run_vote other than eVoteNoOpinion. It doesn't use the thread for any other reason.

  • Another solution is to force clear the m_thread caches in all the ThreadPlans as soon as there's a stop event.
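
For the first option above, a sketch of what a non-caching GetThread could look like (the cache_thread parameter is hypothetical, not an existing argument):

Thread &ThreadPlan::GetThread(bool cache_thread /* = true */) {
  if (m_thread)
    return *m_thread;

  ThreadSP thread_sp = m_process.GetThreadList().FindThreadByID(m_tid);
  if (cache_thread)
    m_thread = thread_sp.get(); // only cache when the caller says it is safe
  return *thread_sp;
}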

What do you think?

Excellent, thanks for persisting on this!

I think your second idea sounds less error prone than having to figure out whether the cache is trustworthy at a particular point. Maybe even better than triggering off a stop event is to clear the caches before we do whatever might make the threads in the cache invalid. That should only happen in Process::UpdateThreadList, so we could just move the current virtual UpdateThreadList method -> DoUpdateThreadList, and add a non-virtual UpdateThreadList that clears the cache and then calls DoUpdateThreadList.

There's also a non-virtual UpdateThreadListIfNeeded, and we could do it there before calling the virtual UpdateThreadList. But then we'd have to ensure that the only way to call UpdateThreadList is through UpdateThreadListIfNeeded, which seems hard to ensure formally. So I think just wrapping the virtual method to clear the cache before updating the thread list is the most robust way to do this.

Jim
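
A rough sketch of the shape that refactor could take (illustrative; the real signatures in Process.h and the name of the cache-clearing helper may differ):

// Sketch only: the plugin-overridable method becomes DoUpdateThreadList, and
// the non-virtual wrapper clears every plan's cached Thread * before any
// plugin code can rebuild the thread list.
bool Process::UpdateThreadList(ThreadList &old_thread_list,
                               ThreadList &new_thread_list) {
  m_thread_plans.ClearThreadCache(); // assumed helper on the thread-plan container
  return DoUpdateThreadList(old_thread_list, new_thread_list);
}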

wallace updated this revision to Diff 318640.Jan 22 2021, 1:34 PM

Updated based on @jingham's idea, and added an independent test for this.

jingham accepted this revision.Jan 22 2021, 3:33 PM

Yes, this is how I should have done it originally. Thanks!

I had one suggestion for making the test more compact, which you can do or not as you please. Other than that, LGTM.

lldb/test/API/functionalities/exec/TestExec.py
151–176

I don't think you need to set the second breakpoint till after you've stopped at the first one. If not, you should be able to replace everything from the "exe = self.getBuildArtifact" to the self.assertTrue(len(threads == 1)) with:

(target, process, thread, breakpoint1) = lldbutil.run_to_source_breakpoint(self, 'Set breakpoint 1 here', lldb.SBFileSpec('main.cpp', False))
breakpoint2 = target.BreakpointCreateBySourceRegex(...)

and "threads[0]" below -> "thread", which is a lot easier to read.

This revision is now accepted and ready to land.Jan 22 2021, 3:33 PM
wallace added inline comments.Jan 25 2021, 11:17 AM
lldb/test/API/functionalities/exec/TestExec.py
151–176

nice! run_to_source_breakpoint is super handy

wallace updated this revision to Diff 319069.Jan 25 2021, 11:18 AM

apply suggestion

wallace closed this revision.Jan 25 2021, 2:10 PM