This is an archive of the discontinued LLVM Phabricator instance.

Fix use-after-free in ThreadPlan, and add test.
Needs ReviewPublic

Authored by comex on Aug 21 2020, 5:07 PM.

Details

Reviewers
jingham
clayborg
Summary

Background:

ThreadPlan objects store a cached pointer to the associated Thread. To quote
the code:

We don't cache the thread pointer over resumes. This
Thread might go away, and another Thread represent
// the same underlying object on a later stop.

This can happen only when using an operating system plugin with
os-plugin-reports-all-threads = false (a new feature); otherwise, the
ThreadPlan will be wiped away when the Thread is.

Previously, this cached pointer was unowned, and ThreadPlan attempted to
prevent it from becoming stale by invalidating it in WillResume, reasoning that
the list of threads would only change whwen the target is running. However, it
turns out that the pointer can be re-cached after it's invalidated but before
the target actually starts running. At least one path where this happens is
ThreadPlan::ShouldReportRun -> GetPreviousPlan -> GetThread.

It might be possible to fix this by invalidating the pointer from other places,
but that seems unnecessarily risky and complicated. Instead, just keep around
a ThreadSP and check IsValid(), which becomes false when Thread::DestroyThread()
is called.

Note: This does not create a retain cycle because Thread does not own
ThreadPlans. (Even if it did, Thread::DestroyThread resets all of the thread's
owned pointers.)

As for testing, I made a small change to the existing reports-all-threads test
which causes it to trigger the use-after-free without the rest of the commit.

Diff Detail

Event Timeline

comex created this revision.Aug 21 2020, 5:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 21 2020, 5:07 PM
comex requested review of this revision.Aug 21 2020, 5:07 PM
comex retitled this revision from Fix use-after-free in ThreadPlan, and add test. Background: ThreadPlan objects store a cached pointer to the associated Thread. To quote the code: // We don't cache the thread pointer over resumes. This // Thread might go away, and another... to Fix use-after-free in ThreadPlan, and add test..Aug 21 2020, 5:09 PM
comex edited the summary of this revision. (Show Details)
JDevlieghere added inline comments.Aug 21 2020, 11:11 PM
lldb/include/lldb/Target/ThreadPlan.h
603

drive-by nit: since you're touching this, can you make it a Doxygen comment (///) above the variable?

I'm confused as to how this patch actually fixes the problem. When the thread gets removed from the thread list, it should get Destroy called on it - which should set m_destroy_called, causing IsValid to return false.. So I am not clear under what circumstances FindThreadByID will fail, but the cached thread shared pointer's IsValid is still true? If IsValid is holding true over the thread's removal from the thread list, then I'm worried that this change will keep us using the old ThreadSP that was reported the next time we stopped and this thread ID was represented by a different ThreadSP.

comex added a comment.Aug 25 2020, 4:30 AM

I'm confused as to how this patch actually fixes the problem. When the thread gets removed from the thread list, it should get Destroy called on it - which should set m_destroy_called, causing IsValid to return false.. So I am not clear under what circumstances FindThreadByID will fail, but the cached thread shared pointer's IsValid is still true? If IsValid is holding true over the thread's removal from the thread list, then I'm worried that this change will keep us using the old ThreadSP that was reported the next time we stopped and this thread ID was represented by a different ThreadSP.

Hmm… I’m confused by your comment. This patch isn’t meant to address a situation where IsValid is true but FindThreadByID fails. It addresses the situation where a ThreadPlan already has a cached thread pointer (and would like to reuse it without calling FindThreadByID at all), but IsValid is false because the thread was removed from the thread list since it was cached.

The existing code doesn’t check IsValid; it assumes that the thread list can’t be changed until a resume happens, at which point WillResume will be called and reset the cached thread pointer to null. However, in the buggy case I found, GetThread is called again after WillResume has already run. GetThread sets the cached thread pointer back to non-null, and then later when the thread list actually changes, the pointer becomes dangling.

With this patch, the pointer never gets reset to null, so it can end up pointing to a thread that has been removed from the list. But now it’s a shared_ptr, so it at least keeps the thread object alive. And every time GetThread is called, it checks (using IsValid) whether the thread has been removed. If so, GetThread throws out the cached pointer and falls back to calling FindThreadByID.

What's calling into the thread plans when WillResume has already been called? That seems wrong, since the thread list is in an uncertain state, having been cleared out for resume and not reset after the stop. It seems to me it would be a better fix to ensure that we aren't doing that.

comex added a comment.Aug 25 2020, 4:43 PM

The sequence I found is:

  • WillResume
  • DoResume sends eBroadcastBitAsyncContinue to ProcessGDBRemote::AsyncThread
  • AsyncThread calls process->SetPrivateState(eStateRunning);
  • …which sends eBroadcastBitStateChanged back to the main thread, handled by Process::HandlePrivateEvent
  • …which ends up in this stack, when it tries to figure out whether to report the state change to the user:
    • Process::HandlePrivateEvent ->
    • Process::ShouldBroadcastEvent ->
    • ThreadList::ShouldReportRun ->
    • Thread::ShouldReportRun ->
    • ThreadPlan::ShouldReportRun ->
    • ThreadPlan::GetPreviousPlan ->
    • ThreadPlan::GetThread

We should be able to calculate ShouldReportRun before we actually set the run going. That's better than just querying potentially stale threads. It would also be good to find a way to prevent ourselves from consulting the thread list after we've decided to invalidate it for run, but that's a second order consideration.

comex added a comment.Aug 26 2020, 6:08 PM

Disabling the thread list while the target is running sounds like a pretty complex change. For example, what should happen if a Python script calls lldb.process.GetThreadAtIndex(n) while the target is running, which currently works?

And is it really the right direction to be moving in? Long-term, wouldn't it be better if user-facing commands like thread list worked while the target is running? Or extra-long-term, one of the ideas on the LLDB projects page [1] is non-stop debugging (like GDB supports), where one thread is paused and can be inspected while other threads are still running. That would require a model where threads are created and destroyed on the fly, without a rigid sequence of "resume and thread list goes away; stop and thread list comes back".

FWIW, this bug causes intermittent crashes whenever I try to debug xnu, so I'd like to get it fixed relatively quickly if possible. Even just calculating ShouldReportRun earlier, while certainly doable, would be a considerably more complex change than this.

[1] https://lldb.llvm.org/status/projects.html#non-stop-debugging

I want to separate out two things here. One is whether lldb should internally ask questions of a thread once we've invalidated the thread list before running, the other is how we present threads to the user while the process is running.

I was only suggesting restricting the former. Once we've put the internal data structure that handles the thread list in an undefined state because we are about to resume, we shouldn't turn around and ask questions of it. That just seems like a bad practice.

The other question is what does it mean to hand out a thread list when the process is running. It has to be a historical artifact, right? We don't actually know that the threads we are handing out still exist, so I'm not really sure what handing them out would mean. For some platforms we might be able to track threads coming and going, but I'm pretty sure that we can't require that everywhere, os it can't be an essential part of lldb's design. In general, when the process is "continuing" lldb would like to change its behavior as little as possible, so unless there's a way to keep the thread list accurate while running that doesn't involve stopping and starting the process, we would rather not track this live. I know that both the Darwin user space and Darwin kernel don't provide such support.

And we can't answer any meaningful questions about the thread without pausing at least that thread. In non-stop mode, the threads that are running are still in an uncertain state, and only the threads that are stopped are known. So even when lldb supports keeping some threads running while others are stopped, we'll have to maintain a distinction between the stopped and the running threads, and be clear that to know something about a running thread we'll have to stop it. In fact, because the point of non-stop debugging is that there are some parts of the system that you really don't want to stop, in that mode we need to be even more careful to be clear about what operations do and don't require stopping the target.

But again, I don't think we need to have that discussion w.r.t. the current problem. This is all about how we treat our internal state, not what we show to users.