This is an archive of the discontinued LLVM Phabricator instance.

[lldb-server/linux] Fix waitpid for multithreaded forks
ClosedPublic

Authored by labath on Dec 29 2021, 5:54 AM.

Details

Summary

The lldb-server code is currently set up in a way that each
NativeProcess instance does its own waitpid handling. This works fine
for BSDs, where the code can do a waitpid(process_id), and get
information for all threads in that process.

The situation is trickier on linux, because waitpid(pid) will only
return information for the main thread of the process (one whose tid ==
pid). For this reason the linux code does a waitpid(-1), to get
information for all threads. This was fine while we were supporting just
a single process, but becomes a problem when we have multiple processes
as they end up stealing each others events.

There are two possible solutions to this problem:

  • call waitpid(-1) centrally, and then dispatch the events to the appropriate process
  • have each process call waitpid(tid) for all the threads it manages

This patch implements the second approach. Besides fitting better into
the existing design, it also has the added benefit of ensuring
predictable ordering for thread/process creation events (which come in
pairs -- one for the parent and one for the child). The first approach
OTOH, would make this ordering even more complicated since we would
have to keep the half-threads hanging in mid-air until we find the
process we should attach them to.

The downside to this approach is an increased number of syscalls (one
waitpid for each thread), but I think we're pretty far from optimizing
things like this, and so the cleanliness of the design is worth it.

The included test reproduces the circumstances which should demonstrate
the bug (which manifests as a hung test), but I have not been able to
get it to fail. The only place I've seen this failure modes are very
rare hangs in the thread sanitizer tests (tsan forks an addr2line
process to produce its error messages).

Diff Detail

Event Timeline

labath requested review of this revision.Dec 29 2021, 5:54 AM
labath created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptDec 29 2021, 5:54 AM

Thank you a lot for doing this. I love how the code becomes simpler, and the cost doesn't seem much.

lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
1909

Hmm, I'm a bit confused here. Why would the key already exist? I mean, you start with empty tid_events and catch only one event for every thread, correct?

1918

Are we talking about some kind of race here? Or some thread that appears in m_threads but is not returned by GetThreadByID()?

I was wondering if you could use thread pointers as keys.

labath added inline comments.Dec 30 2021, 1:53 AM
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
1909

It doesn't exist. It's just how the api is named. c++ std::map offers both emplace and try_emplace (which, ironically, don't differ in "trying" -- they only differ in the way they construct the pair). DenseMap offers only try_emplace (maybe for that reason).

1918

The problem is when a thread disappears. This can happen in case of a main thread exit or an execve, in which case we remove all non-main threads from the list. However, we can still have some pending events for the other threads. Now, I haven't managed to reproduce this in my experiments, but the manpage is adamant that a SIGKILL should immediately terminate a process. In my (limited) tests the debugger always got a PTRACE_EVENT_EXIT stop for each threads (which is again something that the manpage says should not happen), so we theoretically (with careful management of thread lifetimes) might ensure that a thread with pending events does not disappear, but depending on it doesn't seem like a good idea.

mgorny added inline comments.Dec 30 2021, 2:26 AM
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
1918

Hmm, so it could disappear while MonitorCallback() is executing; do I understand correctly?

labath added inline comments.Dec 30 2021, 3:18 AM
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
1918

Yes. I meant disappearing from the list we maintain, not from the system.

mgorny accepted this revision.Dec 31 2021, 4:08 AM

LGTM.

This revision is now accepted and ready to land.Dec 31 2021, 4:08 AM
This revision was automatically updated to reflect the committed changes.