Page MenuHomePhabricator

[lldb] [Process] Watch for fork/vfork notifications
ClosedPublic

Authored by mgorny on Mar 17 2021, 3:19 PM.

Details

Summary

Watch for fork(2)/vfork(2) (also fork/vfork-style clone(2) on Linux)
notifications and explicitly detach the forked child process, and add
initial tests for these cases. The code covers FreeBSD, Linux
and NetBSD process plugins. There is no new user-visible functionality
provided -- this change lays foundations over subsequent work on fork
support.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

It includes refactoring clone/fork/vfork event monitoring into a single function. Right now, only forks are supported — the handler creates a local NativeProcessLinux instance (@labath, any suggestions how to make this less hacky?), duplicates parent's breakpoints into it (just like the child gets parent's memory) and uses it to clear these breakpoints and detach the child process.

I don't find this particularly hacky. However, I have doubts about the breakpoint clearing aspect. If we add that now, I think it will be tricky to move that process to the client in the future (how will the client know whether the breakpoints were auto-removed?) Given that breakpoints in forked process have been borked since forever, I don't see harm in keeping that broken for a little while longer. Even a client with extremely limited of multiple processes (like the one we had in mind) should not have a problem with sending the appropriate z packets before it detaches. This patch could still come first (it's great that you've separated that out) -- it just needs to ensure that there are no breakpoints in the child path which would need clearing.

Btw, have you by any chance looked at how gdb determines which clone()s constitute a new thread?

lldb/include/lldb/Host/common/NativeProcessProtocol.h
401–404 ↗(On Diff #331609)

That's cause you were using an enum to define constants. If you were using it to define a type, then I would be fine (though I might ask to make it a scoped enum).

145 ↗(On Diff #333124)

I certainly don't think we need to try to "fix" older clients by doing stuff to the server. (And on top of that, I am less worried about people using older clients, than I am about older servers -- those can be harder to update.)

lldb/source/Host/linux/Host.cpp
323

Why is this checking for zero ? (When it's being set to INVALID_PID above?)

lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
459

I dunno. I recommend debugging it. It smells like some kind of undefined behavior.

898

if (pgid_ret != child_pid) ?

lldb/test/Shell/Subprocess/fork-follow-parent.test
7

There are several ways around that, but I don't know which one is the best, as I'm not sure what exactly you're trying to test. You could try to capture this nondeterminism with CHECK-DAGs, but CHECK-DAGs don't combine well with CHECK-NOTs. Another option is to make this predictable by adding synchronization to the inferior. If you just wanted to check that the child process finishes correctly, you could have the parent process check its exit status and match that...

It includes refactoring clone/fork/vfork event monitoring into a single function. Right now, only forks are supported — the handler creates a local NativeProcessLinux instance (@labath, any suggestions how to make this less hacky?), duplicates parent's breakpoints into it (just like the child gets parent's memory) and uses it to clear these breakpoints and detach the child process.

I don't find this particularly hacky. However, I have doubts about the breakpoint clearing aspect. If we add that now, I think it will be tricky to move that process to the client in the future (how will the client know whether the breakpoints were auto-removed?) Given that breakpoints in forked process have been borked since forever, I don't see harm in keeping that broken for a little while longer. Even a client with extremely limited of multiple processes (like the one we had in mind) should not have a problem with sending the appropriate z packets before it detaches. This patch could still come first (it's great that you've separated that out) -- it just needs to ensure that there are no breakpoints in the child path which would need clearing.

Actually, that's not a problem at all. If client doesn't indicate fork-events support via qSupported, we handle breakpoints server-side. If it does, we deliver fork stop reason and then the client is responsible for dealing with breakpoints.

Btw, have you by any chance looked at how gdb determines which clone()s constitute a new thread?

If I read the code correctly, GDB always assumes PTRACE_EVENT_CLONE is a new thread, and does not support spawning processes via it.

mgorny added inline comments.Mar 30 2021, 6:20 AM
lldb/test/Shell/Subprocess/fork-follow-parent.test
7

The goal is to check that parent is stopped before printing but the child is not. I suppose your last suggestion, i.e. checking for exit status, should be good enough.

It includes refactoring clone/fork/vfork event monitoring into a single function. Right now, only forks are supported — the handler creates a local NativeProcessLinux instance (@labath, any suggestions how to make this less hacky?), duplicates parent's breakpoints into it (just like the child gets parent's memory) and uses it to clear these breakpoints and detach the child process.

I don't find this particularly hacky. However, I have doubts about the breakpoint clearing aspect. If we add that now, I think it will be tricky to move that process to the client in the future (how will the client know whether the breakpoints were auto-removed?) Given that breakpoints in forked process have been borked since forever, I don't see harm in keeping that broken for a little while longer. Even a client with extremely limited of multiple processes (like the one we had in mind) should not have a problem with sending the appropriate z packets before it detaches. This patch could still come first (it's great that you've separated that out) -- it just needs to ensure that there are no breakpoints in the child path which would need clearing.

Actually, that's not a problem at all. If client doesn't indicate fork-events support via qSupported, we handle breakpoints server-side. If it does, we deliver fork stop reason and then the client is responsible for dealing with breakpoints.

Yeah, but then we have to maintain two copies of breakpoint-unsetting code in perpetuity. Given that we've managed to come this far with this being completely broken, I think we can wait a little longer until the client support is in place.

Btw, have you by any chance looked at how gdb determines which clone()s constitute a new thread?

If I read the code correctly, GDB always assumes PTRACE_EVENT_CLONE is a new thread, and does not support spawning processes via it.

Ah, cute. I guess someone should let them know about this (and maybe also tell kernel folks that differentiating cloned processes is really hard). :)

Yeah, but then we have to maintain two copies of breakpoint-unsetting code in perpetuity. Given that we've managed to come this far with this being completely broken, I think we can wait a little longer until the client support is in place.

Ok, I guess I can do that.

What about debug registers on FreeBSD? This system is pretty unique because child processes inherit dbregs from parent, and therefore we need to clear them. GDB doesn't do that and watchpoints crash forked children. Should we keep the clearing part as a hack specific to FreeBSD, or have lldb generic clear all hardware breakpoints and watchpoints on fork? Independently of that, I'm going to file a bug for FreeBSD to see if this is really the intended behavior.

mgorny updated this revision to Diff 334296.Mar 30 2021, 4:09 PM
mgorny marked 4 inline comments as done.

Removed software breakpoint cleanup. Modified tests not to expect child's output at any specific point.

mgorny edited the summary of this revision. (Show Details)Mar 30 2021, 4:09 PM
labath added a comment.Apr 1 2021, 1:46 AM

What about debug registers on FreeBSD? This system is pretty unique because child processes inherit dbregs from parent, and therefore we need to clear them. GDB doesn't do that and watchpoints crash forked children. Should we keep the clearing part as a hack specific to FreeBSD, or have lldb generic clear all hardware breakpoints and watchpoints on fork?

I think we can keep the freebsd-specific clearing code (in the server). Clearing all hardware breakpoints in the client would be pretty logical, but if we wanted to make that work consistently, then we might need to add code to /other/ operating system implementations to /set/ the breakpoints (so that the client can clear them properly)...

lldb/include/lldb/Host/linux/Host.h
14

These spaces are pretty common in lldb, but that's not actually consistent with how the rest of llvm formats them...

lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp
726

And the netbsd comment might apply to freebsd as well...

lldb/source/Plugins/Process/FreeBSD/NativeRegisterContextFreeBSD_arm64.cpp
290–291
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
649–667

It looks like these two (and probably also PTRACE_EVENT_CLONE) could be bundled together.

661

If this fails (I guess that can only happen if the entire process was SIGKILLED), how does continuing with pid=0 help?
I'm not sure that resuming the parent (as the clone case does) is significantly better, but it seems like we should at least be consistent...

888

auto *

913

I think we have some macro or attribute for that these days...

918

Copying the breakpoints is not needed yet, right? We could just group these two (fork/vfork) paths for now...

934

llvm_unreachable("unknown clone_info.event");

937

erase(find_it) would be slightly more efficient. (And it might be better to put this before the switch, to make sure it executes in case of early exits.)

lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.cpp
763

For NetBSD, it might be cleaner to keep waiting for the parent only, and then do a separate waitpid for the child once you get its process id. That would make the sequence of events (and the relevant code) predictable.

I think this is how this interface was meant to be used, even on linux, but linux makes it tricky because it's not possible to wait for all threads belonging to a given process in a single call -- one would have to iterate through the threads and wait for each one separately (it might be interesting to try to implement and benchmark that).

lldb/test/Shell/Subprocess/Inputs/clone.c
23 ↗(On Diff #334296)

Better include an alignas directive, just in case.

lldb/test/Shell/Subprocess/Inputs/vfork.c
1 ↗(On Diff #334296)

Maybe merge these three files into one, parameterized by a define or something?

lldb/test/Shell/Subprocess/clone-follow-parent-wp.test
2

Why only these three arches?

lldb/test/Shell/Subprocess/fork-follow-parent.test
2

This (and maybe others) should have UNSUPPORTED: system-windows, I guess.

mgorny added inline comments.Apr 1 2021, 3:52 PM
lldb/test/Shell/Subprocess/clone-follow-parent-wp.test
2

Right, I was supposed to replace this with something better later. It's supposed to limit the test to arches where we support watchpoints.

mgorny updated this revision to Diff 334947.Apr 2 2021, 6:27 AM
mgorny marked 11 inline comments as done.

Update the Linux plugin and tests, mostly.

TODO: FreeBSD, NetBSD and watchpoint feature for tests.

mgorny added inline comments.Apr 2 2021, 7:11 AM
lldb/include/lldb/Host/linux/Host.h
14

Will remove. However, note that this causes clang-format to reorder the includes.

lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.cpp
763

I've been thinking about it and I suppose it makes sense. I was considering what would be better to future-proof it for the future full multiprocess support but I suppose we could just loop over all process classes and let every one wait for its own process rather than creating a dedicated dispatcher that calls waitpid(-1...).

mgorny updated this revision to Diff 334957.Apr 2 2021, 7:51 AM
mgorny edited the summary of this revision. (Show Details)

Simplified FreeBSD/NetBSD to use per-PID waitpid(). Removed arch restrictions from wp tests — made them consistent with Python tests.

mgorny marked 5 inline comments as done.Apr 2 2021, 7:52 AM

@labath, I think I've made all the requested changes.

labath added a comment.Apr 8 2021, 4:05 AM

@labath, I think I've made all the requested changes.

This looks pretty good now. Just some small nits inline...

lldb/include/lldb/Host/linux/Host.h
14

And it should do that -- that's part of the point :)

lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp
723

Is the loop still necessary?

976–982

assert(event&PL_FLAG_FORKED) would be shorter, but I am not sure if we need even that, as this is already checked in the caller. You could just drop the entire event argument, if it is unused..

lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.h
125

move this next to the other Monitor functions ?

lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
653

Maybe obsolete by now, but I just realized what you actually meant here.

There will always be some divergence between the two sequence of events. Letting this be handled from the main loop may make it smaller though. I don't really mind doing it that way, as long as the (v)fork and clone cases do it the same way.

lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.cpp
763

I think that the looping would be the cleanest solution for the BSDs. For linux, we will need to call waitpid centrally, and then somehow choose which instance to notify.

764

same here

lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.h
118

same here

lldb/test/API/functionalities/gdb_remote_client/TestMultiprocess.py
3

bad merge

lldb/test/Shell/Subprocess/Inputs/fork.cpp
31–43

How about:

#if def TEST_CLONE
  pid_t pid = clone(child_func, &stack[sizeof(stack)], 0, NULL);
#elif defined TEST_FORK
  pid_t pid = TEST_FORK(); // defined to fork of vfork
  if (pid == 0)
    _exit(child_func(NULL));
#endif

?

lldb/test/Shell/Subprocess/clone-follow-parent.test
3

The UNSUPPORTED:windows is _not_ required if you explicitly list supported operating systems. :)

mgorny marked 3 inline comments as done.Apr 8 2021, 4:23 AM

Answered where answer was due, will update the rest once I finish retesting the multiprocess patch.

lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp
723

I am not sure but I think there's no harm in keeping it, and it might save us from some hard-to-debug corner conditions in the future.

976–982

The distinction is necessary to know how to deal with software breakpoints, and it will be used in the followup patch (D99864, when I extend it to non-Linux targets). I think removing event at this point to reintroduce it would be a wasted effort.

That said, I guess this is yet another case for llvm_unreachable(). Also, I think it's worth keeping assert()s like this as they make sure we don't break stuff accidentally when changing code on the caller's end.

labath added inline comments.Apr 8 2021, 4:44 AM
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp
723

I don't mind that much, though I would be surprised if this can help with anything. IIUC, the BSD kernel always sends one notification when the process stops, and I believe this code is written in a way which expects that (only linux has special code to delay some actions until every thread registers as stopped). If, for whatever reason, we end up getting two notifications here, I expect things would blow up spectacularly...

976–982

True, but it's even better when the code is written such that the asserts/unreachables are unnecessary. One way to do that might be to do the decoding in the caller:

if (info.pl_flags & PL_FLAG_FORKED) {
  MonitorClone(info.pl_child_pid, /*bool is_vfork*/ fork_flags & PL_FLAG_VFORKED);
  return;
}

It's hard to say whether that would work without seeing the real implementation.

mgorny added inline comments.Apr 8 2021, 5:23 AM
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp
976–982

In the long run, we would probably also want a separate event for posix_spawn on NetBSD. I've figured out that reusing native constants avoids reinventing the wheel unnecessarily.

mgorny marked 10 inline comments as done.Apr 8 2021, 5:38 AM

Ok, I think I've implemented all the requested changes. I'm going to test it on all three platforms now and attach the patch if it miraculously still works ;-).

lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp
976–982

Hmm, though I guess it doesn't matter for FreeBSD right now and since this is entirely platform-specific code, I'll simplify it for now.

mgorny updated this revision to Diff 336107.Apr 8 2021, 7:54 AM

Implemented @labath's requests.

labath accepted this revision.Apr 8 2021, 8:11 AM

ship it

This revision is now accepted and ready to land.Apr 8 2021, 8:11 AM
This revision was landed with ongoing or failed builds.Apr 8 2021, 9:49 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 8 2021, 9:49 AM

test/Shell/Subprocess/vfork-follow-parent-wp.test is failing on GreenDragon:

http://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/31016/testReport/junit/lldb-shell/Subprocess/vfork_follow_parent_wp_test/

I'm going to temporarily disable the test on Darwin. Happy to provide help to investigate :-)

mgorny added a comment.Apr 8 2021, 3:33 PM

test/Shell/Subprocess/vfork-follow-parent-wp.test is failing on GreenDragon:

http://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/31016/testReport/junit/lldb-shell/Subprocess/vfork_follow_parent_wp_test/

I'm going to temporarily disable the test on Darwin. Happy to provide help to investigate :-)

I guess that vfork()-ed process doesn't start with cleared dbregs on Darwin. I suppose you don't handle forks at all, so I suppose skipping/xfailing the test is the thing to do — knowing that there's a potential issue but extremely unlikely in real life.

Forks are handled explicitly on FreeBSD, Linux and NetBSD right now but I've left these tests open to all platforms since they merely verify that LLDB doesn't explode when the debugged program forks, and that it (probably) doesn't try to debug the child accidentally.

I've reverted this (121cff78a8032a73aa4fb820625dc1ecae8e3997) because it made the linux bot super-flaky. I think it's likely that this affects _only_ linux (and the BSD code is fine -- though we don't have a bot to verify that), but I didn't want to experiment with partial reverts while the bots are wonky. However, it may be interesting the re-land the linux support in a separate commit, once we figure out the problem -- my money is on the nondeterminism of thread creation.

I don't really understand what's going on, but in the one case I was able to catch this "in the act" the inferior process had one thread which was stopped one byte after the breakpoint location (there's no valid instruction there). This leads me to believe that the FixupBreakpointPCAsNeeded logic did not fire for some reason.

I've reverted this (121cff78a8032a73aa4fb820625dc1ecae8e3997) because it made the linux bot super-flaky. I think it's likely that this affects _only_ linux (and the BSD code is fine -- though we don't have a bot to verify that), but I didn't want to experiment with partial reverts while the bots are wonky. However, it may be interesting the re-land the linux support in a separate commit, once we figure out the problem -- my money is on the nondeterminism of thread creation.

How about recommitting without the clone() change, i.e. assuming that clone always means thread, and seeing if that causes trouble still?

I've reverted this (121cff78a8032a73aa4fb820625dc1ecae8e3997) because it made the linux bot super-flaky. I think it's likely that this affects _only_ linux (and the BSD code is fine -- though we don't have a bot to verify that), but I didn't want to experiment with partial reverts while the bots are wonky. However, it may be interesting the re-land the linux support in a separate commit, once we figure out the problem -- my money is on the nondeterminism of thread creation.

How about recommitting without the clone() change, i.e. assuming that clone always means thread, and seeing if that causes trouble still?

I think I have it figured out -- see the inlined comment. It seems this was one of those (rare) cases where enabling logging actually makes a race _more_ likely...

lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
904–905

This is resuming the thread a second time. If the thread manages to stop between the resume inside AddThread, and here, we will miss the breakpoint (and subsequently crash).

Deleting these two lines should fix things.

mgorny marked an inline comment as done.Apr 13 2021, 5:35 AM
mgorny added inline comments.
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
904–905

Hmm, I wonder why it ended up being added here. Probably my mistake, sorry, and thanks for debugging it. I'm going to test and reland it shortly!