Watch for fork(2)/vfork(2) (also fork/vfork-style clone(2) on Linux)
notifications and explicitly detach the forked child process, and add
initial tests for these cases. The code covers FreeBSD, Linux
and NetBSD process plugins. There is no new user-visible functionality
provided -- this change lays foundations over subsequent work on fork
support.
Details
- Reviewers
labath emaste krytarowski - Commits
- rGc8d18cba4e2f: Reland "[lldb] [Process] Watch for fork/vfork notifications" for Linux
rG7da3b44d67f8: Reland "[lldb] [Process] Watch for fork/vfork notifications" for NetBSD
rG63d75641054a: Reland "[lldb] [Process] Watch for fork/vfork notifications" for FreeBSD
rGa345419ee030: [lldb] [Process] Watch for fork/vfork notifications
Diff Detail
Event Timeline
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
---|---|---|
2150 | This is not intended to send errors like this, I've just added it to test logging. | |
lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp | ||
187 ↗ | (On Diff #331393) | @labath, I'm not convinced this is the best way of doing it. Do you have any suggestions? |
Add a helper function to get PID (TGID) from TID. Recognize fork notifications from parent (SIGTRAP) and from child (SIGSTOP with a TGID != GetID()) and call a helper for them. The helper doesn't do anything except for logging right now.
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
---|---|---|
459 | For some reason, this doesn't print the correct tgid value (plain printf works). Any clue what's wrong here? |
Seems mostly ok. The getPidForTid business does not seem completely ideal, though. I think it may be cleaner to keep the threads for which we have received the creation event in some kind of a purgatory list until we receive the corresponding parent event (which will clarify what kind of a child we are dealing with).
We actually did something like that a very long time ago, but I removed it for being too complicated. Now, I think it may be worth going back to it.
BTW, this made me realize of a funky scenario we may need to handle somehow. I think it's possible (on linux) that two threads will initiate a fork operation concurrently, and so by the time that everything stops, we end up with three (or more) processes...
lldb/include/lldb/Host/common/NativeProcessProtocol.h | ||
---|---|---|
401–404 | The llvm way to do this is via an enum class in combination with LLVM_MARK_AS_BITMASK_ENUM. | |
lldb/source/Host/linux/Host.cpp | ||
319 | just say lldb_private::getPIDForTID | |
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
459 | what does it print? | |
651 | I guess we should also do something like the WaitForNewThread call in the clone case. | |
lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp | ||
187 ↗ | (On Diff #331393) | I'd probably do this by creating a new virtual function (with a signature like vector<string>(ArrayRef<StringRef>)`), which would receive the client features and return the server features. The base qSupported handler would be implemented in terms of this function. This would also allow us to avoid sending features that clearly have no place in the platform connections. |
I'm actually working on such a thing right now, though initially for forks. I can surely extend it to threads later. I will update the diff with it in a few minutes.
lldb/include/lldb/Host/common/NativeProcessProtocol.h | ||
---|---|---|
401–404 | Heh, and you've just told me to use constexprs instead of enums ;-). | |
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
459 | tid 118029 belongs to a different tgid 4295085325, assuming child When tgid was 0, it also printed this huge number. If I cast it to int, it prints fine. But then, first arg is also lldb::pid_t and it prints fine. | |
651 | Yeah, I was wondering about it. On the other hand, won't the main loop handle it anyway? Does doing it explicitly have any real gain? One thing I was worried about is that the case of parent signal first is much more common than child signal first, so having a different logic on both branches would make it more likely for bugs in the child-first branch not to be noticed. | |
lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp | ||
187 ↗ | (On Diff #331393) | Yeah, that's better than my ideas ;-). |
Actually wait for two signals. Store the extra info on parent signal, and use it when handling the fork. For now, the handler just detaches the child and resumes the parent.
Switched Extension to bitmask-enum, fixed prototype and decl for getPIDForTID and switched server features into a virtual as suggested by @labath.
The gdb model - since gdb only supports one debugee per gdb - is to either follow the fork or follow the parent. It would be more in keeping with lldb's model to make a new target for the child side of the fork, and use that to follow the child. That way you can continue to debug both the parent and the child processes. It doesn't look like you've gotten that far yet, but IMO that's the direction we should be going for lldb.
Now using the new thingie for threads. getTIDForPID() temporarily commented out but I plan to use it to distinguish clone() for new thread from clone() for new process.
Long-term, sure. Between life and deadlines, we're only working on the simpler one process model right now.
Add a super-trivial test that the debugger doesn't lose parent after the fork.
Next on the list: clear software breakpoints in the detached fork.
This is no longer true -- gdb supports multiple inferiors these days (I'm not sure when exactly did it grow that support). The automatic detaching of the fork child can be disabled by the "detach-on-fork" setting, at which point you get two inferiors for each fork.
Michal doesn't appear to be interested in this feature right now, just the ability to follow the child process. I think that's fair, as the two inferior setup would be more complicated. However, I am keeping this scenario in mind and making sure that it's possible to support it in the future. The idea is that at the gdb protocol level we will have the notion of two inferiors (using the same syntax that gdb does), but the only thing that the client (and server, mostly) support is to immediately detach from one of the processes after a fork. Later this could be changed to do something more interesting.
Oh, I'm glad to hear that, both about gdb and what sounds like a good plan for lldb going forward.. We just need to make sure supporting both sides of the fork stays the natural direction, even if we don't support it right off.
Using background apps to create sandbox barriers for lower-permissions tasks is becoming a more and more common architecture. Having a framework for doing "task level" debugging that can cross these process barriers as the tasks move back and forth would be a great addition to lldb, but for that we need to be able to keep both sides of the fork under lldb's control.
I'm certainly not planning to put any obstacles to full multiprocess debugging in the future, and I'm going to try to pave the way for it whenever possible. One thing I'm considering right now is adding some support for tracking additional NativeProcess* instances to LLGS. While it's not technically necessary, I think it will make clearing software breakpoints before detaching cleaner. However, I'm not yet sure how to integrate vfork into this later.
So I've decided to circle back a bit and start by adding server-side detaching-on-fork support. This means that this patch doesn't depend on multiprocess support anymore.
It includes refactoring clone/fork/vfork event monitoring into a single function. Right now, only forks are supported — the handler creates a local NativeProcessLinux instance (@labath, any suggestions how to make this less hacky?), duplicates parent's breakpoints into it (just like the child gets parent's memory) and uses it to clear these breakpoints and detach the child process. I suppose this will make things easier when we want to make the class more persistent in the future.
TODO: vfork, clone
Included vfork(). Still have to look into other clone() variants, then other platforms.
lldb/test/Shell/Subprocess/fork-follow-parent.test | ||
---|---|---|
7 | The tests sometimes fail if child starts after the breakpoint is hit (and therefore this check happens before stop reason below). Any suggestion how to make it work independently of where 'function run in child' happens? |
lldb/include/lldb/Host/common/NativeProcessProtocol.h | ||
---|---|---|
145 | Long term, these functions are probably unnecessary. It just occurred to me to check how GDB handles it, and it handles clearing and restoring breakpoints on client side (i.e. by sending z and Z packets). |
lldb/include/lldb/Host/common/NativeProcessProtocol.h | ||
---|---|---|
145 | Hmm, actually that depends on how much compatibility with old client versions we want to preserve. If we want forks/vforks to stop being broken with breakpoints when using an old client, we need to keep them as a fallback logic for when fork/vfork-events aren't supported by client. |
Now includes initial FreeBSD support. The watchpoint test still fails, we probably need to clean dbregs on fork too.
I don't find this particularly hacky. However, I have doubts about the breakpoint clearing aspect. If we add that now, I think it will be tricky to move that process to the client in the future (how will the client know whether the breakpoints were auto-removed?) Given that breakpoints in forked process have been borked since forever, I don't see harm in keeping that broken for a little while longer. Even a client with extremely limited of multiple processes (like the one we had in mind) should not have a problem with sending the appropriate z packets before it detaches. This patch could still come first (it's great that you've separated that out) -- it just needs to ensure that there are no breakpoints in the child path which would need clearing.
Btw, have you by any chance looked at how gdb determines which clone()s constitute a new thread?
lldb/include/lldb/Host/common/NativeProcessProtocol.h | ||
---|---|---|
145 | I certainly don't think we need to try to "fix" older clients by doing stuff to the server. (And on top of that, I am less worried about people using older clients, than I am about older servers -- those can be harder to update.) | |
401–404 | That's cause you were using an enum to define constants. If you were using it to define a type, then I would be fine (though I might ask to make it a scoped enum). | |
lldb/source/Host/linux/Host.cpp | ||
323 | Why is this checking for zero ? (When it's being set to INVALID_PID above?) | |
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
459 | I dunno. I recommend debugging it. It smells like some kind of undefined behavior. | |
924 | if (pgid_ret != child_pid) ? | |
lldb/test/Shell/Subprocess/fork-follow-parent.test | ||
7 | There are several ways around that, but I don't know which one is the best, as I'm not sure what exactly you're trying to test. You could try to capture this nondeterminism with CHECK-DAGs, but CHECK-DAGs don't combine well with CHECK-NOTs. Another option is to make this predictable by adding synchronization to the inferior. If you just wanted to check that the child process finishes correctly, you could have the parent process check its exit status and match that... |
Actually, that's not a problem at all. If client doesn't indicate fork-events support via qSupported, we handle breakpoints server-side. If it does, we deliver fork stop reason and then the client is responsible for dealing with breakpoints.
Btw, have you by any chance looked at how gdb determines which clone()s constitute a new thread?
If I read the code correctly, GDB always assumes PTRACE_EVENT_CLONE is a new thread, and does not support spawning processes via it.
lldb/test/Shell/Subprocess/fork-follow-parent.test | ||
---|---|---|
7 | The goal is to check that parent is stopped before printing but the child is not. I suppose your last suggestion, i.e. checking for exit status, should be good enough. |
Yeah, but then we have to maintain two copies of breakpoint-unsetting code in perpetuity. Given that we've managed to come this far with this being completely broken, I think we can wait a little longer until the client support is in place.
Btw, have you by any chance looked at how gdb determines which clone()s constitute a new thread?
If I read the code correctly, GDB always assumes PTRACE_EVENT_CLONE is a new thread, and does not support spawning processes via it.
Ah, cute. I guess someone should let them know about this (and maybe also tell kernel folks that differentiating cloned processes is really hard). :)
Ok, I guess I can do that.
What about debug registers on FreeBSD? This system is pretty unique because child processes inherit dbregs from parent, and therefore we need to clear them. GDB doesn't do that and watchpoints crash forked children. Should we keep the clearing part as a hack specific to FreeBSD, or have lldb generic clear all hardware breakpoints and watchpoints on fork? Independently of that, I'm going to file a bug for FreeBSD to see if this is really the intended behavior.
Removed software breakpoint cleanup. Modified tests not to expect child's output at any specific point.
I think we can keep the freebsd-specific clearing code (in the server). Clearing all hardware breakpoints in the client would be pretty logical, but if we wanted to make that work consistently, then we might need to add code to /other/ operating system implementations to /set/ the breakpoints (so that the client can clear them properly)...
lldb/include/lldb/Host/linux/Host.h | ||
---|---|---|
14 | These spaces are pretty common in lldb, but that's not actually consistent with how the rest of llvm formats them... | |
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp | ||
737 | And the netbsd comment might apply to freebsd as well... | |
lldb/source/Plugins/Process/FreeBSD/NativeRegisterContextFreeBSD_arm64.cpp | ||
290–291 ↗ | (On Diff #334296) | |
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
647–665 | It looks like these two (and probably also PTRACE_EVENT_CLONE) could be bundled together. | |
659 | If this fails (I guess that can only happen if the entire process was SIGKILLED), how does continuing with pid=0 help? | |
914 | auto * | |
939 | I think we have some macro or attribute for that these days... | |
944 | Copying the breakpoints is not needed yet, right? We could just group these two (fork/vfork) paths for now... | |
960 | llvm_unreachable("unknown clone_info.event"); | |
963 | erase(find_it) would be slightly more efficient. (And it might be better to put this before the switch, to make sure it executes in case of early exits.) | |
lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.cpp | ||
784 ↗ | (On Diff #334296) | For NetBSD, it might be cleaner to keep waiting for the parent only, and then do a separate waitpid for the child once you get its process id. That would make the sequence of events (and the relevant code) predictable. I think this is how this interface was meant to be used, even on linux, but linux makes it tricky because it's not possible to wait for all threads belonging to a given process in a single call -- one would have to iterate through the threads and wait for each one separately (it might be interesting to try to implement and benchmark that). |
lldb/test/Shell/Subprocess/Inputs/clone.c | ||
23 ↗ | (On Diff #334296) | Better include an alignas directive, just in case. |
lldb/test/Shell/Subprocess/Inputs/vfork.c | ||
2 | Maybe merge these three files into one, parameterized by a define or something? | |
lldb/test/Shell/Subprocess/clone-follow-parent-wp.test | ||
1 ↗ | (On Diff #334296) | Why only these three arches? |
lldb/test/Shell/Subprocess/fork-follow-parent.test | ||
2 | This (and maybe others) should have UNSUPPORTED: system-windows, I guess. |
lldb/test/Shell/Subprocess/clone-follow-parent-wp.test | ||
---|---|---|
1 ↗ | (On Diff #334296) | Right, I was supposed to replace this with something better later. It's supposed to limit the test to arches where we support watchpoints. |
Update the Linux plugin and tests, mostly.
TODO: FreeBSD, NetBSD and watchpoint feature for tests.
lldb/include/lldb/Host/linux/Host.h | ||
---|---|---|
14 | Will remove. However, note that this causes clang-format to reorder the includes. | |
lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.cpp | ||
784 ↗ | (On Diff #334296) | I've been thinking about it and I suppose it makes sense. I was considering what would be better to future-proof it for the future full multiprocess support but I suppose we could just loop over all process classes and let every one wait for its own process rather than creating a dedicated dispatcher that calls waitpid(-1...). |
Simplified FreeBSD/NetBSD to use per-PID waitpid(). Removed arch restrictions from wp tests — made them consistent with Python tests.
This looks pretty good now. Just some small nits inline...
lldb/include/lldb/Host/linux/Host.h | ||
---|---|---|
14 | And it should do that -- that's part of the point :) | |
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp | ||
734–737 | Is the loop still necessary? | |
990–996 | assert(event&PL_FLAG_FORKED) would be shorter, but I am not sure if we need even that, as this is already checked in the caller. You could just drop the entire event argument, if it is unused.. | |
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.h | ||
124 | move this next to the other Monitor functions ? | |
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
651 | Maybe obsolete by now, but I just realized what you actually meant here. There will always be some divergence between the two sequence of events. Letting this be handled from the main loop may make it smaller though. I don't really mind doing it that way, as long as the (v)fork and clone cases do it the same way. | |
lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.cpp | ||
760 ↗ | (On Diff #334957) | same here |
784 ↗ | (On Diff #334296) | I think that the looping would be the cleanest solution for the BSDs. For linux, we will need to call waitpid centrally, and then somehow choose which instance to notify. |
lldb/source/Plugins/Process/NetBSD/NativeProcessNetBSD.h | ||
117 ↗ | (On Diff #334957) | same here |
lldb/test/API/functionalities/gdb_remote_client/TestMultiprocess.py | ||
3 ↗ | (On Diff #334957) | bad merge |
lldb/test/Shell/Subprocess/Inputs/fork.cpp | ||
30–42 ↗ | (On Diff #334957) | How about: #if def TEST_CLONE pid_t pid = clone(child_func, &stack[sizeof(stack)], 0, NULL); #elif defined TEST_FORK pid_t pid = TEST_FORK(); // defined to fork of vfork if (pid == 0) _exit(child_func(NULL)); #endif ? |
lldb/test/Shell/Subprocess/clone-follow-parent.test | ||
2 ↗ | (On Diff #334957) | The UNSUPPORTED:windows is _not_ required if you explicitly list supported operating systems. :) |
Answered where answer was due, will update the rest once I finish retesting the multiprocess patch.
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp | ||
---|---|---|
734–737 | I am not sure but I think there's no harm in keeping it, and it might save us from some hard-to-debug corner conditions in the future. | |
990–996 | The distinction is necessary to know how to deal with software breakpoints, and it will be used in the followup patch (D99864, when I extend it to non-Linux targets). I think removing event at this point to reintroduce it would be a wasted effort. That said, I guess this is yet another case for llvm_unreachable(). Also, I think it's worth keeping assert()s like this as they make sure we don't break stuff accidentally when changing code on the caller's end. |
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp | ||
---|---|---|
734–737 | I don't mind that much, though I would be surprised if this can help with anything. IIUC, the BSD kernel always sends one notification when the process stops, and I believe this code is written in a way which expects that (only linux has special code to delay some actions until every thread registers as stopped). If, for whatever reason, we end up getting two notifications here, I expect things would blow up spectacularly... | |
990–996 | True, but it's even better when the code is written such that the asserts/unreachables are unnecessary. One way to do that might be to do the decoding in the caller: if (info.pl_flags & PL_FLAG_FORKED) { MonitorClone(info.pl_child_pid, /*bool is_vfork*/ fork_flags & PL_FLAG_VFORKED); return; } It's hard to say whether that would work without seeing the real implementation. |
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp | ||
---|---|---|
990–996 | In the long run, we would probably also want a separate event for posix_spawn on NetBSD. I've figured out that reusing native constants avoids reinventing the wheel unnecessarily. |
Ok, I think I've implemented all the requested changes. I'm going to test it on all three platforms now and attach the patch if it miraculously still works ;-).
lldb/source/Plugins/Process/FreeBSD/NativeProcessFreeBSD.cpp | ||
---|---|---|
990–996 | Hmm, though I guess it doesn't matter for FreeBSD right now and since this is entirely platform-specific code, I'll simplify it for now. |
test/Shell/Subprocess/vfork-follow-parent-wp.test is failing on GreenDragon:
I'm going to temporarily disable the test on Darwin. Happy to provide help to investigate :-)
I guess that vfork()-ed process doesn't start with cleared dbregs on Darwin. I suppose you don't handle forks at all, so I suppose skipping/xfailing the test is the thing to do — knowing that there's a potential issue but extremely unlikely in real life.
Forks are handled explicitly on FreeBSD, Linux and NetBSD right now but I've left these tests open to all platforms since they merely verify that LLDB doesn't explode when the debugged program forks, and that it (probably) doesn't try to debug the child accidentally.
I've reverted this (121cff78a8032a73aa4fb820625dc1ecae8e3997) because it made the linux bot super-flaky. I think it's likely that this affects _only_ linux (and the BSD code is fine -- though we don't have a bot to verify that), but I didn't want to experiment with partial reverts while the bots are wonky. However, it may be interesting the re-land the linux support in a separate commit, once we figure out the problem -- my money is on the nondeterminism of thread creation.
I don't really understand what's going on, but in the one case I was able to catch this "in the act" the inferior process had one thread which was stopped one byte after the breakpoint location (there's no valid instruction there). This leads me to believe that the FixupBreakpointPCAsNeeded logic did not fire for some reason.
How about recommitting without the clone() change, i.e. assuming that clone always means thread, and seeing if that causes trouble still?
I think I have it figured out -- see the inlined comment. It seems this was one of those (rare) cases where enabling logging actually makes a race _more_ likely...
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
---|---|---|
930–931 | This is resuming the thread a second time. If the thread manages to stop between the resume inside AddThread, and here, we will miss the breakpoint (and subsequently crash). Deleting these two lines should fix things. |
lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp | ||
---|---|---|
930–931 | Hmm, I wonder why it ended up being added here. Probably my mistake, sorry, and thanks for debugging it. I'm going to test and reland it shortly! |
Long term, these functions are probably unnecessary. It just occurred to me to check how GDB handles it, and it handles clearing and restoring breakpoints on client side (i.e. by sending z and Z packets).