This is an archive of the discontinued LLVM Phabricator instance.

[lldb/Test] Use a process group for subprocesses
AbandonedPublic

Authored by JDevlieghere on Jul 14 2020, 2:34 PM.

Details

Reviewers
labath
teemperor
Summary

Use a process group for processes spawned by the test suite so we can send a signal to the whole group. This ensures any child processes are terminated as well.

Diff Detail

Event Timeline

JDevlieghere created this revision.Jul 14 2020, 2:34 PM

We're still seeing a lot of zombies from the test suite on the bots. I'm hoping this might get rid of one possible source.

hmm... I have a lot thoughts here..

  • setsid is overkill. If you want to create process group, create a process group (setpgid), not a session.
  • this solution does not seem very windows-friendly. In fact, I'd be surprised if it works there at all.
  • going for os.kill will also not work with remote test suite runs. In fact, the way it is implemented now means we could start killing random processes on the host.
  • are these really zombies (dead but not waited-for processes) or just live processes stuck in infinite loops and similar? Because if they are zombies (mmm... brainz), then I doubt sending signals to them will help -- they're already dead. The problem is that someone is not reaping them. I'm not sure what can be done about that, as normally init should take care of this...
  • even if they are live processes, I am doubtful that this will make a difference. I don't think many of the processes we spawn create additional subprocesses. In fact, the only one I know of is TestChangeProcessGroup.py, which also creates a new process group... so this test would need special handling anyway.
  • creating lots of process groups can actually cause us to leak _more_ processes in case the test run is interrupted with ^C or similar (SIGINT goes to the foreground process group)

Overall, I think we should try to understand the problem a bit better before we start shotgun debugging this. Which tests are creating these zombies? What's special about them? Do they always do that or only occasionally? etc.

PS: When we were running a mac bot (it seems only macs are vulnerable to zombie infestations), we configured it to reboot itself each midnight. Not the most elegant solution, but if the problem really is about init refusing to reap zombies, then it may be the only thing that we can do...

lldb/packages/Python/lldbsuite/test/lldbtest.py
894

Right this is pretty much equivalent to what we've had before. I guess you wanted to say os.killpg(os.getpgid(pid), SIGTERM). os.kill(-os.getpgid(pid)) might also work, but the first one seems more explicit.

JDevlieghere abandoned this revision.Jul 15 2020, 8:34 AM

hmm... I have a lot thoughts here..

  • setsid is overkill. If you want to create process group, create a process group (setpgid), not a session.
  • this solution does not seem very windows-friendly. In fact, I'd be surprised if it works there at all.
  • going for os.kill will also not work with remote test suite runs. In fact, the way it is implemented now means we could start killing random processes on the host.
  • are these really zombies (dead but not waited-for processes) or just live processes stuck in infinite loops and similar? Because if they are zombies (mmm... brainz), then I doubt sending signals to them will help -- they're already dead. The problem is that someone is not reaping them. I'm not sure what can be done about that, as normally init should take care of this...
  • even if they are live processes, I am doubtful that this will make a difference. I don't think many of the processes we spawn create additional subprocesses. In fact, the only one I know of is TestChangeProcessGroup.py, which also creates a new process group... so this test would need special handling anyway.
  • creating lots of process groups can actually cause us to leak _more_ processes in case the test run is interrupted with ^C or similar (SIGINT goes to the foreground process group)

You make some compelling arguments :-) Given that we were already creating a process group on fork, I'll keep the change on line 902 that kills the process group instead of just the pid.

Overall, I think we should try to understand the problem a bit better before we start shotgun debugging this. Which tests are creating these zombies? What's special about them? Do they always do that or only occasionally? etc.

I and many people on the team have already tried that and so far we've come up with nothing. We don't know which tests are leaking the zombies. It doesn't happen when you just run the test suite over and over again. But by the end of the week you suddenly have a few hundred... I have a machine that I only SSH in and rarely reboot. It's not uncommon for it to have more than a 1000 zombies.

PS: When we were running a mac bot (it seems only macs are vulnerable to zombie infestations), we configured it to reboot itself each midnight. Not the most elegant solution, but if the problem really is about init refusing to reap zombies, then it may be the only thing that we can do...

GreenDragon has a "zombie detector" which periodically checks the number and reboots when it exceeds a certain value.

Given that we were already creating a process group on fork, I'll keep the change on line 902 that kills the process group instead of just the pid.

That seems ok -- all of those arguments (except maybe the last one) don't really apply there. Though I am tempted to remove the setpgid call from the forking method too (or even deleting the entire forking method). :P

Given that we were already creating a process group on fork, I'll keep the change on line 902 that kills the process group instead of just the pid.

That seems ok -- all of those arguments (except maybe the last one) don't really apply there. Though I am tempted to remove the setpgid call from the forking method too (or even deleting the entire forking method). :P

You got me at "delete the entire [thing]" ;-)