This is an archive of the discontinued LLVM Phabricator instance.

[lit] Fix the `--max-time` flag feature which was completely broken.
Accepted · Public

Authored by delcypher on May 22 2018, 11:01 AM.

Details

Summary

[lit] Fix the --max-time flag feature which was completely broken.

For both the multiprocess mode and the in-process sequential mode, using
--max-time didn't work. For the former, calling a.successful() when the
result was not ready would lead to an exception being raised. For the latter,
max_time was being completely ignored.

Two test cases have been added to exercise this basic functionality.

This implementation is far from ideal:

  • For the multiprocess mode, tests that started executing but didn't finish are marked as UNRESOLVED. They should really be marked as TIMEOUT instead.
  • For the single-process mode, lit's per-test timeout feature is used as a hacky way to make sure --max-time is enforced. That feature currently requires the Python psutil module, so using --max-time together with --single-process now also requires psutil to be installed.
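
As a rough illustration of how the sequential path can honor the overall deadline (a sketch with made-up names, not the literal code in this diff):

  import time

  def execute_tests_sequentially(tests, run_one_test, max_time=None):
      # Illustrative only: stop starting new tests once max_time has elapsed.
      deadline = None if max_time is None else time.time() + max_time
      for test in tests:
          if deadline is not None and time.time() >= deadline:
              break  # --max-time exceeded; remaining tests are left unexecuted
          run_one_test(test)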


Event Timeline

delcypher created this revision. May 22 2018, 11:01 AM
rnk added inline comments. May 22 2018, 11:25 AM
utils/lit/lit/run.py
127

If you are running the tests in process, there is no need to use this global child_lit_config. It is only needed so that multiprocessing can pickle the data and set it up in the child process. We should be able to update self.lit_config.maxIndividualTestTime directly, and it will take effect.

delcypher added inline comments. May 23 2018, 1:07 AM
utils/lit/lit/run.py
127

I don't think the use of child_lit_config matters here. worker_initializer() is not called on this path, so AFAICT there's no copying going on. In execute_tests_sequentially_in_process() I have to modify the child_lit_config global because worker_run_one_test() (which we call directly) expects it to be set.

In execute_tests_sequentially_in_process() I don't have to use child_lit_config so for clarity I will change this and add a comment about `child_lit_config`.
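
For reference, a minimal self-contained sketch of the pattern child_lit_config follows (the names worker_initializer / worker_run_one_test are reused purely for illustration; this is not lit's real code):

  import multiprocessing

  child_lit_config = None  # per-process global, mirroring the one discussed above

  def worker_initializer(config):
      # Runs once in each spawned worker; `config` arrives via pickling.
      global child_lit_config
      child_lit_config = config

  def worker_run_one_test(test_name):
      # Relies on the global having been set, either by the pool initializer
      # (multiprocess mode) or by a direct assignment (in-process mode).
      return "%s (maxIndividualTestTime=%s)" % (
          test_name, child_lit_config["maxIndividualTestTime"])

  if __name__ == "__main__":
      config = {"maxIndividualTestTime": 60}
      pool = multiprocessing.Pool(2, worker_initializer, (config,))
      print(pool.map(worker_run_one_test, ["a.test", "b.test"]))
      pool.close()
      pool.join()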

delcypher updated this revision to Diff 148171. May 23 2018, 1:24 AM

Avoid use of child_lit_config where possible and add a comment explaining why we assign to the child_lit_config global.

rnk accepted this revision. May 29 2018, 1:51 PM

lgtm

This revision is now accepted and ready to land. May 29 2018, 1:51 PM

@rnk Hmm, this isn't quite ready to land. I've just noticed that the shtest-max-global-time-multiprocess.py test leaves the 1_infinite_loop.py process (and the shell, bash in my case, that spawned it) lying around. This will cause problems on the bots, so we can't commit this.

My guess is that calling pool.terminate() followed by pool.wait() doesn't actually do proper cleanup. This problem doesn't seem very easy to fix. When a user presses CTRL+C this calls abort_now(). On POSIX systems this sends SIGKILL to process ID 0, which causes all processes in the same process group to be killed.
We can't use that strategy to fix this problem. Even if we do everything we want lit to do (e.g. write out an XML report) and then call abort_now() via atexit.register(abort_now), it doesn't work. It breaks the testing because the lit process and the running copy of lit it is testing are in the same process group.
Calling abort_now() in the inner running copy of lit therefore also kills the outer running copy, which makes the run of the lit test suite fail.
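
For context, abort_now() behaves roughly like this (POSIX only shown; lit's actual implementation may differ in detail):

  import os
  import signal
  import sys

  def abort_now():
      if os.name == "posix":
          # pid 0 means "every process in the caller's process group", which is
          # why an inner lit (run as a test) calling this also kills the outer lit.
          os.kill(0, signal.SIGKILL)
      else:
          sys.exit(3)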

One approach to fixing this could be to have lit run each test in its own process group. Then, when we need to kill things, we kill each group. This approach means we can kill lit's child processes without killing the main lit process itself. It also potentially means we could fix
the per-test-timeout feature to kill tests this way and thus avoid our dependency on the psutil module. This approach has some downsides though (a rough sketch of the idea follows the list below).

  • Won't work on Windows. We'd need to come up with something else.
  • SIGKILL'ing the process group of the main lit process will no longer kill all child processes. I'm not sure if this is a problem or not.
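
The per-test process group idea might look roughly like this on POSIX (hypothetical, not part of this patch; Python 2 would need preexec_fn=os.setpgrp instead of start_new_session):

  import os
  import signal
  import subprocess

  def run_test_in_own_group(cmd):
      # start_new_session puts the test (and anything it spawns) into a new
      # session/process group, separate from lit's own group.
      return subprocess.Popen(cmd, start_new_session=True)

  def kill_test_group(proc):
      # Kill the whole group without touching the main lit process.
      try:
          os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
      except OSError:
          pass  # the group has already exited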

Another approach is to depend on the psutil module again and have it walk through all of lit's child processes and kill them. This is the simplest short-term solution, and it's probably the right thing to do because --max-time doesn't seem that widely used; otherwise someone would have already tried to fix --max-time before me.
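
Something along these lines (an illustrative sketch only; the timeout value is arbitrary):

  import psutil

  def kill_all_lit_children():
      # Walk every descendant of the current (lit) process, kill it, then
      # reap the lot so nothing is left lying around on the bots.
      children = psutil.Process().children(recursive=True)
      for child in children:
          try:
              child.kill()
          except psutil.NoSuchProcess:
              pass
      psutil.wait_procs(children, timeout=5)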

What do you think?

rnk added a reviewer: thakis. May 30 2018, 11:36 AM
rnk added a comment. May 30 2018, 11:40 AM

+@thakis, who has been prototyping some lit optimizations, I think.

Another approach is to depend on the psutil module again and have it walk through all of lit's child processes and kill them. This is the simplest short-term solution, and it's probably the right thing to do because --max-time doesn't seem that widely used; otherwise someone would have already tried to fix --max-time before me.

That sounds like a good way to go.

The multiprocessing shutdown behavior leaves a lot to be desired, and it's basically impossible to write reliable tests for it. I regularly have problems interrupting lit test execution here on Windows. It worked better when I used mintty, but I switched back to cmd after updating to Windows 10. I think improving all of that behavior is a separate issue that's out of scope for this, and probably a higher priority.

In D47210#1116629, @rnk wrote:

+@thakis, who has been prototyping some lit optimizations, I think.

Another approach is to depend on the psutil module again and have it walk through all of lit's child processes and kill them. This is the simplest short-term solution, and it's probably the right thing to do because --max-time doesn't seem that widely used; otherwise someone would have already tried to fix --max-time before me.

That sounds like a good way to go.

The multiprocessing shutdown behavior leaves a lot to be desired, and it's basically impossible to write reliable tests for it. I regularly have problems interrupting lit test execution here on Windows. It worked better when I used mintty, but I switched back to cmd after updating to Windows 10. I think improving all of that behavior is a separate issue that's out of scope for this, and probably a higher priority.

Okay. Is there a portable and reliable way for the worker processes to catch pool.terminate() being called from the parent process? Performing the process traversal and kill in the parent process is likely to be racy, so it's better if we can do it in the worker processes.

If we did the process traversal and kill work in the parent process, it might be very unreliable, because the only place we can do it is just before calling pool.terminate(). We can't do it after, because by then the child processes have already lost their parent and so won't be found by a traversal of the process tree.
If we do it before calling pool.terminate(), there's a risk the pool might (depending on how it's implemented) respawn worker processes if it detects they died and start redoing work.

rnk added a comment. May 31 2018, 1:29 PM

Okay. Is there a portable and reliable way for the worker processes to catch pool.terminate() being called from the parent process? Performing the process traversal and kill in the parent process is likely to be racy, so it's better if we can do it in the worker processes.

I don't think so. I think terminate is intended to be unclean, i.e. you might get SIGKILL-like behavior on some platforms.

If we did the process traversal and kill work in the parent process, it might be very unreliable, because the only place we can do it is just before calling pool.terminate(). We can't do it after, because by then the child processes have already lost their parent and so won't be found by a traversal of the process tree.
If we do it before calling pool.terminate(), there's a risk the pool might (depending on how it's implemented) respawn worker processes if it detects they died and start redoing work.

It might not be as racy as you think. pids are not available for reuse until after the parent process has waited on them. This is how you get "zombie" processes in Unix, I think. So, taking the interrupt in the parent, iterating, killing each in turn, then waiting, shouldn't be racy: either the child worker will finish before it receives the kill signal, or it will take the kill signal. Then we wait on it.
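
Roughly (a sketch only; worker_pids stands in for however we obtain the pool's worker process IDs):

  import os
  import signal

  def kill_and_reap(worker_pids):
      for pid in worker_pids:
          try:
              os.kill(pid, signal.SIGKILL)
          except OSError:
              pass  # the worker is already gone
      for pid in worker_pids:
          try:
              # A child's pid cannot be reused until we have waited on it, so
              # the kill loop above cannot hit an unrelated process.
              os.waitpid(pid, 0)
          except OSError:
              pass  # already reaped elsewhere (e.g. by the pool itself)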

You will need a way to cancel the creation of new workers. Does the .close() method do that?

In D47210#1118125, @rnk wrote:

Okay. Is there a portable and reliable way for the worker processes to catch pool.terminate() being called from the parent process? Performing the process traversal and kill in the parent process is likely to be racy, so it's better if we can do it in the worker processes.

I don't think so. I think terminate is intended to be unclean, i.e. you might get SIGKILL-like behavior on some platforms.

Looking at the implementation, though, multiprocessing.Pool.terminate() calls .terminate() on each multiprocessing.Process, which appears to send SIGTERM to the worker processes.
So it looks like I could handle killing child processes in the worker processes, at least on POSIX systems.
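
If so, a worker-side handler could look roughly like this (a hypothetical sketch, POSIX only; not what this patch does):

  import signal
  import sys

  import psutil

  def _kill_own_children_and_exit(signum, frame):
      # Runs in the worker when the parent's pool.terminate() delivers SIGTERM;
      # clean up the test processes this worker spawned before exiting.
      for child in psutil.Process().children(recursive=True):
          try:
              child.kill()
          except psutil.NoSuchProcess:
              pass
      sys.exit(1)

  def worker_initializer():
      # Install the handler once per worker, e.g. from the pool's initializer.
      signal.signal(signal.SIGTERM, _kill_own_children_and_exit)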

If we did the process traversal and kill work in the parent process, it might be very unreliable, because the only place we can do it is just before calling pool.terminate(). We can't do it after, because by then the child processes have already lost their parent and so won't be found by a traversal of the process tree.
If we do it before calling pool.terminate(), there's a risk the pool might (depending on how it's implemented) respawn worker processes if it detects they died and start redoing work.

It might not be as racy as you think. pids are not available for reuse until after the parent process has waited on them. This is how you get "zombie" processes in Unix, I think. So, taking the interrupt in the parent, iterating, killing each in turn, then waiting, shouldn't be racy: either the child worker will finish before it receives the kill signal, or it will take the kill signal. Then we wait on it.

You will need a way to cancel the creation of new workers. Does the .close() method do that?

No, it doesn't. All this does is prevent new tasks from being added to the pool.

This (the creation of new workers) was the problem I was concerned with. The implementation of multiprocessing.Pool (at least for Python 2.7 on macOS) spawns a thread that periodically keeps the pool alive by respawning missing worker processes.
I could certainly do some hacks that abuse knowledge of the internal workings of multiprocessing.Pool to make sure the pool doesn't respawn workers when I kill processes, but that would be very fragile.

rnk added a comment. Jun 14 2018, 2:01 PM

Sorry, I haven't had time to read up and respond to this. It's very likely that I will go on vacation shortly, and then not have time to come back to this until August.

delcypher marked an inline comment as done. Jan 18 2019, 10:31 AM
delcypher added inline comments.
utils/lit/lit/run.py
159

Note to self. We need test.config.parallelism_group = None here when this patch is applied upstream.

Is there any plan to fix this patch and land it? We use --max-time in our tests and intermittently run into the Python assertion. We can work around the problem by killing lit.py with another timer, but then we lose the report that lit.py produces at the end of the run.

Is there any plan to fix this patch and land it? We use --max-time in our tests and intermittently run into the Python assertion. We can work around the problem by killing lit.py with another timer, but then we lose the report that lit.py produces at the end of the run.

@bryanpkc

No. I don't have time to work on it. There is a fundamental problem with this patch: long-running tasks do not get killed properly (see https://reviews.llvm.org/D47210#1115648).
Until that problem is solved we cannot land this patch.