This is an archive of the discontinued LLVM Phabricator instance.

[lit] Fix the `--max-time` flag feature which was completely broken.
Accepted · Public

Authored by delcypher on May 22 2018, 11:01 AM.

Details

Summary

[lit] Fix the --max-time flag feature which was completely broken.

For both the multiprocess mode and the in-process sequential mode, using
--max-time didn't work. For the former, calling a.successful() when the
result was not ready would lead to an exception being raised. For the latter,
max_time was being completely ignored.

Two test cases have been added to exercise this basic functionality.

This implementation is far from ideal:

  • For the multiprocess mode, tests that started executing but didn't finish are marked as UNRESOLVED. They should really be marked as TIMEOUT instead.
  • For the single-process mode, lit's per-test timeout feature is used as a hacky way to make sure --max-time is enforced. That feature currently requires the Python psutil module, so using --max-time together with --single-process now also requires psutil to be installed.
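
As a rough illustration of how the sequential path can honor the overall deadline (a sketch with made-up names, not the literal code in this diff):

  import time

  def execute_tests_sequentially(tests, run_one_test, max_time=None):
      # Illustrative only: stop starting new tests once max_time has elapsed.
      deadline = None if max_time is None else time.time() + max_time
      for test in tests:
          if deadline is not None and time.time() >= deadline:
              break  # --max-time exceeded; remaining tests are left unexecuted
          run_one_test(test)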


Event Timeline

delcypher created this revision. May 22 2018, 11:01 AM
rnk added inline comments. May 22 2018, 11:25 AM
utils/lit/lit/run.py
127

If you are running the tests in process, there is no need to use this global child_lit_config. It is only needed so that multiprocessing can pickle the data and set it up in the child process. We should be able to update self.lit_config.maxIndividualTestTime directly, and it will take effect.

delcypher added inline comments. May 23 2018, 1:07 AM
utils/lit/lit/run.py
127

I don't think the use of child_lit_config matters here. worker_initializer() is not called on this path, so AFAICT there's no copying going on. In execute_tests_sequentially_in_process() I have to modify the child_lit_config global because worker_run_one_test() (which we call directly) expects it to be set.

In execute_tests_sequentially_in_process() I don't have to use child_lit_config so for clarity I will change this and add a comment about `child_lit_config`.
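
For reference, a minimal self-contained sketch of the pattern child_lit_config follows (the names worker_initializer / worker_run_one_test are reused purely for illustration; this is not lit's real code):

  import multiprocessing

  child_lit_config = None  # per-process global, mirroring the one discussed above

  def worker_initializer(config):
      # Runs once in each spawned worker; `config` arrives via pickling.
      global child_lit_config
      child_lit_config = config

  def worker_run_one_test(test_name):
      # Relies on the global having been set, either by the pool initializer
      # (multiprocess mode) or by a direct assignment (in-process mode).
      return "%s (maxIndividualTestTime=%s)" % (
          test_name, child_lit_config["maxIndividualTestTime"])

  if __name__ == "__main__":
      config = {"maxIndividualTestTime": 60}
      pool = multiprocessing.Pool(2, worker_initializer, (config,))
      print(pool.map(worker_run_one_test, ["a.test", "b.test"]))
      pool.close()
      pool.join()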

delcypher updated this revision to Diff 148171. May 23 2018, 1:24 AM

Avoid use of child_lit_config where possible and add a comment explaining why we assign to the child_lit_config global.

rnk accepted this revision. May 29 2018, 1:51 PM

lgtm

This revision is now accepted and ready to land. May 29 2018, 1:51 PM

@rnk Hmm, this isn't quite ready to land. I've just noticed that the shtest-max-global-time-multiprocess.py test leaves the 1_infinite_loop.py process (and the shell, bash in my case, that spawned it) lying around. This will cause problems on the bots, so we can't commit this.

My guess is that calling pool.terminate() followed by pool.wait() doesn't actually do proper cleanup. This problem doesn't seem very easy to fix. When a user presses CTRL+C this calls abort_now(). On POSIX systems this sends SIGKILL to process ID 0, which causes all processes in the same process group to be killed.
We can't use that strategy to fix this problem. Even if we do everything we want lit to do (e.g. write out an XML report) and then call abort_now() via atexit.register(abort_now), it doesn't work. It breaks the testing because the lit process and the running copy of lit it is testing are in the same process group.
Calling abort_now() in the inner running copy of lit therefore also kills the outer running copy, which makes the run of the lit test suite fail.
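
For context, abort_now() behaves roughly like this (POSIX only shown; lit's actual implementation may differ in detail):

  import os
  import signal
  import sys

  def abort_now():
      if os.name == "posix":
          # pid 0 means "every process in the caller's process group", which is
          # why an inner lit (run as a test) calling this also kills the outer lit.
          os.kill(0, signal.SIGKILL)
      else:
          sys.exit(3)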

One approach to fixing this could be to have lit run each test in its own process group. Then, when we need to kill things, we kill each group. This approach means we can kill lit's child processes without killing the main lit process itself. It also potentially means we could fix
the per-test-timeout feature to kill tests this way and thus avoid our dependency on the psutil module. This approach has some downsides though (a rough sketch of the idea follows the list below).

  • Won't work on Windows. We'd need to come up with something else.
  • SIGKILL'ing the process group of the main lit process will no longer kill all child processes. I'm not sure if this is a problem or not.
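
The per-test process group idea might look roughly like this on POSIX (hypothetical, not part of this patch; Python 2 would need preexec_fn=os.setpgrp instead of start_new_session):

  import os
  import signal
  import subprocess

  def run_test_in_own_group(cmd):
      # start_new_session puts the test (and anything it spawns) into a new
      # session/process group, separate from lit's own group.
      return subprocess.Popen(cmd, start_new_session=True)

  def kill_test_group(proc):
      # Kill the whole group without touching the main lit process.
      try:
          os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
      except OSError:
          pass  # the group has already exited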

Another approach is to depend on the psutil module again and have it walk through all of lit's child processes and kill them. This is the simplest short-term solution, and it's probably the right thing to do because --max-time doesn't seem that widely used; otherwise someone would have already tried to fix --max-time before me.
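
Something along these lines (an illustrative sketch only; the timeout value is arbitrary):

  import psutil

  def kill_all_lit_children():
      # Walk every descendant of the current (lit) process, kill it, then
      # reap the lot so nothing is left lying around on the bots.
      children = psutil.Process().children(recursive=True)
      for child in children:
          try:
              child.kill()
          except psutil.NoSuchProcess:
              pass
      psutil.wait_procs(children, timeout=5)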

What do you think?

rnk added a reviewer: thakis. May 30 2018, 11:36 AM
rnk added a comment. May 30 2018, 11:40 AM

+@thakis, who has been prototyping some lit optimizations, I think.

Another approach is to depend on the psutil module again and have it walk through all of lit's child processes and kill them. This is the simplest short-term solution, and it's probably the right thing to do because --max-time doesn't seem that widely used; otherwise someone would have already tried to fix --max-time before me.

That sounds like a good way to go.

The multiprocessing shutdown behavior leaves a lot to be desired, and it's basically impossible to write reliable tests for it. I regularly have problems interrupting lit test execution here on Windows. It worked better when I used mintty, but I switched back to cmd after updating to Windows 10. I think improving all of that behavior is a separate issue that's out of scope for this, and probably a higher priority.

In D47210#1116629, @rnk wrote:

+@thakis, who has been prototyping some lit optimizations, I think.

Another approach is to depend on the psutil module again and have it walk through all of lit's child processes and kill them. This is the simplest short-term solution, and it's probably the right thing to do because --max-time doesn't seem that widely used; otherwise someone would have already tried to fix --max-time before me.

That sounds like a good way to go.

The multiprocessing shutdown behavior leaves a lot to be desired, and it's basically impossible to write reliable tests for it. I regularly have problems interrupting lit test execution here on Windows. It worked better when I used mintty, but I switched back to cmd after updating to Windows 10. I think improving all of that behavior is a separate issue that's out of scope for this, and probably a higher priority.

Okay. Is there a portable and reliable way for the worker processes to catch pool.terminate() being called from the parent process? Performing the process traversal and kill in the parent process is likely to be racy, so it's better if we can do it in the worker processes.

If we did the process traversal and kill work in the parent process, it might be very unreliable, because the only place we can do it is just before calling pool.terminate(). We can't do it after, because by then the child processes have already lost their parent and so won't be found by a traversal of the process tree.
If we do it before calling pool.terminate(), there's a risk the pool might (depending on how it's implemented) respawn worker processes if it detects they died and start redoing work.

rnk added a comment. May 31 2018, 1:29 PM

Okay. Is there a portable and reliable way for the worker processes to catch pool.terminate() being called from the parent process? Performing the process traversal and kill in the parent process is likely to be racy, so it's better if we can do it in the worker processes.

I don't think so. I think terminate is intended to be unclean, i.e. you might get SIGKILL-like behavior on some platforms.

If we did the process traversal and kill work in the parent process, it might be very unreliable, because the only place we can do it is just before calling pool.terminate(). We can't do it after, because by then the child processes have already lost their parent and so won't be found by a traversal of the process tree.
If we do it before calling pool.terminate(), there's a risk the pool might (depending on how it's implemented) respawn worker processes if it detects they died and start redoing work.

It might not be as racy as you think. pids are not available for reuse until after the parent process has waited on them. This is how you get "zombie" processes in Unix, I think. So, taking the interrupt in the parent, iterating, killing each in turn, then waiting, shouldn't be racy: either the child worker will finish before it receives the kill signal, or it will take the kill signal. Then we wait on it.
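
Roughly (a sketch only; worker_pids stands in for however we obtain the pool's worker process IDs):

  import os
  import signal

  def kill_and_reap(worker_pids):
      for pid in worker_pids:
          try:
              os.kill(pid, signal.SIGKILL)
          except OSError:
              pass  # the worker is already gone
      for pid in worker_pids:
          try:
              # A child's pid cannot be reused until we have waited on it, so
              # the kill loop above cannot hit an unrelated process.
              os.waitpid(pid, 0)
          except OSError:
              pass  # already reaped elsewhere (e.g. by the pool itself)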

You will need a way to cancel the creation of new workers. Does the .close() method do that?

In D47210#1118125, @rnk wrote:

Okay. Is there a portable and reliable way for the worker processes to catch pool.terminate() being called from the parent process? Performing the process traversal and kill in the parent process is likely to be racy, so it's better if we can do it in the worker processes.

I don't think so. I think terminate is intended to be unclean, i.e. you might get SIGKILL-like behavior on some platforms.

Looking at the implementation, though, multiprocessing.Pool.terminate() calls .terminate() on each multiprocessing.Process, which appears to send SIGTERM to the worker processes.
So it looks like I could handle killing child processes in the worker processes, at least on POSIX systems.
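
If so, a worker-side handler could look roughly like this (a hypothetical sketch, POSIX only; not what this patch does):

  import signal
  import sys

  import psutil

  def _kill_own_children_and_exit(signum, frame):
      # Runs in the worker when the parent's pool.terminate() delivers SIGTERM;
      # clean up the test processes this worker spawned before exiting.
      for child in psutil.Process().children(recursive=True):
          try:
              child.kill()
          except psutil.NoSuchProcess:
              pass
      sys.exit(1)

  def worker_initializer():
      # Install the handler once per worker, e.g. from the pool's initializer.
      signal.signal(signal.SIGTERM, _kill_own_children_and_exit)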

If we did the process traversal and kill work in the parent process, it might be very unreliable, because the only place we can do it is just before calling pool.terminate(). We can't do it after, because by then the child processes have already lost their parent and so won't be found by a traversal of the process tree.
If we do it before calling pool.terminate(), there's a risk the pool might (depending on how it's implemented) respawn worker processes if it detects they died and start redoing work.

It might not be as racy as you think. pids are not available for reuse until after the parent process has waited on them. This is how you get "zombie" processes in Unix, I think. So, taking the interrupt in the parent, iterating, killing each in turn, then waiting, shouldn't be racy: either the child worker will finish before it receives the kill signal, or it will take the kill signal. Then we wait on it.

You will need a way to cancel the creation of new workers. Does the .close() method do that?

No, it doesn't. All this does is prevent new tasks from being added to the pool.

This (the creation of new workers) was the problem I was concerned with. The implementation of multiprocessing.Pool (at least for Python 2.7 on macOS) spawns a thread that periodically keeps the pool alive by respawning missing worker processes.
I could certainly do some hacks that abuse knowledge of the internal workings of multiprocessing.Pool to make sure the pool doesn't respawn workers when I kill processes, but that would be very fragile.

rnk added a comment. Jun 14 2018, 2:01 PM

Sorry, I haven't had time to read up and respond to this. It's very likely that I will go on vacation shortly, and then not have time to come back to this until August.

delcypher marked an inline comment as done. Jan 18 2019, 10:31 AM
delcypher added inline comments.
utils/lit/lit/run.py
159

Note to self. We need test.config.parallelism_group = None here when this patch is applied upstream.

Is there any plan to fix this patch and land it? We use --max-time in our tests and intermittently run into the Python assertion. We can work around the problem by killing lit.py with another timer, but then we lose the report that lit.py produces at the end of the run.

Is there any plan to fix this patch and land it? We use --max-time in our tests and intermittently run into the Python assertion. We can work around the problem by killing lit.py with another timer, but then we lose the report that lit.py produces at the end of the run.

@bryanpkc

No. I don't have time to work on it. There is a fundamental problem with this patch: long-running tasks do not get killed properly (see https://reviews.llvm.org/D47210#1115648).
Until that problem is solved we cannot land this patch.