This is an archive of the discontinued LLVM Phabricator instance.

Add per-test timeouts to lit
AbandonedPublic

Authored by jroelofs on Dec 9 2014, 1:58 PM.

Details

Summary

This is different than the option '--max-time', which limits total testing time.

Diff Detail

Event Timeline

jroelofs updated this revision to Diff 17095.Dec 9 2014, 1:58 PM
jroelofs retitled this revision from to Add per-test timeouts to lit.
jroelofs updated this object.
jroelofs edited the test plan for this revision. (Show Details)
jroelofs added reviewers: ddunbar, EricWF.
jroelofs added a subscriber: Unknown Object (MLST).
jroelofs added inline comments.
lit/Test.py
222 ↗(On Diff #17095)

Not sure this is the right place for this. Is there somewhere better it belongs?

jroelofs added inline comments.Dec 9 2014, 4:01 PM
lit/util.py
170 ↗(On Diff #17095)

Something here apparently doesn't work. Foolishly, I had only tested it with timeout=0. Currently debugging.

ddunbar edited edge metadata.Dec 9 2014, 4:15 PM

I would love to have this feature, so +1 on that.

I would prefer that it not introduce a new failure code though, and just
use FAIL with some extra message output.

It might be worth considering whether or not we should run the subprocesses
in their own process group if this is enabled, so that we can kill the
entire process group. I'm not sure if there is any downside to that, and I
would like to guarantee that if we have timeouts we never accidentally
leave a stray runaway subprocess, which might otherwise bog down a CI setup.

It looks like you are doing some more work on this so I will hold further
review until then.

  • Daniel
jroelofs updated this revision to Diff 17104.Dec 9 2014, 4:36 PM
jroelofs edited edge metadata.

Now with a version that actually works....

jroelofs updated this revision to Diff 17106.Dec 9 2014, 4:52 PM

Here's a version of it with the TIMEOUT result stuff removed.

I'm not sure how to grab the output that happens before a timeout is reached.

Good point about the subprocess group thing. I'm not sure how to do that portably though... looks like it's only supported on unixes: https://docs.python.org/2/library/os.html

jroelofs added inline comments.Dec 9 2014, 5:10 PM
utils/lit/lit/run.py
179 ↗(On Diff #17106)

More specifically, I mean, how do I grab the rest of the output that happened before the timeout here?

In addition to the inline comments, could you include some tests that exercise this code?

utils/lit/lit/Test.py
221 ↗(On Diff #17106)

I'd rather just change executeCommand to not need this than have it added here.

utils/lit/lit/main.py
212

Can we just spell this "--timeout", it's pretty obvious. If you were worried about the conflict with --max-time I'd rather rename that one to --max-testing-time or something similar, as its not a commonly used option (and I wonder if anyone other than me ever uses it). :)

utils/lit/lit/util.py
168

Can this code be factored out into a helper class? Then you wouldn't need to use the hack of wrapping timed_out in a list just so you can modify it.

170

Technically, you shouldn't need to do this, and it is probably a good safety check *not to*.

If you kill the process, then the p.wait() below should return the status showing that the process was terminated (and if it hangs, there is a problem).

jroelofs updated this revision to Diff 17108.Dec 9 2014, 5:57 PM

Some refactorings as @ddunbar suggested.

jroelofs added inline comments.Dec 9 2014, 5:59 PM
utils/lit/lit/main.py
212

sounds good

utils/lit/lit/util.py
156

thanks for the suggestion... I like this a lot more than what I had before.

Looks good, some nits inline. I would like to see a test or two before this goes in, though.

utils/lit/lit/util.py
11

This import isn't needed anymore.

156

Can this be lifted to the top-level of this file just to avoid the break in the code?

160

I would prefer if this class assumed timeout was non-None and the caller assumed the responsibility of handling the optionality of timeout (and just didn't create an instance)...

With this change the watchdog implementation gets a tad simpler.

161

This might be more readable as a regular method on the class instead of a nested function.

166

Sic.

171

This is currently broken, this should be self.timer, but with the change above there should only be a single assignment anyway.

178

There isn't a good reason to use a getter method here, clients can just use the property directly.

186

I'd add a newline (or two) at the start of this string, as there isn't much guarantee the existing stderr data has one.

Address more of ddunbar's comments, including adding some tests.

(doing this over email because phabricator is in the middle of some maintenance)

Jon

jroelofs updated this revision to Diff 17118.Dec 10 2014, 11:08 AM

Woo, phab is back up. Uploading the patch from the email.

@ddunbar, 3.6 branch is coming very soon (tomorrow?)... any chance we can get this rolling in time for that? Are there any more changes you'd like to see in this patch?

Hi Jonathan,

Thanks, this looks a lot better. I've noticed two more potential problems:

  1. Watchdog.watch() has an unlikely race where the timer may have already

fired -- it should handle this case by just killing the new proc
immediately I would guess.

  1. I tried your test test suite with execute_external=True, and it didn't

work for me. Does this work for you? I looked into it a little bit, and it
seems like the .communicate() method isn't returning in this case, even
though the subprocess does get killed. I haven't looked more than that...

  • Daniel
jroelofs updated this revision to Diff 24913.May 4 2015, 1:52 PM

Rebased.

@ddunbar I'm not sure I understand either of your two comments there. I think the first should already be handled, and the second seems to do the right thing for me when I use --timeout 1 --param=execute_external=True.

ping: @ddunbar, can you clarify your review comments please?

@danalbert any chance you can spot the race condition that @ddunbar hinted at in Watchdog.watch()? I don't see it.

Hi Jonathan,

The race I was talking about in watch() is this:

  1. Client opens Watchdog(...)
  2. Watchdog.handler() fires, does nothing because self.popens is empty.
  3. Client calls Watchdog.watch()
  4. Now that process will never be killed by the timeout. --

For the second comment, you can't just set a param to change
execute_external, you need to change it in the test suites lit.cfg file. If
you can't reproduce with that I am not sure...

  • Daniel

Ah, I see it now. Thanks. I'll re-work that part of Watchdog's interface.

jroelofs updated this revision to Diff 26270.May 21 2015, 12:57 PM

Fixed the Watchdog race.

Can't reproduce the other issue @ddunbar mentioned.

jroelofs updated this revision to Diff 26597.May 27 2015, 8:32 AM

re-upload with more context

EricWF edited edge metadata.May 27 2015, 9:02 AM

Misc comments. More to come.

utils/lit/lit/util.py
19

Why not start the timer when watch(...) is called for the first time? It seems like some code takes steps between creating the Watchdog and the call to watch(...). Those steps may unfairly count against the test timeout.

31

What thread of execution does Threading.timer execute this function in? I think we would need to add a lock to Watchdog if the handler runs in a separate thread.

45

How can self.timer be None?

200

How does executeCommand communicate to the caller that a timeout has occurred?

jroelofs added inline comments.May 27 2015, 10:06 AM
utils/lit/lit/TestRunner.py
308

Here too.

utils/lit/lit/util.py
19

sounds reasonable.

31

Good point.

I think the timer executes its action in the "other" thread (i.e. not the one that create the timer).

45

oops... leftover crud.

200

Some text about it is added to 'err' and returned. Because the timeout happened, the exitcode should be nonzero too. See below, line ~216. Also in executeScriptInternal.

jroelofs updated this revision to Diff 26613.May 27 2015, 10:06 AM
jroelofs edited edge metadata.

address @EricWF's review comments.

Leading question: If a shell command guarded by a Watchdog exits scope via an exception should Watchdog.cancel() be called? I think the Watchdog should provide a factory function that can be used in a with statement. The definition would looks something like this:

@contextmanager
def createWatchdog(timeout, popen=None):
    wg = Watchdog(timeout)
    if popen is not None:
      wg.watch(popen)
    yield wg
    wg.cancel()

I think the usage pattern for Watchdog should be:

p = subprocess.Popen(cmd)
with createWatchdog(timeout, p) as wg:
    out,err = p.communicate()
    exitCode = p.wait()

What do you think?

utils/lit/lit/TestRunner.py
294

Should this call to cancel be exception safe?

utils/lit/lit/util.py
35

I'm not sure if this matters, but I want to clarify what is considered a "timeout failure". For example, what if the timers invokes the handler but every open process has already finished? Should this be considered a timeout?

226

Should this call to cancel be exception safe?

Leading question: If a shell command guarded by a Watchdog exits scope via an exception should Watchdog.cancel() be called? I think the Watchdog should provide a factory function that can be used in a with statement. The definition would looks something like this:

@contextmanager
def createWatchdog(timeout, popen=None):
    wg = Watchdog(timeout)
    if popen is not None:
      wg.watch(popen)
    yield wg
    wg.cancel()

I think the usage pattern for Watchdog should be:

p = subprocess.Popen(cmd)
with createWatchdog(timeout, p) as wg:
    out,err = p.communicate()
    exitCode = p.wait()

What do you think?

Leading question: If a shell command guarded by a Watchdog exits scope via an exception should Watchdog.cancel() be called?

I think in this case, the watchdog should kill all the processes it was watching.

I think the Watchdog should provide a factory function that can be used in a with statement.

Oooh, I like it.

utils/lit/lit/util.py
35

I consider a "timeout failure" to happen when the timer runs out before cancel() is called, regardless of whether the handler had to kill anything.

226

'exception safe' in what sense... do you expect the timer cancellation to throw?

or did you mean 'thread safe'?

jroelofs updated this revision to Diff 26716.May 28 2015, 12:01 PM

Re-work Watchdog to support being used as a context manager.

jroelofs updated this revision to Diff 26719.May 28 2015, 12:03 PM

delete dead code

ping.

@EricWF I don't remember what we last discussed about this... any more thoughts on it?

Other than the inline comment this LGTM but I'll let @ddunbar give it the thumbs up.

utils/lit/lit/util.py
83

It is incorrect to re-raise the passed in exception.
https://docs.python.org/2/reference/datamodel.html#object.__exit__

jroelofs updated this revision to Diff 27136.Jun 4 2015, 11:33 AM

Addressed the bug @EricWF pointed out.

Hi Jonathan,

When I try your patch on OS X, the test suite just hangs. Unfortunately I
don't have much time to investigate...

  • Daniel
jroelofs updated this revision to Diff 29090.Jul 6 2015, 8:55 AM

I investigate a little bit, and I believe what happens is that executeCommand() is hanging in p.communicate(), because the file descriptors haven't necessarily been closed. It did look like I could fix this by explicitly forcing a close on the file descriptors in _safely_kill(), e.g.:

os.close(popen.stdout.fileno())
os.close(popen.stderr.fileno())

but I'm not confident that is reliable.

How are you running the testsuite? I've been copy-pasting the command that the makefile runs, and adding --timeout=1 to it.

Getting the kind of behavior right is tricky, esp. while trying to be cross platform, and getting it right in Python where it isn't obvious what layers like subprocess() are doing makes me nervous. Also, I know there are some gotchas here on Windows that I would want to know are covered. Do you have access to all three major platforms to test this logic on?

Tested on Darwin14, and Ubuntu10.04... Seems to work there for me without closing the FDs. I don't have a windows box to check this on.

(Python is 2.7.9 on the Darwin machine, and 2.7.3 on the Ubuntu machine.)

Also, note that kill() will deadlock if it is ever called since both it and _handler() try to take the lock. In this example, it means that if I Ctrl-C lit after it hangs in the above then it will deadlock in the context manager exit function.

This revision fixes the deadlock on kill(). Ctrl-C seems to DTRT.

I was just running the top-level suite, with lit -sv tests.

However, I was able to reproduce the deadlock I saw by just adding this to

util.py:

+if name == 'main':
+ print 'running watchdog test.'
+ executeCommand(['/bin/bash',
+
'/Volumes/Data/ddunbar/llvm/utils/lit/tests/Inputs/timeout/Output/infloop.py.script'],
+ None, None, 0.1)

+ raise SystemExit

This was with Python 2.7.6, which might matter.

  • Daniel
jroelofs updated this revision to Diff 29119.Jul 6 2015, 12:49 PM
jroelofs updated this revision to Diff 29120.

Sorry for the diff spam.

utils/lit/lit/util.py
37

I don't know enough about python implementation details to know if this is 'correct' cross-platform-wise, but it appears to work on Darwin, Linux, and Cygwin on Windows.

I got ahold of a Cygwin+Windows machine and was able to test your deadlock reproducer on it with this patch. I don't have a full development environment on that machine, so I can't run all the other tests that need FileCheck, nor the full llvm testsuite, but this looks promising.

jroelofs abandoned this revision.Jul 14 2015, 4:13 PM