
test runner: switch to pure-Python timeout mechanism
ClosedPublic

Authored by tfiala on Sep 23 2015, 11:08 PM.

Details

Summary

For all the parallel test runners, provide a pure-Python mechanism for timing out dotest inferior processes that run for too long.

Stock OS X and Windows do not have a built-in timeout helper app. This allows those systems to support timeouts. It also transforms the timeout from being implemented via a system process to costing a single thread.
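In outline, the pure-Python mechanism amounts to a single watchdog thread per inferior. A minimal sketch of the idea, assuming a watchdog-thread design (the names here are illustrative, not the patch's actual API):

```python
import subprocess
import threading

def run_with_timeout(command, timeout_seconds):
    """Run command, killing it if it runs longer than timeout_seconds.

    Returns (returncode, timed_out). The timeout costs one thread
    rather than a helper process like the timeout/gtimeout tool.
    """
    process = subprocess.Popen(command)
    timed_out = [False]  # mutable cell so the timer thread can flag it

    def kill_on_timeout():
        timed_out[0] = True
        process.kill()

    watchdog = threading.Timer(timeout_seconds, kill_on_timeout)
    watchdog.start()
    try:
        returncode = process.wait()
    finally:
        watchdog.cancel()  # no-op if the timer already fired
    return returncode, timed_out[0]
```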

Tested on Linux and OS X with positive timeout verification.

Greg: this also comments out one of the curses cleanup calls into keyboard handling. I found that it always hung the test shutdown on the curses runner. (I attached with a Python debugger to verify.)

Diff Detail

Event Timeline

tfiala updated this revision to Diff 35588.Sep 23 2015, 11:08 PM
tfiala retitled this revision from to test runner: switch to pure-Python timeout mechanism.
tfiala updated this object.
tfiala added reviewers: zturner, clayborg, labath.
tfiala added a subscriber: lldb-commits.
tfiala added inline comments.
test/curses_results.py
223 ↗(On Diff #35588)

Greg, let me know if there's something more sensible I can do here rather than comment it out. It never seemed to return and just kept looping.

labath edited edge metadata.Sep 24 2015, 3:36 AM

I don't want to stand in the way of progress (and I do think that getting rid of the timeout dependency is progress), but this implementation regresses in a couple of features compared to using timeout:

  • timeout tries (with moderate success) to clean up children spawned by the main process. Your implementation will (afaik) kill only the main process. This is especially important for build bots, since leaking processes will starve the bot's resources after a while (and we have had problems with this on our darwin build bot).
  • we intentionally had timeout terminate the child processes with SIGQUIT, because this produces core files of the terminated processes. I have found this very useful when diagnosing the sources of hanging tests. I'd like to have the possibility to do this, even if it will not be enabled by default.

Do you think that the benefits of this change outweigh these issues? If so, and they don't cause our bots to break, then we can put it in (and I'll probably implement the core dumping feature when I find some time), but I thought you should be aware of these downsides.

zturner edited edge metadata.Sep 24 2015, 8:26 AM
zturner added a subscriber: zturner.

If I understand correctly, the process hierarchy used to look like this:

python - dotest.py

|__ python - multiprocessing fork
      |__ python - lldbtest
            |__ inferior executable

And now looks like this:

python - dotest.py

|__ python - lldbtest
      |__ inferior executable

In either case, the process.terminate() runs in the python - lldbtest
process, and terminates the inferior. So I don't see how the behavior is
any different than before. Do you mean with gtimeout it's creating a
separate process group for the inferior so that if the inferior itself
spawns children, those will get cleaned up too? Do any inferiors actually
do that? And if so, we could always use the same logic from python of
creating the inferior in a separate process group.
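On POSIX systems, the "create the inferior in a separate process group" logic mentioned here can be sketched as follows; the helper names are hypothetical, not from the patch:

```python
import os
import signal
import subprocess

def launch_in_new_group(command):
    """Start command as the leader of a new process group (POSIX only).

    preexec_fn runs in the child after fork and before exec, so the
    child becomes its own process-group leader.
    """
    return subprocess.Popen(command, preexec_fn=os.setpgrp)

def kill_group(process, signum=signal.SIGKILL):
    """Signal the entire process group led by process, so children
    spawned by the inferior get cleaned up too."""
    os.killpg(process.pid, signum)
```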

I don't want to stand in the way of progress (and I do think that getting rid of the timeout dependency is progress), but this implementation regresses in a couple of features compared to using timeout:

  • timeout tries (with moderate success) to clean up children spawned by the main process. Your implementation will (afaik) kill only the main process. This is especially important for build bots, since leaking processes will starve the bot's resources after a while (and we have had problems with this on our darwin build bot).

Fair enough. Either yesterday or the day before, I put up a change that ensures the dotest inferiors are in a separate process group. I could (on non-Windows) send that process group a SIGQUIT. The reason I prefer not to use SIGQUIT is that it is catchable, and thus can be ignored. Once something can be ignored, the runner needs to be able to handle the timed out child really not ending with a SIGQUIT. That means either more logic around timing out on the "wait for death", or possibly not reaping. In the case of a test runner, I'd probably put the focus on making sure the test runner makes it through ahead of getting the coredump from a process that times out.

That said, my initial implementation of this did do the following:

  • Start with a SIGQUIT (via subprocess.Popen.terminate(), so it does something reasonable on Windows). Wait for 5 seconds to see if it wraps up (by way of waiting for the communicate() thread to die).
  • If still not dead, send a SIGKILL (via subprocess.Popen.kill()).

I thought the added complication wasn't worth it, but it's not that complicated. And would not regress behavior on the Linux side. The only diff would be, on non-Windows, send a kill to the process group (which has already been created for the inferior dotest), and just use the terminate() call on Windows.

I can give that a whirl. It's a bit more code but is not going to be too bad.
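The soft-then-hard escalation described above could look roughly like this on POSIX. This is a sketch assuming Python 3's `wait(timeout=...)`; the actual patch's code differs:

```python
import signal
import subprocess

def terminate_with_escalation(process, soft_timeout=5.0):
    """Ask nicely with a catchable SIGQUIT (which may produce a core
    file), then escalate to SIGKILL if the process ignores it.

    Returns the process's final return code.
    """
    process.send_signal(signal.SIGQUIT)
    try:
        # Give the process soft_timeout seconds to wrap up (or dump core).
        return process.wait(timeout=soft_timeout)
    except subprocess.TimeoutExpired:
        process.kill()  # SIGKILL cannot be caught or ignored
        return process.wait()
```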

  • we intentionally had timeout terminate the child processes with SIGQUIT, because this produces core files of the terminated processes. I have found this very useful when diagnosing the sources of hanging tests. I'd like to have the possibility to do this, even if it will not be enabled by default.

    Do you think that the benefits of this change outweigh these issues? If so, and they don't cause our bots to break, then we can put it in (and I'll probably implement the core dumping feature when I find some time), but I thought you should be aware of these downsides.

I haven't followed the recent changes too closely, but at some point the hierarchy was:

dosep.py
  dosep.py fork
    timeout
      dotest.py (lldb client)
        lldb-server
          inferior
        other executables which were sometimes spawned by the tests

Timeout was making sure none of its children survived after the test timed out. I don't know how it did that, but I presume it ran them in a separate process group. This was not always successful (some of our tests like to create process groups as well), but it was pretty good at compartmentalizing the tests and making sure we leave no processes hanging around.

I don't want to stand in the way of progress (and I do think that getting rid of the timeout dependency is progress), but this implementation regresses in a couple of features compared to using timeout:

  • timeout tries (with moderate success) to clean up children spawned by the main process. Your implementation will (afaik) kill only the main process. This is especially important for build bots, since leaking processes will starve the bot's resources after a while (and we have had problems with this on our darwin build bot).

Fair enough. Either yesterday or the day before, I put up a change that ensures the dotest inferiors are in a separate process group. I could (on non-Windows) send that process group a SIGQUIT. The reason I prefer not to use SIGQUIT is that it is catchable, and thus can be ignored. Once something can be ignored, the runner needs to be able to handle the timed out child really not ending with a SIGQUIT. That means either more logic around timing out on the "wait for death", or possibly not reaping. In the case of a test runner, I'd probably put the focus on making sure the test runner makes it through ahead of getting the coredump from a process that times out.

None of the processes we run catch SIGQUIT, so it's mostly safe, but we can do the QUIT+sleep+KILL combo to be safe.

The problem with the race conditions is that they are very hard to reproduce. I often need to run the test suite at full speed dozens of times to be able to catch it happening, and then the only way of diagnosing the issue is digging through the core files.

Thanks for working on this once again :)

If I understand correctly, the process hierarchy used to look like this:

python - dotest.py

|__ python - multiprocessing fork
      |__ python - lldbtest
            |__ inferior executable

And now looks like this:

python - dotest.py

|__ python - lldbtest
      |__ inferior executable

In either case, the process.terminate() runs in the python - lldbtest
process, and terminates the inferior. So I don't see how the behavior is
any different than before. Do you mean with gtimeout it's creating a
separate process group for the inferior so that if the inferior itself
spawns children, those will get cleaned up too?

The man page for timeout (this one here: http://linux.die.net/man/1/timeout) doesn't explicitly call out that it creates a process group, but that wouldn't at all be out of character for a command like that. Interestingly, it does call out that using a catchable signal means it can't guarantee it really shuts down the app if the app handles the signal and essentially ignores it.

do that? And if so, we could always use the same logic from python of
creating the inferior in a separate process group.

We have that logic in for non-Windows. That's the change I sent you a ping on to see if you wanted to add the Windows flag for creating a separate process group. Since I no longer recall the semantics of that on the Windows side, I wasn't sure if that was useful to you.

I think sending the signal to the process group (since we have one) is probably the right way to go here. But I also think we get into the realm of "did we really kill it" if we start playing around with catchable signals. Also, there is going to be a divergence on how that works on Unix-like vs. Windows. The process group kill command is os.killpg(), which is listed as Unix-only.

I'll come up with the alternative and we can see where we go from there.

Thanks for working on this once again :)

Sure thing!

I'm ok with leaving some straggling processes on Windows for now, because
it's obviously better than what we have currently, which is nothing. I can
implement some sort of helper module at some point that exposes a new
createProcess function that supports this for Windows and Unix with a
single interface once it becomes a problem.

Which reminds me of another thing: Since you're working heavily on the test
suite, one thing I've always wanted is what I just mentioned: Something
like an lldbxplat module which is essentially a helper module that exposes
a single interface for doing things that differ depending on the platform.
This would essentially let us remove almost every single occurrence of `if
sys.platform.startswith('win32')` from the entire test suite, which would
really help maintainability.

It's something I'll get to myself one day if nobody else beats me, but
since you're making a lot of great improvements to the test suite recently,
figured I'd throw it out there.

Which reminds me of another thing: Since you're working heavily on the test
suite, one thing I've always wanted is what I just mentioned: Something
like an lldbxplat module which is essentially a helper module that exposes
a single interface for doing things that differ depending on the platform.
This would essentially let us remove almost every single occurrence of `if
sys.platform.startswith('win32')` from the entire test suite, which would
really help maintainability.

That's a good idea. I'm (somewhat slowly) attempting to work in some clean up each time I touch it. I'll hit that at some point. I'll start winding down on changes to the test infrastructure soon-ish. If I don't hit that by the time I stop heavy changes, definitely feel free to jump on it!

I do intend to get back to this after resolving the cmake build issue on OS X. Hopefully by tonight.

test/curses_results.py
223 ↗(On Diff #35588)

Greg got rid of this call this morning. There was a 'q' command intended to be there that wasn't yet, which prevented us from exiting at the end. For now we're just taking this out.

That's a good idea. I'm (somewhat slowly) attempting to work in some clean up each time I touch it. I'll hit that at some point. I'll start winding down on changes to the test infrastructure soon-ish. If I don't hit that by the time I stop heavy changes, definitely feel free to jump on it!

Okay, I couldn't help myself. I'm taking a bit more time on this so I can set it up much more nicely. I'm pulling all the process management into another module and breaking out some platform-specific bits into a per-platform helper.

I won't have something up until later tomorrow at the earliest.

+100, great :) Once it's in it will be much easier to press for others
to move their platform-specific bits into this module, or to do it myself
when I'm writing platform-specific stuff.

+100, great :)

:-)

tfiala updated this revision to Diff 35788.Sep 25 2015, 6:24 PM
tfiala edited edge metadata.

Work in progress.

This patch is working and timing out consistently on Linux (on a Jenkins bot), and is running normally on OS X. I haven't coerced a timeout on OS X yet.

I'll be adding some tests for the timeout and core generation, and plan to add a --no-core-for-timeout option since I don't think that is universally desirable. (Right now it always attempts to generate cores via SIGQUIT on Posix-like systems, but the code is there to do a SIGTERM instead if no cores are desired. I just haven't hooked it up.)

Zachary, you can see what I've got in place on the Windows side. If you know some of that is wrong, feel free to let me know ahead of time and I can change it.

Hope to get those tests in tonight and have a formal "eval this patch" setup.

tfiala added inline comments.Sep 25 2015, 6:30 PM
test/dosep.py
245

This message should include the filename.

test/process_control.py
273 ↗(On Diff #35788)

This will want to store the popen_process.using_process_groups property like I do on the Posix side if terminate would do something different based on the presence or absence of process groups. Not storing it was an oversight.

I won't be able to have a serious look until Monday, as I'm still remote.
Hopefully you aren't working on weekends :)

I won't be able to have a serious look until Monday, as I'm still remote.

Oh no worries.

Hopefully you aren't working on weekends :)

:-P

tfiala updated this revision to Diff 35814.Sep 26 2015, 4:37 PM

Minor refactor, moving new components to test/test_runner/lib.

Will be adding test runner tests to test/test_runner/test and will make sure the normal test runner doesn't try to run them. (These will be test infrastructure tests, i.e. tests for the test infrastructure, not tests that run on them).

This is vs. llvm.org lldb trunk r248676.

Note I've also now had positive timeout confirmation on OS X. As I suspected, it is less convenient there to have the soft kill do a SIGQUIT, since on OS X this brings up the crash log generator dialog, where user input is now required (very bad if the tests are running in a session that doesn't have access to a window server). I'll be working in --no-core-on-timeout and --core-on-timeout options, defaulting to "no" on OS X and "yes" everywhere else. When no cores are desired, we'll generate a SIGTERM instead of a SIGQUIT on the soft terminate request.

When no cores are desired, we'll generate a SIGTERM instead of a SIGQUIT on the soft terminate request.

That on the POSIX-y systems, of course. It would be cool if Zachary can work in the mini crashdumps or similar on Windows.

tfiala updated this revision to Diff 35829.EditedSep 27 2015, 10:28 PM

Tests added. Ready for review.

The change now has two levels of terminate:

  • soft terminate, which uses a signal or process control mechanism to tell the process to end. It optionally can use a mechanism that triggers a core dump / crashlog / minidump if supported by the underlying platform. Right now, only Linux of the Posix-y platforms requests the core (via SIGQUIT); the others use SIGTERM. I will plumb through an option for that later (as mentioned in prior comments in this review), but I'm now at the max time I can spend on this and this'll need to do. There is also a timeout on how long the process driver will allow the soft terminate to take to wrap up. This is relevant for core dumps that might have a huge amount of memory to dump to disk. It defaults to 10 seconds, but can be overridden by an environment variable, LLDB_TEST_SOFT_TERMINATE_TIMEOUT. That env var will be converted to a float from whatever text is in it. It represents the number of seconds within which any soft terminate attempt must wrap up. If a process is actively blocking/ignoring that termination signal, then this represents the total time that will be "wasted" waiting to figure this out.
  • hard terminate: after the soft terminate attempt is made, plus the soft terminate timeout, then if the process is still not eliminated, the hard terminate mechanism will be used. On Posix-y systems, this is a SIGKILL.
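The environment-variable override described above reduces to something like this. A sketch: only the LLDB_TEST_SOFT_TERMINATE_TIMEOUT name and the 10-second default come from the text; the function name is invented:

```python
import os

DEFAULT_SOFT_TERMINATE_TIMEOUT = 10.0

def soft_terminate_timeout():
    """Seconds to allow a soft terminate (and any core dump it triggers)
    to wrap up before escalating to a hard terminate.

    Overridable via the LLDB_TEST_SOFT_TERMINATE_TIMEOUT environment
    variable, which is converted to a float.
    """
    value = os.environ.get("LLDB_TEST_SOFT_TERMINATE_TIMEOUT")
    if value is None:
        return DEFAULT_SOFT_TERMINATE_TIMEOUT
    return float(value)
```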

@zturner, you will want to run the tests in lldb/test/test_runner/test. Do that with:
cd lldb/test/test_runner/test
python process_control_tests.py

In some cases where docs were clear, I did the bit that seemed to make sense on the Windows side. In others, I left it for you to fill in. Feel free to skip some/any tests that don't make sense if, for example, you only end up wanting to support one level of killing on Windows. (We can probably work in a "has soft terminate mechanism" query into the process_control.ProcessHelper class, which has per-platform derived classes, if we want to go that way).

Also note I will start migrating test infrastructure bits into the lldb/test/test_runner/lib directory over time. The two new files from this change are going in there.

I'd like to get this in sooner than later as it resolves numerous hangs on OS X where we don't have a timeout guard. I've tried it on Linux (4, 8 and 24 core machines) as well as OS X (8 core MBPs), and have run the process_control_tests on them without issue.

Will test this out tomorrow

Looks good to me. :)

If you do end up adding a command line option, I think you can go for --enable-core-on-timeout and default to off for all platforms. I suspect I am the only one using this, and it makes a much safer default.

Looks good to me. :)

If you do end up adding a command line option, I think you can go for --enable-core-on-timeout and default to off for all platforms. I suspect I am the only one using this, and it makes a much safer default.

Okay, sounds good. I expect some additions that will be needed from Zachary, so if we want to default it off and add a flag, I'll probably get that part in with the initial check-in.

Sorry, our desks were reconfigured over the weekend, so I just now got my
computer turned on. I'm syncing code and hopefully will have a working
build soon.

Sorry, our desks were reconfigured over the weekend, so I just now got my
computer turned on. I'm syncing code and hopefully will have a working
build soon.

Sounds good. I expect you'll find a few things you'll want to add/adjust. If you get those patches to me, I'll include them in the check-in.

Sorry, our desks were reconfigured over the weekend, so I just now got my
computer turned on. I'm syncing code and hopefully will have a working
build soon.

Sounds good. I expect you'll find a few things you'll want to add/adjust. If you get those patches to me, I'll include them in the check-in.

Also note the test code for this check-in assumes there is a python in the path when it builds the command to run the inferior test subject. (This is the thing we run against to verify that the ProcessDriver --- aka child process runner with timeout support --- gets return codes, witnesses timeouts, witnesses child processes that choose to ignore soft terminate signals and gets the bad guy with a more lethal (hard) termination option, etc.). You might need to tweak that in the test runner. That python invocation doesn't require any special lldb support - it is just testing that a process can be launched, and that process happens to be a python interpreter session running the inferior.py test subject. So I figured that was probably fine as is.

Can you rebase against ToT? I'm having trouble applying the patch.

Sorry, ignore me. My brain is just off, it's working

Sorry, ignore me. My brain is just off, it's working

Okay, cool. Yeah I can update it if needed, I think I saw a 1-line dosep.py go in between my last update late last night and early this morning. But it sounds like you made it past there.

This is what I get.

d:\src\llvm\tools\lldb\test>cd test_runner

d:\src\llvm\tools\lldb\test\test_runner>cd test

d:\src\llvm\tools\lldb\test\test_runner\test>c:\Python27_LLDB\x86\python_d.exe
process_control_tests.py

..E.EE

ERROR: test_hard_timeout_works (__main__.ProcessControlTimeoutTests)

Driver falls back to hard terminate when soft terminate is ignored.

Traceback (most recent call last):

File "process_control_tests.py", line 177, in test_hard_timeout_works
  options="--sleep 120"),
File "process_control_tests.py", line 83, in inferior_command
  cls._suppress_soft_terminate(command)
File "process_control_tests.py", line 70, in _suppress_soft_terminate
  for signum in helper.soft_terminate_signals():

TypeError: 'NoneType' object is not iterable

ERROR: test_soft_timeout_works_core (__main__.ProcessControlTimeoutTests)

Driver uses soft terminate (with core request) when process times out.

Traceback (most recent call last):

File "process_control_tests.py", line 157, in test_soft_timeout_works_core
  self._soft_timeout_works(True)
File "process_control_tests.py", line 150, in _soft_timeout_works
  helper.was_soft_terminate(driver.returncode, with_core),
File "..\lib\process_control.py", line 192, in was_soft_terminate
  raise Exception("platform needs to implement")

Exception: platform needs to implement

ERROR: test_soft_timeout_works_no_core (__main__.ProcessControlTimeoutTests)

Driver uses soft terminate (no core request) when process times out.

Traceback (most recent call last):

File "process_control_tests.py", line 162, in

test_soft_timeout_works_no_core

  self._soft_timeout_works(False)
File "process_control_tests.py", line 150, in _soft_timeout_works
  helper.was_soft_terminate(driver.returncode, with_core),
File "..\lib\process_control.py", line 192, in was_soft_terminate
  raise Exception("platform needs to implement")

Exception: platform needs to implement


Ran 6 tests in 10.351s

FAILED (errors=3)
[37558 refs]

Looking into these now.

This is what I get.

d:\src\llvm\tools\lldb\test>cd test_runner

d:\src\llvm\tools\lldb\test\test_runner>cd test

d:\src\llvm\tools\lldb\test\test_runner\test>c:\Python27_LLDB\x86\python_d.exe
process_control_tests.py
..E.EE
======================================================================
ERROR: test_hard_timeout_works (__main__.ProcessControlTimeoutTests)
Driver falls back to hard terminate when soft terminate is ignored.


Traceback (most recent call last):

File "process_control_tests.py", line 177, in test_hard_timeout_works
  options="--sleep 120"),
File "process_control_tests.py", line 83, in inferior_command
  cls._suppress_soft_terminate(command)
File "process_control_tests.py", line 70, in _suppress_soft_terminate
  for signum in helper.soft_terminate_signals():

TypeError: 'NoneType' object is not iterable

This appears to be a missing "is not None" check on the result of helper.soft_terminate_signals(). I'm expecting we'll need to figure out how Windows will want to test the "child process ignores the pleasant/nice kill attempt -- aka a soft terminate". I doubt the right thing to do is to tell the inferior to ignore signal numbers on Windows. In this case, here:

File "process_control_tests.py", line 83, in inferior_command

cls._suppress_soft_terminate(command)

we'll need to figure out what Windows needs to do to tell the inferior.py script how to ignore soft terminate signals.

One option I almost did but wanted to figure out if you need it is to switch the inferior.py parameter from the sense of "ignore these signals" to "do whatever platform thing is necessary to avoid a soft terminate". For the Posix-y systems, that would be ignoring SIGTERM and SIGQUIT. For Windows, it would be whatever you need. Then the test side can just add that (new) flag that says something like "--ignore-soft-terminate". That is probably the way to go here. You still need to figure out what needs to go in the handler for that in inferior.py, but that would be the general idea.
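On the POSIX side, "do whatever platform thing is necessary to avoid a soft terminate" amounts to installing SIG_IGN handlers for SIGTERM and SIGQUIT, roughly like this (an illustrative sketch, not the actual inferior.py code):

```python
import signal

# Soft terminate signals used on POSIX: SIGTERM (no core) and
# SIGQUIT (requests a core dump). Both are catchable/ignorable.
SOFT_TERMINATE_SIGNALS = (signal.SIGTERM, signal.SIGQUIT)

def suppress_soft_terminate():
    """What a hypothetical --ignore-soft-terminate flag would do in the
    inferior: ignore the catchable terminate signals so only a hard
    terminate (SIGKILL) can end the process."""
    for signum in SOFT_TERMINATE_SIGNALS:
        signal.signal(signum, signal.SIG_IGN)
```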

======================================================================
ERROR: test_soft_timeout_works_core (__main__.ProcessControlTimeoutTests)
Driver uses soft terminate (with core request) when process times out.


Traceback (most recent call last):

File "process_control_tests.py", line 157, in test_soft_timeout_works_core
  self._soft_timeout_works(True)
File "process_control_tests.py", line 150, in _soft_timeout_works
  helper.was_soft_terminate(driver.returncode, with_core),
File "..\lib\process_control.py", line 192, in was_soft_terminate
  raise Exception("platform needs to implement")

Exception: platform needs to implement

======================================================================
ERROR: test_soft_timeout_works_no_core (__main__.ProcessControlTimeoutTests)
Driver uses soft terminate (no core request) when process times out.


Traceback (most recent call last):

File "process_control_tests.py", line 162, in

test_soft_timeout_works_no_core

  self._soft_timeout_works(False)
File "process_control_tests.py", line 150, in _soft_timeout_works
  helper.was_soft_terminate(driver.returncode, with_core),
File "..\lib\process_control.py", line 192, in was_soft_terminate
  raise Exception("platform needs to implement")

Exception: platform needs to implement

This is one of the spots I need you to fill in.


Ran 6 tests in 10.351s

FAILED (errors=3)
[37558 refs]

Looking into these now.

The "was_soft_terminate()" method is looking at the subprocess.Popen-like object's returncode and, judging by that, certifying that it was (or was not) a soft terminate that caused the exit.

If you need the Popen-like object to figure that out (e.g. if you need to look at some of the Windows-extended Popen attributes), we can rearrange this to take the Popen-like object rather than only its returncode. That would be totally fine.

As far as I can tell, the value of the return code is undocumented when you
use Popen.terminate() on Windows. I don't know what that means for the
patch. It's quite a bit more complicated than I anticipated based on the
original thread to lldb-dev. I thought it was just going to call
Popen.terminate and assume that means it was a hard terminate. I still
don't grasp the need for all the complexity.

A few experiments on a throwaway program suggest that using
Popen.terminate() sets returncode to 1, but there doesn't seem to be any
guarantee that this is always the case.

tfiala added a comment.EditedSep 28 2015, 12:42 PM

As far as I can tell, the value of the return code is undocumented when you
use Popen.terminate() on Windows.

Okay, interesting. I found what looked like a race on Linux and OS X where reading the Popen object's returncode member (the normal way to get the return code) could flip away from the actual result received after a wait() call. I.e., after:

popen_object.wait()
self.returncode = popen_object.returncode

a later read:

returncode2 = popen_object.returncode

could find

returncode2 != self.returncode

I switched to using the result of wait() as the official return code.

I don't know what that means for the
patch. It's quite a bit more complicated than I anticipated based on the
original thread to lldb-dev. I thought it was just going to call
Popen.terminate and assume that means it was a hard terminate. I still
don't grasp the need for all the complexity.

It's the need for the core file generation, which requires a soft (i.e. deflectable by the target process) terminate request. Which requires a 2-level kill mechanism, as we need to make sure the process really dies for a real test runner environment that needs to be long-lived.

I can do one more adjustment. I'll make the soft terminate optional per platform. We'll disable it on Windows and only enable it on Posixy stuff. Then on Windows there will only be the hard terminate.

A few experiments on a throwaway program suggest that using
Popen.terminate() sets returncode to 1, but there doesn't seem to be any
guarantee that this is always the case.

What I was doing was writing a value into the subprocess.Popen-like object on the Posix side, to store extra state needed. (e.g. whether we're killing by process group --- if we created a process group --- or if we're just killing a process directly. Those are different operations on the Posix side). For this, you could just write into the popen object that you killed it with a hard terminate. And then read back that value on the verification side to say "oh yes, that was really the mechanism by which it was killed."
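The "write extra state into the Popen-like object and read it back on the verification side" idea can be sketched as follows; the attribute and function names here are hypothetical:

```python
import subprocess

def hard_terminate(popen_process):
    """Kill the process, recording on the Popen object how it was
    killed so the verification side can check the mechanism later."""
    popen_process.kill_mechanism = "hard"  # extra attribute, read back later
    popen_process.kill()

def was_hard_terminate(popen_process):
    """Report whether the process was ended via the hard terminate path."""
    return getattr(popen_process, "kill_mechanism", None) == "hard"
```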

That point is moot, though, if there is only one way to kill on Windows. If there is only one way to kill, then one of these should happen:

  1. If we don't care about crashdumps on Windows, which I think I heard you say is true, then we make the soft terminate level optional by platform, and skip it (i.e. indicate it is not available) on Windows.
  2. If we do care about crashdumps on Windows, then we probably need to untangle the conflation of the soft terminate level (which a target process can ignore) from crash dump generation. On Posix, the crash dump request is tied to the soft termination signal used. But on Windows, we could allow that to happen at the (one and only) hard terminate level.

I can get that together (assuming option #1) relatively quickly.

I don't think we can drop the complexity of the 2-tier kill level in the driver, though, given the desire to support core dump generation.

I can get that together (assuming option #1) relatively quickly.

I don't think we can drop the complexity of the 2-tier kill level in the driver, though, given the desire to support core dump generation.

I'm not going to touch this until I hear back from you, though, Zachary. Anything I do on this is way beyond the time I have to allocate to it at this point, so I'll have to stub out a dead simple solution (i.e. essentially just about what it was doing before) for Windows if none of this sounds good for you. I have to get OS X to stop hanging.

Yea, on Windows we can support core dump generation externally without
touching the python script, so that's not needed for us.

Ironically, at the system level there is a way to specify the return code
of the process when you terminate it. And Python internally must be using
this API, because it's the only way to do it. They just decided not to
expose that parameter through the subprocess.terminate() method, and
instead they pass their own magic value. If someone is clever enough to
dig through python source code and figure out what value it uses, we could
use that. But I think on Windows it's safe to assume that if you call
process.terminate(), then it was a hard terminate, and we can ignore
everything else.

Does that make sense / is easy to do?

Yea, on Windows we can support core dump generation externally without
touching the python script, so that's not needed for us.

Okay, good!

Ironically, at the system level there is a way to specify the return code
of the process when you terminate it. And Python internally must be using
this API, because it's the only way to do it.

Hah, that's interesting.

> They just decided not to
> expose that parameter through the subprocess.terminate() method, and
> instead they pass their own magic value. If someone is clever enough to
> dig through python source code and figure out what value it uses, we could
> use that. But I think on Windows it's safe to assume that if you call
> process.terminate(), then it was a hard terminate, and we can ignore
> everything else.
>
> Does that make sense / is easy to do?

Yes, makes sense and yes I can accommodate by just making that soft terminate level optional by platform. I'll get to working on that after lunch and I hope to have a patch up way before I head home today.

A few more comments

test/test_runner/test/process_control_tests.py
64

Change nt to Windows (unless you're removing this logic as discussed earlier)

81

Change "python" to sys.executable

Random thought: If you want to generate a core dump, we already have LLDB
attached to the process, so you have an SBProcess. Couldn't you use
Process.Halt, then call whatever method is necessary to have the debugger
create a core, then Process.Kill?

tfiala added inline comments.Sep 28 2015, 2:20 PM
test/test_runner/test/process_control_tests.py
64

Ah okay. I was copying the "nt" from somewhere else in the source code. We might want to grep for that if it is wholesale wrong. I assumed that was leftover from the transition over to the NT codebase in the W2k time period.

81

Okay. Better.

zturner added inline comments.Sep 28 2015, 2:25 PM
test/test_runner/test/process_control_tests.py
64

It's really confusing. There's os.name and sys.platform, which return different strings. It's possible the one you saw was checking os.name, for which nt is one of the valid return values. That's another good candidate for the cross-platform portability module; we could just have lldb_platform.os() which returns an enum. I was a little bummed to see the process_control module as a subfolder of test_runner, because all of the helper-related stuff could be at a higher level, usable from anywhere in the test suite. But we can tackle that later.

> Random thought: If you want to generate a core dump, we already have LLDB
> attached to the process, so you have an SBProcess. Couldn't you use
> Process.Halt, then call whatever method is necessary to have the debugger
> create a core, then Process.Kill?

I'm not sure I follow. Here's the process layout:

  1. dotest.py parallel runner -> 2. worker thread/process -> 3. dotest.py inferior process -> 4. optional inferior test subject used by the lldb test

Item 3 above is the process that needs to be killed and possibly core dumped. It happens to be a process that may have an optional inferior (item #4) being tested by the lldb test code running in #3. But we're trying to kill 3, and all of its children, and have 3 core dump. (We don't care about a core dump for #4).

If we wanted to get a core for #4, that would seem possibly useful, but that's not the core we're looking for.

Did I follow your idea right?

zturner added inline comments.Sep 28 2015, 2:27 PM
test/test_runner/test/process_control_tests.py
64

Oh yea, in addition to sys.platform there's also platform.system, as you've used here. And even *those* don't return the same values.
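For anyone hitting the same confusion, here is a quick sketch of the three stdlib spellings being contrasted in this exchange (the values shown in comments are the documented ones, not an exhaustive list):

```python
import os
import platform
import sys

# Three different stdlib answers to "what OS is this?", each with its
# own vocabulary -- the root of the "nt" vs "Windows" mix-up above:
#   os.name           -> coarse API family: "posix", "nt", or "java"
#   sys.platform      -> build-time value: "linux", "darwin", "win32", ...
#   platform.system() -> runtime name: "Linux", "Darwin", "Windows", ...
print(os.name, sys.platform, platform.system())

def is_windows():
    # Pick one convention and stick to it; mixing the three is what bites.
    return os.name == "nt"
```

A helper like the `lldb_platform.os()` enum suggested above would hide exactly this inconsistency behind one call.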

> > Random thought: If you want to generate a core dump, we already have LLDB
> > attached to the process, so you have an SBProcess. Couldn't you use
> > Process.Halt, then call whatever method is necessary to have the debugger
> > create a core, then Process.Kill?
>
> I'm not sure I follow. Here's the process layout:
>
>   1. dotest.py parallel runner -> 2. worker thread/process -> 3. dotest.py inferior process -> 4. optional inferior test subject used by the lldb test
>
> Item 3 above is the process that needs to be killed and possibly core dumped. It happens to be a process that may have an optional inferior (item #4) being tested by the lldb test code running in #3. But we're trying to kill 3, and all of its children, and have 3 core dump. (We don't care about a core dump for #4.)
>
> If we wanted to get a core for #4, that would seem possibly useful, but that's not the core we're looking for.
>
> Did I follow your idea right?

Ahh ok I thought it was a core of the inferior. My bad. Anyway that doesn't change anything about what I said earlier about not needing the soft terminate logic on Windows :)

tfiala added a comment. Edited Sep 28 2015, 2:32 PM

Ah I see I morphed all those together (the platform naming).

> ... bummed to see them under test_runner...

Heh, that's funny. I was attempting to do a service of decluttering that top level test directory, which to me looks like somebody threw up a bunch of python code :-) I didn't anticipate it would have the opposite effect!

We can tackle that a few ways: make it a package, which accomplishes 99% of what I was trying to do --- getting things out of that top level lldb/test, *while* still allowing tests to pull in some of that functionality if they want. So it becomes something like:

    lldb/test/lldb_platform/
        __init__.py
        ...
        the_files_.py
        ...
        test/
            the tests for the package

And then whatever is in there could be loaded from an lldb test with:

    import lldb_platform.{whatever}

or:

    from lldb_platform import {whatever}

That seems totally reasonable to me. How does that sound?

> > Random thought: If you want to generate a core dump, we already have LLDB
> > attached to the process, so you have an SBProcess. Couldn't you use
> > Process.Halt, then call whatever method is necessary to have the debugger
> > create a core, then Process.Kill?
>
> I'm not sure I follow. Here's the process layout:
>
>   1. dotest.py parallel runner -> 2. worker thread/process -> 3. dotest.py inferior process -> 4. optional inferior test subject used by the lldb test
>
> Item 3 above is the process that needs to be killed and possibly core dumped. It happens to be a process that may have an optional inferior (item #4) being tested by the lldb test code running in #3. But we're trying to kill 3, and all of its children, and have 3 core dump. (We don't care about a core dump for #4.)
>
> If we wanted to get a core for #4, that would seem possibly useful, but that's not the core we're looking for.
>
> Did I follow your idea right?
>
> Ahh ok I thought it was a core of the inferior. My bad. Anyway that doesn't change anything about what I said earlier about not needing the soft terminate logic on Windows :)

Okay!

> Ah I see I morphed all those together (the platform naming).
>
> > ... bummed to see them under test_runner...
>
> Heh, that's funny. I was attempting to do a service of decluttering that top level test directory, which to me looks like somebody threw up a bunch of python code :-) I didn't anticipate it would have the opposite effect!
>
> We can tackle that a few ways: make it a package, which accomplishes 99% of what I was trying to do --- getting things out of that top level lldb/test, *while* still allowing tests to pull in some of that functionality if they want. So it becomes something like:
>
>     lldb/test/lldb_platform/
>         __init__.py
>         ...
>         the_files_.py
>         ...
>         test/
>             the tests for the package
>
> And then whatever is in there could be loaded from an lldb test with:
>
>     import lldb_platform.{whatever}
>
> or:
>
>     from lldb_platform import {whatever}
>
> That seems totally reasonable to me. How does that sound?

If it weren't for the fact that we have all these switches on os.name, sys.platform, and platform.system inside the tests as well as the test running infrastructure, then hiding it would be good. But yea, the test suite itself is still really fragile due to all the conditionals, so it would be great to be able to use these cross-platform helpers from anywhere in the test suite. Making it a package sounds good, but I don't think it needs to be done as part of this CL (or even immediately after) unless you really want to.

> If it weren't for the fact that we have all these switches on os.name, sys.platform, and platform.system inside the tests as well as the test running infrastructure, then hiding it would be good. But yea, the test suite itself is still really fragile due to all the conditionals, so it would be great to be able to use these cross-platform helpers from anywhere in the test suite. Making it a package sounds good, but I don't think it needs to be done as part of this CL (or even immediately after) unless you really want to.

Good - I'll hold off on that part until later.

tfiala updated this revision to Diff 35921.Sep 28 2015, 4:18 PM

Changes:

  • soft terminate is now optional by platform, not supported by default, and listed as supported on POSIX-y systems.
  • run with timeout now uses soft terminate only when supported.
  • soft terminate tests are now skipped when soft terminate isn't supported by the platform.
  • other issues called out should be fixed.

This change does not:

  • add the lldb_platform package. (I'll do that later.)
  • plumb through an option to control core file generation from the command line. (I'll do that later.) Right now it enables core file generation requests on Linux and nowhere else.

@zturner, can you give this a run?
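The platform-gated, two-tier terminate described in those bullets could look something like the following. This is a sketch under my own names, not the patch's code; SIGQUIT is the core-dumping soft terminate discussed earlier in the thread, and the 5-second grace period is an assumption:

```python
import os
import signal
import subprocess
import time

def soft_terminate_supported():
    # Per the diff notes: soft terminate is opt-in by platform and
    # listed as supported on POSIX-y systems only.
    return os.name == "posix"

def terminate_with_escalation(process, soft_timeout_seconds=5.0):
    """Try a core-dumping SIGQUIT first where supported, then SIGKILL."""
    if soft_terminate_supported():
        process.send_signal(signal.SIGQUIT)
        deadline = time.time() + soft_timeout_seconds
        while time.time() < deadline:
            if process.poll() is not None:
                return process.returncode
            time.sleep(0.05)
    # Soft terminate unsupported, or the inferior ignored it: hard kill.
    process.kill()
    return process.wait()
```

On Windows this degrades to a plain hard terminate, matching the "all kills will be hard terminate" behavior agreed on above.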

tfiala added inline comments.Sep 28 2015, 4:23 PM
test/test_runner/lib/process_control.py
553

Drats, this should be self.returncode. I'll be adjusting that. Not germane for your case, Zachary.

563

Same as above: should be testing against self.returncode. I saw a few times where this value differed across calls (i.e. it started as -{signal-number} after the first wait() call following the process's death, then reading it again in a print statement a few lines later returned 0). This would hit that same issue.

Granted this code block shouldn't ever get hit, since we don't ever try again after a hard kill attempt.

tfiala updated this revision to Diff 35925.Sep 28 2015, 4:28 PM

Fixed up my previous inline comments. I was caching the popen object's returncode returned from wait(), since it was showing unstable characteristics when reading after a successful wait(). But I failed to use it everywhere. Fixed that.
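The caching fix being described amounts to grabbing the value wait() returns at reap time instead of re-reading the Popen attribute later. A hypothetical sketch (my names, not the patch's; the POSIX negative-signal-number convention is the standard Popen behavior):

```python
import subprocess

def hard_kill_and_reap(process):
    """Hard-kill a child and cache the exit status wait() hands back.

    The review found that re-reading Popen.returncode later could
    yield a different value, so the value wait() returns is cached
    at reap time and used for all subsequent checks.
    """
    process.kill()
    returncode = process.wait()  # cache this; don't re-read the attribute later
    # On POSIX, a signal-terminated child reports -<signal number>.
    was_signaled = returncode < 0
    return returncode, was_signaled
```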

tfiala marked 6 inline comments as done.Sep 28 2015, 4:29 PM

Clearing out all fixed comments.

Hey Zachary,

Have you seen diff 5 or diff 6 on Windows yet? What results did you get? I think I'm just waiting on a clean Windows run at this point.

Thanks!

-Todd

Sorry got sidetracked this morning. I'll check it out now

Great, much appreciated, Zachary!

zturner added inline comments.Sep 29 2015, 12:47 PM
test/test_runner/lib/process_control.py
215

Should this return False now that this is not supported on Windows, or do you think this is still ok? It's probably just of theoretical concern.

230

Change this to:

    # It's not actually documented what return code Popen.terminate()
    # passes to TerminateProcess. Experimentation shows that it's always
    # 1, but there's no guarantee that's true. Hopefully this check
    # works well enough to be sufficient.
    return returncode != 0

With these changes I get the following output:

d:\src\llvm\tools\lldb\test\test_runner\test>python process_control_tests.py
....ss
----------------------------------------------------------------------
Ran 6 tests in 1.208s

OK (skipped=2)

Does that look right to you?

> Does that look right to you?

That result looks right - the two soft terminate tests get skipped when the platform reports soft terminate as unsupported.

If you run the lldb test suite with that setup, the tests should work as well as they did before the change. (If they don't, I'm missing a lower-level test.)

I'll fix up those other parts you called out and get what is hopefully the final patch up.

Unless you make significant other changes, feel free to just commit. I
don't have any other concerns. Thanks for working on this!

tfiala added inline comments.Sep 29 2015, 12:57 PM
test/test_runner/lib/process_control.py
215

Correct - this is not valid if soft terminate isn't supported. It should return False if soft terminate is not supported, and raise if it is supported but not overridden, I think.

230

> It's not actually documented what return code Popen.terminate()
> passes to TerminateProcess. Experimentation shows that it's always

This is checking the returncode as returned by wait() on the process. So we're checking the return value of popen_object.wait() here (which should be identical to popen_object.returncode immediately after wait() succeeds).

I think for the Windows case, we override this in the WindowsProcessHelper and have that return "True". (All kills will be hard terminate in this case). This is really only used by the test cases right now. Alternatively, we can change it to pass in the popen object, and have us write in a "did a hard kill on this" member variable on the Windows implementation. Then, we just return "True" if that is set, False otherwise.

For Windows, though, I can put in the returncode != 0 check. If the hard kill test works with that, it is fine. But I think it will also pass if a process returns a non-zero exit code on its own (i.e. the return from main() is non-zero).

tfiala updated this revision to Diff 36043.Sep 29 2015, 3:11 PM

Fixes for Windows per previous comments, for posterity's sake.

This is the patch I will be committing.

tfiala accepted this revision.Sep 29 2015, 3:21 PM
tfiala removed a reviewer: clayborg.
tfiala added a reviewer: tfiala.

Accepting with given changes.

This revision is now accepted and ready to land.Sep 29 2015, 3:21 PM
tfiala closed this revision.Sep 29 2015, 3:22 PM

Committed here:

svn commit
Sending        test/dosep.py
Adding         test/test_runner
Adding         test/test_runner/README.txt
Adding         test/test_runner/lib
Adding         test/test_runner/lib/lldb_utils.py
Adding         test/test_runner/lib/process_control.py
Adding         test/test_runner/test
Adding         test/test_runner/test/inferior.py
Adding         test/test_runner/test/process_control_tests.py
Transmitting file data ......
Committed revision 248834.