This is an archive of the discontinued LLVM Phabricator instance.

[testsuite] Remove flakey test relying on `pexpect`
Abandoned · Public

Authored by davide on Jan 29 2018, 12:26 PM.

Details

Summary

This passes half of the time on my machine and fails the other half.
What this tests doesn't seem extremely valuable, and the fact that it relies on pexpect makes it slightly worse.
As I'm going to enable treating UNEXPECTED SUCCESSES as failures as soon as possible, I propose to remove this test (and the other tests we can't control).

Diff Detail

Event Timeline

davide created this revision. Jan 29 2018, 12:26 PM
davide added a subscriber: lldb-commits.
vsk added a comment. Jan 29 2018, 12:37 PM

What's the failure mode? Have we had any issues with this on the bots?

Generally I'm all for removing flaky tests, but I'd like to understand what makes this one flaky so we can avoid whatever it is in the future. In this case, we should be able to test this the way we test clang code completion, a la c-index-test.

jingham requested changes to this revision. Edited Jan 29 2018, 12:39 PM

There are a whole bunch of other tests that test completion in this file that use the exact same mechanism but don't seem to be flakey. Why is this one test flakey?

If for instance it's because "Fo" ends up being ambiguous because we chose too common a start string, then you could trivially fix the test by choosing a more uncommon name to complete on. But I'd want to know why this test is flakey first.

I also don't see why you say this test doesn't test something important. The ability to auto-complete symbol names is pretty important to command-line lldb users. If anything we should have more tests of the symbol completer...

This revision now requires changes to proceed. Jan 29 2018, 12:39 PM

There are a whole bunch of other tests that test completion in this file that use the exact same mechanism but don't seem to be flakey. Why is this one test flakey?

If for instance it's because "Fo" ends up being ambiguous because we chose too common a start string, then you could trivially fix the test by choosing a more uncommon name to complete on. But I'd want to know why this test is flakey first.

OK, I'll take a look at this further.

There are a whole bunch of other tests that test completion in this file that use the exact same mechanism but don't seem to be flakey. Why is this one test flakey?

If for instance it's because "Fo" ends up being ambiguous because we chose too common a start string, then you could trivially fix the test by choosing a more uncommon name to complete on. But I'd want to know why this test is flakey first.

I also don't see why you say this test doesn't test something important. The ability to auto-complete symbol names is pretty important to command-line lldb users. If anything we should have more tests of the symbol completer...

Apologies, I was thinking of another test (happens when there are multiple UNEXPECTED SUCCESSES ;)

This seems to pass pretty reliably locally. I'd go for un-XFAILing this unless there are objections.

davide added a comment. Edited Jan 29 2018, 3:06 PM

There are a whole bunch of other tests that test completion in this file that use the exact same mechanism but don't seem to be flakey. Why is this one test flakey?

So, I took a look at this to answer your question. They're all flakey.
Most of them fail non-deterministically depending on the load of the system. I wrote a driver that randomly spawns threads spinning in loops, checked out multiple copies of lldb, and ran the test suites in parallel.
Many of the completion tests fail randomly.

As an example, here is a test that just failed on my machine because of a timeout (non-deterministic; it happens roughly 1 in 50 runs on my machine):

TIMEOUT: test_step_over_3_times_dsym (python_api/thread/TestThreadAPI.py)
[TestThreadAPI.py FAILED] (TIMEOUT)
Command invoked: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /Users/dcci/work/llvm/llvm/tools/lldb/test/dotest.py -q --arch=x86_64 --executable /Users/dcci/work/llvm/build/bin/lldb -s /Users/dcci/work/llvm/build/lldb-test-traces -S nm -u CXXFLAGS -u CFLAGS -C /Users/dcci/work/llvm/build/bin/clang --codesign-identity lldb_codesign --server /Users/dcci/work/llvm/build/bin/debugserver --results-port 52049 -S nm --inferior -p TestThreadAPI.py /Users/dcci/work/llvm/llvm/tools/lldb/packages/Python/lldbsuite/test --event-add-entries worker_index=6:int

I'll probably go ahead and either unXFAIL or SKIP the tests that I find to be unreliable because of pexpect. Having tests that pass most of the time under load, but not always, is quite unfortunate, in particular now that we want to enable continuous integration for downstream projects using lldb.

Best,

Davide

If we just need to test completion, write a lit-style test that uses lldb-test that looks like this:

RUN: lldb-test complete --target=%T/foo --complete_str=MyPrefix | FileCheck %s

CHECK: Foo::Bar
CHECK: Foo::Baz
etc

Simple and not flaky

So that sounds like a pexpect problem. We have seen that pexpect-based tests tend to be flakey, and in particular will time out on loaded systems, as you are seeing. But I really don't think there are very many cases where we need to use pexpect.

For instance, all the completion tests could use SBCommandInterpreter::HandleCompletion. It is a pretty direct route into the code that the CommandInterpreter uses to handle tab completion, so you wouldn't change what you are testing, and the tests would be much more stable because you're just calling an API rather than scraping stdout from a sub-launched lldb.
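
A minimal sketch of what such an in-process test could look like (the class name, the "pl" prefix, and the expected "platform" match are illustrative assumptions, not part of this patch or the existing test suite):

import lldb
from lldbsuite.test.lldbtest import TestBase

class CompletionViaSBAPITestCase(TestBase):
    mydir = TestBase.compute_mydir(__file__)

    def test_command_completion_via_sb_api(self):
        """Ask the command interpreter for completions in-process, no pexpect."""
        interp = self.dbg.GetCommandInterpreter()
        matches = lldb.SBStringList()
        current_line = "pl"  # hypothetical prefix; "platform" should be among the matches
        # HandleCompletion(current_line, cursor_pos, match_start_point,
        #                  max_return_elements, matches) returns the number of matches;
        # the first element of matches is the common prefix, the individual matches follow.
        num_matches = interp.HandleCompletion(current_line, len(current_line), 0, -1, matches)
        self.assertTrue(num_matches > 0)
        completions = [matches.GetStringAtIndex(i) for i in range(matches.GetSize())]
        self.assertTrue(any("platform" in c for c in completions))

Because everything happens in-process, there is no sub-launched lldb to time out under system load.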

We do need to add more command-line parsing tests as well, but it would be straightforward to write a little unit test harness that feeds in command lines and checks the parsed version of the command (before it gets passed to DoExecute) without actually executing the command.

I would rather not just xfail all these pexpect tests; I worry they will just stay that way. It would be better if we took the time to rewrite them using more stable testing methods.

There are SB APIs that call into the completion mechanism, so you could also just change the TestCompletion.complete_from_to test method to call the SB completion API. Given how the test is written, it looks like all you would have to do is reimplement that method, and all the tests would be on a good footing.
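
A rough sketch of how that method could be reimplemented on top of the SB API (this is illustrative only: it assumes it lives on the existing TestCompletion class, where lldb is already imported and self.dbg is the test's debugger; the real method's signature may differ):

    def complete_from_to(self, str_input, patterns):
        """Check completions via SBCommandInterpreter instead of a pexpect'ed lldb."""
        interp = self.dbg.GetCommandInterpreter()
        matches = lldb.SBStringList()
        interp.HandleCompletion(str_input, len(str_input), 0, -1, matches)
        completions = [matches.GetStringAtIndex(i) for i in range(matches.GetSize())]
        if not isinstance(patterns, list):
            patterns = [patterns]
        for pattern in patterns:
            self.assertTrue(any(pattern in c for c in completions),
                            "'%s' not found in completions %s" % (pattern, completions))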

Spinning up a process just to test that auto-completion works seems a little unnecessary, though, and introduces the possibility that flakiness and bugs in the process spawn mechanism (if any exist) will manifest as test failures. It would be nice, if and when this test fails, to have some confidence that the failure is related to auto-completing symbol names.

If we just need to test completion, write a lit-style test that uses lldb-test that looks like this:

RUN: lldb-test complete --target=%T/foo --complete_str=MyPrefix | FileCheck %s

CHECK: Foo::Bar
CHECK: Foo::Baz
etc

Simple and not flaky

That sounds like a great idea. I'll work on something like that in case this turns out to be unreliable (or, independently, when I get a chance :)

In the meantime, I re-enabled the tests in rL323707 to see whether they fail on the bots.
FWIW, I agree that spinning up an instance just to test autocompletion is a bit of a sledgehammer.

Best,

Davide

There would be no spinning up an instance; it would just call the API in Python. No extra process, everything done in-process.

Yes. We do need to have symbols to do symbol completion, which does require a binary, but you don't need to run it. Most of the other tests in there (e.g. simple command completion) should be able to work without even a binary. It seems weird to add a potentially fallible lldb-test layer on top of SBCommandInterpreter.HandleCompletion just so you can send text input through lldb-test rather than directly sending text input to SBCommandInterpreter.HandleCommand.

jingham added a comment. Edited Jan 29 2018, 5:28 PM

lldb testcases are known not to be flakey if they don't use pexpect, which these wouldn't. The setup machinery for running a dotest based test is pretty well tested at this point.

And the lldb-test test would not just magically come into being by writing the lit-form text you suggested. You would have to write an lldb-test function that marshals the input (both --complete-string and a cursor position, which you will also need in order to test completion inside a line). That's new and untested code, whereas the dotest test relies on the API it is directly testing and trusts that the basic machinery of dotest will continue to function.

davide abandoned this revision. Apr 2 2018, 9:36 AM

I think this is obsolete by now.