HomePhabricator

Refactor test runner to print sub-test-case pass/fail rate.
AuditedrL238467

Description

Refactor test runner to print sub-test-case pass/fail rate.

Details

Auditors
dawn
Committed
zturnerMay 28 2015, 12:56 PM
Parents
rL238466: [Format] Skip creating temporary std::strings when filling another string.
Branches
Unknown
Tags
Unknown

Event Timeline

dawn raised a concern with this commit.Jun 23 2015, 12:14 PM
dawn added a subscriber: dawn.

(Note: this e-mail was sent to lldb-commits, but adding it to Phabricator so it doesn't get forgotten)

This commit appears to have caused 548 (=1342-794) tests to no longer run on OSX when run using dosep.py.

The lldb build/test commands from Jenkins script which run on Yosemite slave are as follows:

 cmake -G Ninja .. "-DLLVM_TARGETS_TO_BUILD=ARM;X86;AArch64" -DCMAKE_CXX_FLAGS="-std=c++11 -stdlib=libc++" -DCMAKE_BUILD_TYPE=Release
[...]
ninja install
[...]
security unlock-keychain -p testuser /Users/testuser/Library/Keychains/login.keychain
./dosep.py --options "-v --executable $INSTALLDIR/bin/lldb" 2>&1 | tee $INSTALLDIR/lldb_test_out.log || true

Test output from Jenkins build after above commit (at commit svn 238512/git 3c8acff997166aaaa6bcdad491b55fdd6c6446e8):

Ran 378 test suites (1 failed) (0.264550%)
Ran 794 test cases (5 failed) (0.629723%)
FAIL: LLDB (suite) :: TestMultithreaded.py (Darwin ngnu.embarcadero.com 14.3.0 Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64 i386)

... results in totals of:

lldb_failures=5
lldb_errors=0
lldb_total=794

Test output from previous build (at commit svn 238455/git 7e22c8e9b6b4ac7225fe1740cc6fe5221c27533d):

[...]
Ran 1 test in 0.008s
[...]
Ran 1 test in 0.007s
[...]
Ran 1 test in 0.069s
[...]

... using the following script commands to check results:

lldb_failures=`grep -E "^FAILED" lldb_test_out.log | grep failures | sed 's/.*(failures=\([0-9]*\).*/\1/' | awk '{count+=$1} END {print count}'` || true
lldb_errors=`grep -E "^FAILED" lldb_test_out.log | grep errors | sed 's/.*errors=\([0-9]*\).*/\1/' | awk '{count+=$1} END {print count}'` || true
lldb_total=`grep -E "^Ran [0-9]+ tests? in" lldb_test_out.log | awk '{count+=$2} END {print count}'`

... results in totals of:

lldb_failures=0
lldb_errors=0
lldb_total=1342

Trying to run the tests using dotest.py (via ./dotest.py -v --executable $INSTALLDIR/bin/lldb)
now hangs in OSX on:

464: test_create_after_attach_with_dsym (TestCreateAfterAttach.CreateAfterAttachTestCase)

I suspect this is a regression which went unnoticed due to the new dosep output, but it means we have to use dosep.py for running our tests until the hang is fixed.

Can you please have a look?

Thanks,
-Dawn

dawn raised a concern with this commit.Jul 10 2015, 11:32 AM

This is still an issue. We are still seeing a drop of ~550 in the total number of tests run since this commit. Please have a look? Thanks.

Zach (zturner) is on vacation for another week or two, but I can take a look at this on Monday.

Can you give an example of a specific test that's no longer running? I'm trying to figure out if there are actually tests being skipped or if it's just a reporting problem. On Windows, we're reporting a slightly higher number (369) than the sum of the "Ran n tests" lines (342).

So when the issue was first discovered (or posted about) back in June, it
was unfortunately on my last day in the office before leaving for 5 weeks
and I couldn't get to it before leaving. I'm back, but now that a bit of
time has passed, I'm having a hard time figuring out what the best thing to
do here is.

I'm quite confident that the drop in tests is simply a reporting issue, but
it's hard to be more concrete than this without having a before and after
comparison of the exact tests that were run before my patch, and the exact
tests that were run after my patch.

If I had to hazard a guess, I'd say the difference comes from skipped tests
being added to the lldb_total result now, but not before.

It looks like the problem here is that previously we relied on the
individual subprocesses of dotest.py to print their own stdout and stderr.
Now dosep gets the stdout and stderr through a pipe, and it doesn't always
print the full output of the process. So in some cases the line "Ran N
tests in M seconds" isn't appearing.

I can fix this, but I guess this is a good time to remind people of the
dangers of log scraping. I don't think we should guarantee stability of
the details of the test output format, except for perhaps a small summary
printed at the end of the test run. So I would think about getting
whatever numbers you need into this summary section, and then relying only
on that for your output.

(Resending this through Phab since my last "reply" to email went to the old list.

It looks like the problem here is that previously we relied on the individual subprocesses of dotest.py to print their own stdout and stderr. Now dosep gets the stdout and stderr through a pipe, and it doesn't always print the full output of the process. So in some cases the line "Ran N tests in M seconds" isn't appearing.

I can fix this, but I guess this is a good time to remind people of the dangers of log scraping. I don't think we should guarantee stability of the details of the test output format, except for perhaps a small summary printed at the end of the test run. So I would think about getting whatever numbers you need into this summary section, and then relying only on that for your output

dawn accepted this commit.Aug 24 2015, 2:15 PM

After some investigation, it appears your patch may have simply exposed an existing bug, so in one sense I owe an appology, but in another, your patch made that bug impossible to workaround. :) Before your change, it was possible to count the total tests run via adding up all the Ns in the lines:

Ran N tests in .*

and this would give the correct total. But after your change, these lines were no longer getting printed, forcing one to depend on the final count from:

Ran N test cases .*

Which is wrong, as I'll explain below. Below I've done a comparison between dosep and dotest on a narrowed subset of tests to show how dosep can omit the test cases from a test suite in its count.

Tested on subset of lldb/test with just the following directories/files (i.e. all others directories/files were removed):

    test/make
    test/pexpect-2.4
    test/plugins
    test/types
    test/unittest2
# The .py files kept in test/types are as follows (so test/types/TestIntegerTypes.py* was removed):
    test/types/AbstractBase.py
    test/types/HideTestFailures.py
    test/types/TestFloatTypes.py
    test/types/TestFloatTypesExpr.py
    test/types/TestIntegerTypesExpr.py
    test/types/TestRecursiveTypes.py

Tests were run in the lldb/test directory using the following commands:

dotest:
    ./dotest.py -v
dosep:
    ./dosep.py -s --options "-v"

Comparing the test case totals, dotest correctly counts 46, but dosep counts only 16:

dotest:
    Ran 46 tests in 75.934s
dosep:
    Testing: 23 tests, 4 threads ## note: this number changes randonmly
    Ran 6 tests in 7.049s
    [PASSED TestFloatTypes.py] - 1 out of 23 test suites processed
    Ran 6 tests in 11.165s
    [PASSED TestFloatTypesExpr.py] - 2 out of 23 test suites processed
    Ran 30 tests in 54.581s ## FIXME: not counted?
    [PASSED TestIntegerTypesExpr.py] - 3 out of 23 test suites processed
    Ran 4 tests in 3.212s
    [PASSED TestRecursiveTypes.py] - 4 out of 23 test suites processed
    Ran 4 test suites (0 failed) (0.000000%)
    Ran 16 test cases (0 failed) (0.000000%)

With test/types/TestIntegerTypesExpr.py* removed, both correctly count 16 test cases:

dosep:
    Testing: 16 tests, 4 threads
    Ran 6 tests in 7.059s
    Ran 6 tests in 11.186s
    Ran 4 tests in 3.241s
    Ran 3 test suites (0 failed) (0.000000%)
    Ran 16 test cases (0 failed) (0.000000%)

In rev.238454 (before your change), results didn't count the number of test
cases, but the test suite count is wrong. Running dosep on the above test
subset but with all tests in types (i.e. adding back TestIntegerTypes.py so we
have 5 tests in types), we get:

dosep:
    Ran 6 tests in 7.871s
    Ran 6 tests in 13.812s
    Ran 30 tests in 36.102s
    Ran 30 tests in 64.063s
    Ran 4 tests.

It seems now that, with dosep's -s option, we can once again see the output:

Ran N tests in .*

So counting the totals via:

./dosep.py -s --options "-v --executable $INSTALLDIR/bin/lldb" 2>&1 | tee test_out.log || true
export total=`grep -E "^Ran [0-9]+ tests? in" test_out.log | awk '{count+=$2} END {print count}'`

works once again.

BTW, what about tests that time out? I don't see where dosep will report any information about tests which time out.

Note: I couldn't compare the test counts on all the tests because of the concern raised in http://reviews.llvm.org/rL237053. That is, dotest can no longer complete the tests, as all test suites fail after test case 898: test_disassemble_invalid_vst_1_64_raw_data get ERRORs. I don't think that issue is related to problems in dosep, but I could be wrong.

Note also: When running dotest on earlier sources, it can hang on several tests.
To work around this, delete these tests from lldb/test:

rm -rf ./functionalities/thread/create_after_attach/TestCreateAfterAttach.py*
rm -rf ./functionalities/plugins/python_os_plugin/TestPythonOSPlugin.py*
rm -rf ./functionalities/unwind/sigtramp/TestSigtrampUnwind.py*
rm -rf ./test/macosx/queues/TestQueues.py*
rm -rf ./test/functionalities/inferior-crashing/TestInferiorCrashing.py*

In summary, dosep is unable to count the test cases correctly, but this problem existed before your change, and I'm happy that I'm able to use my workaround again. It would be nice to get that fixed someday, as well as see information about the tests that timed out.

Thanks,
-Dawn