This is an archive of the discontinued LLVM Phabricator instance.

Un-revert "Teach the CMake build system to run lit's test suite. These can be run"
ClosedPublic

Authored by modocache on Jul 25 2017, 10:29 PM.

Details

Summary

Depends on https://reviews.llvm.org/D35879.

This reverts rL257268, which in turn was a revert of rL257221.
https://reviews.llvm.org/D35879 marks the tests in the lit test suite
that fail on Windows as XFAIL, which should allow these tests to pass
on Windows-based buildbots.

Event Timeline

modocache created this revision.Jul 25 2017, 10:29 PM
delcypher edited edge metadata.Jul 26 2017, 2:11 AM

Other than criticisms of my own code :p LGTM.

utils/lit/CMakeLists.txt
11

Minor nit (criticising my own code here :p). Should be a single `, not double.

utils/lit/tests/lit.cfg
32

Minor nits (criticising my own code here :p). Should be a single `, not double.

mgorny accepted this revision.Jul 26 2017, 4:39 AM

The code looks good, and seems to work. I'm sometimes getting a failure from shtest-timeout.py but I suspect that's an irrelevant race condition.

This revision is now accepted and ready to land.Jul 26 2017, 4:39 AM

That race condition may indeed exist, in rL257268 @delcypher mentions one reason for the revert is "the lit per test timeout tests fail." Perhaps related? Do you have more information about when the test fails?

modocache updated this revision to Diff 108274.Jul 26 2017, 7:22 AM

Use single ticks, not double

jroelofs edited edge metadata.Jul 26 2017, 7:24 AM

The code looks good, and seems to work. I'm sometimes getting a failure from shtest-timeout.py but I suspect that's an irrelevant race condition.

How is it failing? If it's a race condition, we should fix the test.

@modocache

That race condition may indeed exist, in rL257268 @delcypher mentions one reason for the revert is "the lit per test timeout tests fail." Perhaps related? Do you have more information about when the test fails?

The test failure I referred to there was fixed by r257616. It was an API problem with the psutil module, which is used to implement the timeout.

@jroelofs

How is it failing? If it's a race condition, we should fix the test.

Agreed. I suspect the issue might be with running test/Inputs/shtest-timeout/quick_then_slow.py and test/Inputs/shtest-timeout/slow.py, because their hardcoded sleep times could be fragile.

TBH it's been so long since I looked at this that I can't remember the exact reason why quick_then_slow.py and slow.py exist. There's a good reason to test with internal/external shell because the per test timeout is implemented differently in those cases but I'm drawing a blank on why I added quick_then_slow.py and slow.py. Looking at http://reviews.llvm.org/D14706 doesn't seem to tell me why...

Well, I can't reproduce that one now but I've got another one. An easy way to reproduce:

./utils/lit/lit.py -v utils/lit/tests/max-failures.py utils/lit/tests/shtest-shell.py

(or likewise run the two tests simultaneously)

I suspect it's because they call lit recursively, and they both use the same input directory, i.e. two instances of lit are run on the same test suite concurrently.

@mgorny The problem here is that %t is expanded in a deterministic way based on a test filename.
This makes running the same test concurrently prone to race conditions.
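The collision being described can be sketched roughly like this (a simplified illustration, not lit's actual substitution code; `tmp_base_for` and the `exec_root` default are hypothetical names for this sketch):

```python
import os

def tmp_base_for(test_path, exec_root="/tmp/lit-out"):
    # Derive the temp name purely from the test's filename, roughly as
    # lit's %t substitution does (details simplified here).
    name = os.path.basename(test_path)
    return os.path.join(exec_root, name + ".tmp")

# Two concurrent lit invocations running the same test compute the
# exact same path, so both write to and read from the same file.
a = tmp_base_for("tests/max-failures.py")
b = tmp_base_for("tests/max-failures.py")
assert a == b
```

With a deterministic mapping like this, two instances of lit run over the same test suite will clobber each other's temp files, which matches the failure mode seen when max-failures.py and shtest-shell.py run lit recursively on the same inputs.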

One way to fix it would be to create a true temporary file instead:

diff --git a/utils/lit/lit/TestRunner.py b/utils/lit/lit/TestRunner.py
index a60a0f8..40f8c71 100644
--- a/utils/lit/lit/TestRunner.py
+++ b/utils/lit/lit/TestRunner.py
@@ -817,7 +817,8 @@ def getDefaultSubstitutions(test, tmpDir, tmpBase, normalize_slashes=False):
     substitutions = []
     substitutions.extend([('%%', '#_MARKER_#')])
     substitutions.extend(test.config.substitutions)
-    tmpName = tmpBase + '.tmp'
+    fd, filename = tempfile.mkstemp()
+    tmpName = filename
     baseName = os.path.basename(tmpBase)
     substitutions.extend([('%s', sourcepath),
                           ('%S', sourcedir),

(and we would have to import tempfile at the top of the file, and delete the file later).

This could also be undesirable, since it would make test behavior harder to reproduce.
Again, an easy solution would be to add a flag and an environment variable to switch between the two modes (temporary deleted file vs. persistent file).
This might also be desirable for CI, which might be running multiple jobs in parallel on a single checkout.
@ddunbar any comments?

A much easier fix would be to change lit tests to use mktemp instead of %t (then the test would need to be responsible for the file cleanup).

@george.karpenkov

@mgorny The problem here is that %t is expanded in a deterministic way based on a test filename.
This makes running the same test concurrently prone to race conditions.

One way to fix it would be to create a true temporary file instead:
<snip>
This could also be undesirable, since it would make test behavior harder to reproduce.
Again, an easy solution would be to add a flag and an environment variable to switch between the two modes (temporary deleted file vs. persistent file).

Wouldn't it be simpler to just use tempfile.mkstemp() and specify the prefix argument to be what the deterministic file name used to be?

That way we get a temporary file name but it starts with a prefix that a human can easily recognise.
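That suggestion could look roughly like this (a sketch under the assumptions above, not lit's actual code; `make_tmp_name` is a hypothetical helper name):

```python
import os
import tempfile

def make_tmp_name(tmp_base):
    # Keep the old deterministic name as a prefix so a human can still
    # associate the file with its test, but let mkstemp append a unique
    # suffix to avoid races between concurrent lit invocations.
    base_name = os.path.basename(tmp_base)
    fd, path = tempfile.mkstemp(prefix=base_name + ".")
    os.close(fd)  # only the name is needed here; the test writes the file
    return path

name = make_tmp_name("/build/tests/shtest-shell.py")
# name is e.g. ".../shtest-shell.py.<unique suffix>"
```

Note that mkstemp actually creates the file on disk, so someone would still have to own cleanup, and the created-file semantics are exactly what causes the `mkdir -p %t` problem mentioned below.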

There's another problem with this proposal though. In a project that I work on that depends on lit we use

RUN: mkdir -p %t

Your proposed change to the semantics of %t would break all of these tests.
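The breakage is easy to demonstrate in isolation: mkstemp creates the path as a regular file, and `mkdir -p` (like os.makedirs) fails when the target already exists as a file. A small self-contained illustration of that failure mode, not lit itself:

```python
import os
import tempfile

# mkstemp creates the path as a regular file on disk...
fd, path = tempfile.mkstemp()
os.close(fd)

# ...so anything that expects the expanded name to be usable as a
# directory, the way `RUN: mkdir -p %t` does, now fails.
try:
    os.makedirs(path, exist_ok=True)
    outcome = "ok"
except FileExistsError:
    outcome = "mkdir failed: path already exists as a file"
finally:
    os.remove(path)
```

This is why changing what %t expands to is a semantic change for downstream suites, not just an internal implementation detail.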

modocache added a subscriber: rnk.Jul 27 2017, 7:56 AM

An easy way to reproduce:

./utils/lit/lit.py -v utils/lit/tests/max-failures.py utils/lit/tests/shtest-shell.py

Looks like @rnk fixed this in rL309227. I'll try landing this revision today, and we can see which, if any, of the buildbots still break compared to when this was first landed a year ago. Does that sound good?

Hmm, when run locally on my macOS machine, max-failures.py and shtest-shell.py are now failing, so those will definitely need to be fixed before this lands. I'm on rL309282.

I sent up D35947 to fix the new failures. Then I think I'll try to land this and get lit tests running continuously, unless anyone has any objections. :)

rnk added inline comments.Jul 27 2017, 9:47 AM
utils/lit/CMakeLists.txt
10

Does this track dependencies on the original source files so that changes to them are reflected in check-lit? If I touch one test file, how much time does this take, especially on NTFS?

Maybe we should do this the way that the LLVM tests do this, which is to configure each lit.site.cfg.in in llvm/test and have it point back at the source tree. We'd have to do this for everything in llvm/utils/lit/tests/Inputs, of course.

delcypher added inline comments.Jul 27 2017, 10:09 AM
utils/lit/CMakeLists.txt
10

Does this track dependencies on the original source files so that changes to them are reflected in check-lit? If I touch one test file, how much time does this take, especially on NTFS?

I'm not sure what "this" is referring to here. In the current implementation the copy happens every time the check-lit target is executed, so the files are always up to date before running the tests. To avoid doing the copy in the future we could try ${CMAKE_COMMAND} -E copy_if_different instead, but I don't know how that's implemented, so it might not be any faster.

Maybe we should do this the way that the LLVM tests do this, which is to configure each lit.site.cfg.in in llvm/test and have it point back at the source tree.

I'm not sure what you mean by that. Do you mean the fact that for LLVM's main lit tests config.test_source_root and config.test_exec_root are different? I don't feel great about that, because it means the build system has to do a bunch of configuring, which ties running the tests to the build system. Currently the tests can be run completely independently of the build system, and that's important because lit is not tied to LLVM and has users outside of LLVM and its sub-projects.

rnk added inline comments.Jul 27 2017, 10:28 AM
utils/lit/CMakeLists.txt
10

My concern is that recursively copying llvm/utils/lit/tests/Inputs on every check-lit will be too slow. It already takes 20s on my machine. Python tests are supposed to be fast. =P

We definitely want to be able to run the lit tests without running cmake. I was imagining that every lit.cfg in Inputs would have a corresponding lit.site.cfg.in that gets configured by cmake to fill in config.test_source_root and config.test_exec_root as appropriate.

I don't want to let the perfect be the enemy of the good, though, so if you don't think it's worth it, let's commit this and get this running as part of check-all again.

mgorny added inline comments.Jul 27 2017, 11:04 AM
utils/lit/CMakeLists.txt
10

Amen to that! Also, I think copying the test files implies that if we remove some of them, they will remain in the build dir.

I am definitely a fan of turning these on first, then optimizing to reduce the time it takes for them to run. For the record, with this patch applied locally, check-all takes 681.54 real, 3349.92 user, 614.90 sys, whereas before it took 667.35 real, 3293.70 user, 600.09 sys. I hope that's an acceptable price to pay for keeping lit's own tests passing, at least until we can optimize them to run faster.

I'll land this now, and monitor what happens for the next few hours. I'll revert it if related failures pop up. Fingers crossed! :)

modocache closed this revision.Jul 27 2017, 12:19 PM

This Windows buildbot failure could potentially be related: http://lab.llvm.org:8011/builders/llvm-clang-x86_64-expensive-checks-win/builds/4046. The build failed due to a timeout. That being said, the bot *is* named "expensive-checks", so maybe a timeout is to be expected...? I'll keep monitoring builds for now.

rnk added inline comments.Jul 28 2017, 1:41 PM
utils/lit/CMakeLists.txt
10

It turns out that this approach doesn't work well with incremental builds. The modules buildbot is red right now because copy_directory doesn't delete stale files in the tree: http://lab.llvm.org:8011/builders/clang-x86_64-linux-selfhost-modules/builds/7955

In the googletest-format test, we end up with both DummySubDir/OneTest and DummySubDir/OneTest.py files in the test suite, and we get some extra unexpected failures.

I'll try to implement the approach I suggested and see if it's any good.

delcypher added inline comments.Jul 28 2017, 2:10 PM
utils/lit/CMakeLists.txt
10

@mgorny

Amen to that! Also, I think copying the test files implies that if we remove some of them, they will remain in the build dir.

Good point. I hadn't thought of that. A quick hack to make things green again would be to run

${CMAKE_COMMAND} -E remove_directory "${CMAKE_CURRENT_BINARY_DIR}/tests"

But really we need to investigate the approach suggested by @rnk.

rnk added inline comments.Jul 28 2017, 2:20 PM
utils/lit/CMakeLists.txt
10

I went ahead and landed the cmake -E remove_directory idea in rL309432. Hopefully that'll get things green while I try the site config thing.