
[lit] Sort test start times based on prior test timing data
ClosedPublic

Authored by davezarzycki on Mar 8 2021, 7:12 AM.

Details

Summary

Lit as it exists today has three hacks that allow users to run tests earlier:

  1. An entire test suite can set the is_early boolean.
  2. A very recently introduced "early_tests" feature.
  3. The --incremental flag forces failing tests to run first.

All of these approaches have problems.

  1. The is_early feature was until very recently undocumented. Nevertheless, it still lacks testing and is an imprecise way of optimizing test start times.
  2. The early_tests feature requires manual updates and doesn't scale.
  3. --incremental is undocumented, untested, and it requires modifying the *source* file system by "touching" the file. This "touch"-based approach is arguably a hack: it confuses editors (the test appears to have been modified behind the editor's back), and "touching" the test source file doesn't work if the test suite is read-only from the perspective of lit (via advanced filesystem/build tricks).

This patch attempts to simplify and address all of the above problems.

This patch formalizes, documents, tests, and defaults lit to recording the execution time of tests and then reordering all tests during the next execution. By reordering the tests, high core count machines run faster, sometimes significantly so.

This patch also always runs failing tests first, which is a positive user-experience win for those who didn't know about the hidden --incremental flag.

Finally, if users want, they can _optionally_ commit the test timing data (or a subset thereof) back to the repository to accelerate bots and first-time runs of the test suite.
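The record-and-reorder scheme the summary describes can be sketched roughly as follows. This is an illustrative sketch, not the patch's actual code; in particular, where untimed tests land in the order is my guess:

```python
# Illustrative priority scheme: previous failures first, then tests
# with no recorded time (e.g. brand-new tests), then everything else
# slowest-first so long tests aren't left for the end of the run.
def reorder(test_names, prior_times, prior_failures):
    def key(name):
        if name in prior_failures:
            return (0, 0.0)          # failed last run: run first
        seconds = prior_times.get(name)
        if seconds is None:
            return (1, 0.0)          # unknown cost: run early
        return (2, -seconds)         # known cost: slowest first
    return sorted(test_names, key=key)

order = reorder(
    ["fast.txt", "slow.txt", "broken.txt", "new.txt"],
    {"fast.txt": 0.1, "slow.txt": 9.5, "broken.txt": 1.0},
    {"broken.txt"},
)
```

On a high core count machine, starting the slowest tests first is what shortens the wall-clock time of the whole run: the long tail of slow tests overlaps with everything else instead of running last.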

Diff Detail

Event Timeline


Nice!
I think this should also be used to improve the ETA time prediction.

Q: Where are these time files stored? I think it would not be good to store them next to the tests themselves.

There is a dot file named .lit_test_times.txt that lives in the output directory, a.k.a. the "exec_root" as far as lit is concerned. This same file can also *optionally* live alongside the lit.cfg files in the "source_root". In short, lit checks the exec_root first and then falls back to the source_root.

LGTM!
In general a bit more code documentation could be nice. It seems all obvious in the context of this patch, but someone reading the codebase may not find this as trivial as it seems.

llvm/utils/lit/lit/cl_arguments.py
157

Seems like you deprecated the option, can you update the doc to reflect it?

llvm/utils/lit/lit/discovery.py
144

Maybe add a if litConfig.debug: trace here?

davezarzycki marked an inline comment as done.Mar 9 2021, 3:26 AM
davezarzycki added inline comments.
llvm/utils/lit/lit/cl_arguments.py
157

I still can't find any documentation for --incremental. What am I missing?

I believe I've addressed all of the feedback to date.

I've also fixed the regression. I had hoped to avoid adding test_times as a parameter to a bunch of methods, but adding that dictionary to the test suite makes lit effectively single-threaded, as the worker threads fought over what I assume is Python's global interpreter lock.

Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 9 2021, 3:29 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
mehdi_amini added inline comments.Mar 9 2021, 11:13 AM
llvm/utils/lit/lit/cl_arguments.py
157

Sorry I wasn't clear: I meant the "help" field here.

yln added a comment.Mar 9 2021, 2:21 PM

Very happy with the patch! Thanks for all your hard work! :)
I have a few small nits and one ask: can we model "test failed" explicitly instead of making their execution time really large?

llvm/utils/lit/lit/discovery.py
141

Use os.path.join(...)

145–149

Use Python's with statement

llvm/utils/lit/lit/main.py
108

I think it would be sufficient here to pass in the tests that we attempted to run.

166

Really like all the simplifications this enabled! :)

210

Is it possible to directly pass the display callback here now (and remove the local progress_callback())?

267–268

Can we explicitly model this part in code instead of increasing the execution time of failed tests? We can use the test.result.code property.

271–272

Not really needed and can be removed, right?

274–283

Use Python's with, os.path, and string interpolation

llvm/utils/lit/tests/Inputs/reorder/.lit_test_times.txt
4

Clarifying my above comment:

PASS 3.0 subdir/ccc.txt
FAIL 2.0 bbb.txt
In D98179#2615130, @yln wrote:

can we model "test failed" explicitly instead of making their execution time really large?

I am strongly in favour of this, if it can be done. My team have wanted an option to rerun just failing tests, so being able to distinguish between those tests that failed and those that passed in the previous run would basically solve this (after a new option has been added to lit). When I say "failed" here, I actually mean XPASSes and failures that weren't XFAILs (plus UNRESOLVED etc). Running a subset of the lit testsuite would augment these results, rather than trash them (i.e. tests outside the subset would still be considered as failing until such time as they get run and then pass). I'm not asking for this retry mechanism to be implemented as part of that, but if this change can be made to work such that the retry feature is able to easily build on top of that, that would be great.

I believe I have addressed all of the feedback to date.

For people that care about easily identifying the failures from the previous testing run, that is now encoded via the sign of the test time (please remember that negative test times are impossible). For example:

3.14159 a-successful-test.txt
-2.71828 a-failed-test.txt
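The format above is trivial to decode; a sketch (not the patch's code) of reading one line back:

```python
def parse_time_line(line):
    """Decode one line of the timing file: a float, one space, then the
    test name to end of line. A negative time marks a test that failed
    on the previous run (real durations are never negative, so the
    sign is free to carry that bit)."""
    seconds_text, name = line.split(None, 1)
    seconds = float(seconds_text)
    return name.strip(), abs(seconds), seconds < 0
```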

I've also implemented but commented out how one might use this to implement --only-failures. Feedback would be appreciated but I don't want to derail this change proposal on what should be an independent change.

Finally, Python is not my native programming language, so thank you for helping me with idiomatic changes. What I'd really like help with, though, is figuring out a way to not pass the prior test timing data down as a new parameter to various methods. I tried to hang the data off the test suite data structure, but that had brutal performance side effects. It seemed like Python was hammering on the global interpreter lock despite the prior timing data being unused after the sorting phase. Weird. Help would be greatly appreciated. Thanks!

I've skimmed the implementation, and it looks good to me, but I haven't got the time right now to review thoroughly. I'm pleased to see the ease of implementing "only-failures" on top of this too, thanks for illustrating.

llvm/utils/lit/lit/main.py
272

file is a built-in Python type, so don't use it as a variable name. Just call it time_file or something like that.

I renamed the file variable as requested.

I also found a workaround to the performance problem I referred to earlier. Now no new parameter is required during test discovery and therefore the diff is both simpler and more effective because now programmatic test creators get test record-and-replay support for free.

davezarzycki removed 1 blocking reviewer(s): Restricted Project.Mar 10 2021, 7:30 AM

I'm downgrading libcxx as a reviewer to non-blocking because the latest patch no longer requires changes to their project.

jhenderson added inline comments.Mar 12 2021, 1:37 AM
llvm/utils/lit/lit/Test.py
216

I believe you don't need the readlines() part of this line - my understanding is that you can iterate over a file and it'll yield a line per iteration.

217

You can probably omit the first argument here, as demonstrated. That will split on the first sequence of whitespace.

llvm/utils/lit/lit/cl_arguments.py
154

I'd keep the old wording here and below (specifically "Start" -> "Run"). "Start" implies the tests might be started in a random order, but it may change mid run, which doesn't really make sense.

211

Maybe rather than "was ignored" use "is deprecated". Also "start" -> "run" as before.

llvm/utils/lit/lit/main.py
172

Perhaps worth a message with this assert to make it clear that the failure is due to an unknown TestOrder value.

270–271

Indentation here is inconsistent.

llvm/utils/lit/tests/reorder.py
2

I don't really understand what you mean by "starts" here. Do you mean something like "Check that we can change the test order."

Also, missing full stop.

I've made all of the requested changes to date. Two notes:

  1. One cannot simply omit the first argument to split() if the second is provided positionally. As I just learned, Python will complain. I switched the first parameter to None as a compromise.
  2. For the record, I disagree with the use of the word "run" over "start" but this isn't something that I care enough about to resist changing the patch. Strictly speaking, lit only controls when tests start, not anything about how they run. With enough cores, tests will complete in a fairly predictable order despite any effort to start tests in a different order (or randomly).
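The split() behaviour described in the first note can be demonstrated in a couple of lines; passing None positionally and using the keyword form are equivalent:

```python
# str.split's *first* parameter is the separator, so to cap the number
# of splits while still splitting on any run of whitespace you must
# pass the separator explicitly as None -- or use the keyword form.
s = "3.14 subdir/a b.txt"
fields_positional = s.split(None, 1)
fields_keyword = s.split(maxsplit=1)
```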

That's odd, the documentation even contains such an example:

For example:

>>> '1,2,3'.split(',')
['1', '2', '3']
>>> '1,2,3'.split(',', maxsplit=1)
['1', '2,3']
>>> '1,2,,3,'.split(',')
['1', '2', '', '3', '']
jhenderson added inline comments.Mar 12 2021, 3:31 AM
llvm/utils/lit/lit/Test.py
217

There's a typo in my example: it should be maxsplit=1 (with the s). You need to use the argument name as shown. Example usage from my local python instance:

>>> s = '123 456 789'
>>> s.split(maxsplit=1)
['123', '456 789']

I fixed the call to split as requested. Being unfamiliar with Python, I didn't know that it supported keyword arguments / parameter labels. Thanks.

Might somebody be willing to sign off on this change (this week or next)? I'd like to cherry-pick it to Swift's LLVM branch. Thanks for all the feedback so far.

jhenderson accepted this revision.Mar 15 2021, 8:38 AM

Might somebody be willing to sign off on this change (this week or next)? I'd like to cherry-pick it to Swift's LLVM branch. Thanks for all the feedback so far.

LGTM, but probably best to get someone else to confirm they're happy too.

llvm/utils/lit/lit/cl_arguments.py
159–161

Please delete this before pushing.

llvm/utils/lit/lit/main.py
74–75

Please delete this before pushing.

llvm/utils/lit/tests/shtest-shell.py
11

Nit

This revision is now accepted and ready to land.Mar 15 2021, 8:38 AM
yln accepted this revision.Mar 15 2021, 10:59 AM

Removed the commented-out only-failures code and fixed a nit before committing.

Thanks everybody for the various rounds of feedback. I'm quite happy with how this turned out.

This revision was landed with ongoing or failed builds.Mar 16 2021, 2:23 AM
This revision was automatically updated to reflect the committed changes.
jmorse added a subscriber: jmorse.Mar 16 2021, 5:55 AM

Hi,

I'm experiencing reorder.py failing intermittently on a Windows system -- it seems to pick up the .lit_test_times.txt produced in the build directory for the Inputs tests after the first run, and use that for ordering. And seeing how that just contains noise, the Inputs tests are then run in the wrong order. Is this supposed to be accounted for already?

(I'm trying to produce something that reliably passes, but I'm not familiar with this part of lit).

That implies a different testing environment. How different is the Windows testing environment compared to Unix? Do you specifically have a custom testing setup?

Said differently, the Unix side seems to always run the lit tests from a fresh/clean setup, even for incremental builds. If Windows tests lit differently then ya, things will blow up.

jmorse added a comment.EditedMar 16 2021, 7:11 AM

It turns out there's a plain old windows-has-different-slashes problem going on. In the Test object __init__ method, the call to os.sep.join in this patch produces the string "subdir\ccc.txt", which won't match the "subdir/ccc.txt" loaded from the .lit_test_times.txt file. That then leads to no time being loaded for ccc.txt, and it gets mis-ordered.

I think the stale .lit_test_times.txt thing was a slight red-herring, but it's because my usual workflow is:

  • run ninja check-llvm or similar to run all tests,
  • run llvm-lit directly on any tests that fail

And without running ninja, the "preparing lit tests" step isn't performed, meaning .lit_test_times.txt isn't cleaned up. This means none of the buildbots are going to fail from a stale file as they always run all the tests, but it'll be a bit awkward for anyone running lit manually IMO. Making it idempotent would be highly appreciated!

(Edit: I've no idea why the public buildbots aren't suffering the path separator issue, but it's what I see on our internal test rig)

Ah yes, of course. I've marked that test as unavailable on Windows for now: 61ca706461c5e1edc18526c9ddc3250fe074ed94

In the long run, we'll need to figure out a way for people to commit cross-platform trace files, but for now the more important record-and-reorder feature works cross-platform.

I've run out of time for today, but for the record, I think we should use slash as the canonical separator for the timing data file. The actual separator doesn't matter because these strings are never used as paths, just as keys into a dictionary. (Therefore any character would work.) This would let us harmonize the Unix and Windows timing data files.
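The canonicalization proposed here could be as small as the following sketch (the function name is made up). Since the strings are only dictionary keys, never real paths, mapping backslashes to slashes is enough:

```python
def canonical_key(relative_path):
    """Timing-file key for a test: always '/'-separated, so an entry
    recorded on Windows matches the same test on Unix and vice versa.
    These strings are never used as filesystem paths."""
    return relative_path.replace("\\", "/")
```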

yln added a comment.Mar 16 2021, 11:09 AM

... I think we should use slash as the canonical separator for the timing data file.

+1

nlopes added a subscriber: nlopes.Mar 19 2021, 3:37 PM

This patch makes the order of the list of failing tests non-deterministic. This is extremely annoying because you can't do a simple diff between test dumps anymore.
Before, the list of failed tests used to be sorted.

Can we revert to the previous behavior please? The current behavior is not user friendly. Thanks!

mehdi_amini added a comment.EditedMar 19 2021, 3:56 PM

Can we revert to the previous behavior please? The current behavior is not user friendly. Thanks!

To clarify: you care about the order in the final summary, not the actual execution order, right? (the goal of this patch is the latter, if it changes the former this is just a side-effect I believe)

How are you diffing? Copy-pasting the terminal output to a file? (I always pipe these kinds of lists through sort before diffing them.)

I don't love that the default test order was changed. For our use case that isn't good, but I can just delete the .lit_test_times.txt file and get the old fixed order. (A fixed order is good because of timeouts; also, slow tests often consume more memory in our setting, so it's not good to run all the slow tests at the same time.)
So I'm just referring to the final summary, yes.

How are you diffing? Copy-pasting the terminal output to a file? (I alway pipe these kind of list through sort before diffing them)

No, I'm running lit and dumping into a file. Sorting is not as easy as piping through sort, as I'm running with -vv (required). Anyway, sorting the list on lit's side should be one line for someone who knows where that line goes :)

No, I'm running lit and dumping into a file. sorting is not as easy as piping through sort, as I'm running with -vv (required).

Since tests run in parallel, don't you already have some non-determinism in the -vv output? Isn't it printing as tests finish? (Just curious, not pushing back on anything here.)

Anyway, sorting the list on lit's side should be one line for someone who knows where that line goes :)

Yeah I agree that sorting on the lit side for the summary should be fairly easy!

davezarzycki added a comment.EditedMar 20 2021, 3:45 AM

No, I'm running lit and dumping into a file. sorting is not as easy as piping through sort, as I'm running with -vv (required).

Since tests run in parallel, don't you already have some non-determinism in the -vv output? Isn't it printing as test finish? (just curious, not pushing back on anything here).

Right. Unless -j1 is passed to lit, the lit output is, strictly speaking, non-deterministic. That being said (and especially on a quiet machine), I do understand why people might feel that the output is consistent enough from run to run to be quasi-deterministic.

Anyway, sorting the list on lit's side should be one line for someone who knows where that line goes :)

Yeah I agree that sorting on the lit side for the summary should be fairly easy!

As far as I can tell, the lit architecture likes to "pay-as-you-go" with test output, so there isn't a good place to sort the output and I'd wager that a lot of people would be upset if the test output stalled until completion for sorting reasons.
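Sorting only the end-of-run failure summary avoids that stall: per-test results still stream as they complete, and only the short final list gets sorted. A minimal sketch of the idea (illustrative names, not lit's actual code):

```python
def summarize_failures(finished):
    """`finished` holds (test_name, passed) pairs in completion order,
    which varies from run to run; the returned failure list does not,
    so diffs of the summary between runs stay clean."""
    return sorted(name for name, passed in finished if not passed)

# Two runs finishing in different orders still summarize identically.
run_a = [("b.txt", False), ("a.txt", True), ("c.txt", False)]
run_b = [("c.txt", False), ("b.txt", False), ("a.txt", True)]
```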

If people feel that deleting the test timing data is onerous (and IMHO, I don't think that's a big ask), then I think the simplest solution is to add a --ignore-timing-data option to lit that ignores test timing data and gives people the old behavior (for whatever reason).

But I don't think the *default* lit behavior should be to ignore test failures and ignore test timing data. Both of these are about making the out-of-the-box experience for people more productive. If people have reasons to run tests in specific orders, then let's talk about those needs and why a lexical sort was convenient.

@nlopes are you talking about sorting *all* lit output, or the summary?

I don't love that the default test order was changed. For our use case that isn't good, but I can just delete the .lit_test_times.txt file and get the old fixed order. (the reason why having a fixed order is good is because of timeouts. Also, slow tests often consume more memory in our setting, so it's not good to run all the slow tests at the same time).

Can you elaborate more about your usage of timeouts? As far as I know, the feature exists only to detect wildly broken tests. If reordering the tests causes timeouts to fire, then either the timeout is too aggressive or the tests themselves were on the edge of failure anyway and any "unrelated" change could cause them to fail.

As to the resource management challenges, lit has an undocumented "parallelism group" feature that would probably help you keep resource usage in check and would be far more appropriate than lexical execution order.
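For reference, that feature is configured from a suite's lit config. The group name and limit below are made up, and the attribute names are from memory, so check them against lit's source before relying on this:

```python
# lit.cfg sketch (not standalone-runnable: lit injects lit_config and
# config into this file's namespace).
# Cap how many "memory-hungry" tests may run concurrently:
lit_config.parallelism_groups["memory-hungry"] = 2

# ...and opt tests into that group (e.g. in a lit.local.cfg):
config.parallelism_group = "memory-hungry"
```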

I'm talking about sorting just the summary of failed tests, not the whole output. We need the whole -vv output, but that can be out of order.

Why are timeouts important? Our use case is running Alive2 with the test suite. Alive2 is heavy machinery and runs into timeouts. Running the tests in roughly the same order every time is important to avoid timeout tests flipping to failed or vice-versa. Plus slow tests = tests that consume a lot of memory (in our scenario), so we can't bundle slow tests together.
Adding a --ignore-timing-data would be great, yes! But I still feel that sorting the list of failed tests is important for user experience. I diff these all the time.

I'm talking about sorting just the summary of failed tests, not the whole output. We need the whole -vv output, but that can be out of order.

Aha! +1 then, I think sorting the summary should be both easy and nice to have.

Agreed. Try: 5cbe2279f723f1cca1a542d95e7d9760e4f52240

Why are timeouts important? Our use case is running Alive2 with the test suite. Alive2 is heavy machinery and runs into timeouts. Running the tests in roughly the same order every time is important to avoid timeout tests flipping to failed or vice-versa. Plus slow tests = tests that consume a lot of memory (in our scenario), so we can't bundle slow tests together.
Adding a --ignore-timing-data would be great, yes! But I still feel that sorting the list of failed tests is important for user experience. I diff these all the time.

That still sounds incredibly brittle. If there is any variety in test machine performance, then any and all attempts at sorting should be futile because the underlying hardware will perturb different timeouts. Is this not your experience? How do you reconcile hardware performance and configuration details (like SMT) with timeout settings?

Of course it's brittle :) Changing from a time-based setting to a ticks-based system is ongoing work, such that resource exhaustion becomes deterministic.
Nevertheless, on the same machine, we don't see many test flips. It's quite stable most of the time (just one test flip once in a while).

This seems really beyond the scope and purpose of sorting the tests.

If you don't mind and given that the workaround is trivial (delete the timing data), I'd like to hold off on adding --ignore-timing-data. If enough people complain then we can add that option. Is that okay with you?

Sounds good. Thanks for adding support for the summary sort!

Something related to the time recording seems to fail intermittently on buildbots: https://lab.llvm.org/buildbot#builders/93/builds/2697

lebedev.ri added a comment.EditedApr 14 2021, 8:34 AM

Something related to the time recording seems to fail intermittently on buildbots: https://lab.llvm.org/buildbot#builders/93/builds/2697

I'll cut to the fix here: @davezarzycki, is there any reason why the timing data is stored in a CSV format and not JSON?
The fix is to switch to the latter.

I'll cut to the fix here: @davezarzycki, is there any reason why the timing data is stored in a CSV format and not JSON?
The fix is to switch to the latter.

What? Weird. The file isn't CSV. It's just a number followed by a space; everything after that, up to the newline, is the file name. Nothing fancy. Switching to JSON would probably just paper over the real problem, which seems to be either that a test sometimes has an empty string for a name, or that there is an edge case in Python's number printing that can produce pure whitespace. We'd need to look at the timing data file to see why it's bogus (or corrupt).
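For debugging a run like that, a defensive reader that collects malformed lines for inspection instead of crashing might look like this (a sketch, not lit's code):

```python
def load_test_times(lines):
    """Read timing lines of the form '<float> <test name>'. Malformed
    lines are collected rather than raising, so one corrupt entry
    can't break the run -- and can be examined afterwards."""
    times, malformed = {}, []
    for line in lines:
        parts = line.strip().split(None, 1)
        try:
            # RHS (float) is evaluated before the subscript assignment,
            # so a bad line never leaves a partial entry in `times`.
            times[parts[1]] = float(parts[0])
        except (IndexError, ValueError):
            if line.strip():             # skip blank lines silently
                malformed.append(line)
    return times, malformed
```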