
[lit] Generalize `early_tests`
Abandoned · Public

Authored by davezarzycki on Feb 19 2021, 5:33 AM.

Details

Summary

Building on the "early_tests" feature: some test suites have fast, low-reward tests that should be run later. For example, Swift has a ton of tests that were generated by a fuzzer; they are fast and not expected to regress, but are still useful to verify. By running these tests later, we give normal tests (which probably take longer) a chance to run earlier and let the fast "late_tests" fill in the gaps at the end of the test suite execution.

The design of this change is general enough that people should be able to scale their use with their maintenance goals/non-goals.

Diff Detail

Unit Tests: Failed

Time   Test
40 ms  x64 debian > Flang.Semantics::resolve102.f90
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/flang/test/Semantics/test_errors.sh /mnt/disks/ssd0/agent/llvm-project/flang/test/Semantics/resolve102.f90 /mnt/disks/ssd0/agent/llvm-project/build/tools/flang/test/Semantics/Output/resolve102.f90.tmp /mnt/disks/ssd0/agent/llvm-project/build/bin/f18 -intrinsic-module-directory /mnt/disks/ssd0/agent/llvm-project/build/tools/flang/include/flang

Event Timeline

davezarzycki created this revision.Feb 19 2021, 5:33 AM
davezarzycki requested review of this revision.Feb 19 2021, 5:33 AM

Could you clarify more what the benefit of running the tests later actually is? Is this intended to be useful in conjunction with something else? On the face of it, I'd expect people to run the full testsuite to completion, and then observe the full results. Ultimately, it doesn't matter whether a test runs first or last in this order for that case.

Could you clarify more what the benefit of running the tests later actually is? Is this intended to be useful in conjunction with something else? On the face of it, I'd expect people to run the full testsuite to completion, and then observe the full results. Ultimately, it doesn't matter whether a test runs first or last in this order for that case.

You want long-running tests to run first and shorter ones last, because the short ones will be used to balance out the imbalance left by the long-running tests. It's just to maximize parallelism. If you have some medium-to-big tests running at the end, the risk is that you are waiting for one core to finish while all the others sit idle. If you keep lots of small tests at the end, the idle cores will pick from them and hopefully reduce the imbalance.
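This load-balancing point can be illustrated with a tiny, purely illustrative scheduling simulation (this is not lit's actual scheduler; the numbers are made up): with one slow test and many fast ones, scheduling the slow test last leaves one core busy at the end while the others sit idle.

```python
import heapq

def makespan(durations, workers):
    """Simulate a simple pool: each idle worker pulls the next test from
    the queue in order; return the total wall-clock time."""
    heap = [0.0] * workers        # per-worker finish times
    heapq.heapify(heap)
    for d in durations:
        t = heapq.heappop(heap)   # the earliest-free worker picks up the test
        heapq.heappush(heap, t + d)
    return max(heap)

# One slow test plus sixteen fast ones, on 4 workers.
tests = [8.0] + [1.0] * 16
slow_first = makespan(sorted(tests, reverse=True), 4)  # slow test scheduled early
slow_last = makespan(sorted(tests), 4)                 # slow test scheduled last
```

With the slow test first, the fast tests fill in around it (makespan 8); with it last, everything else finishes and one core runs alone (makespan 12).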

Could you clarify more what the benefit of running the tests later actually is? Is this intended to be useful in conjunction with something else? On the face of it, I'd expect people to run the full testsuite to completion, and then observe the full results. Ultimately, it doesn't matter whether a test runs first or last in this order for that case.

You want long-running tests to run first and shorter ones last, because the short ones will be used to balance out the imbalance left by the long-running tests. It's just to maximize parallelism. If you have some medium-to-big tests running at the end, the risk is that you are waiting for one core to finish while all the others sit idle. If you keep lots of small tests at the end, the idle cores will pick from them and hopefully reduce the imbalance.

I get the idea of running long tests early, but from observation, the vast majority of lit tests in the LLVM test suite are very similar in length, lasting less than a second. As such, it doesn't make any significant difference whether they are run late, or in the middle (as long as the long ones are run early). I would assume the same applies as much for these Swift tests as any other equivalent ones. I could see a case if we're suggesting that this feature is for a downstream test suite that is mostly comprised of long and slow tests (i.e. ones that take several seconds each), so that it's easier to mark the ones that are quick, but I don't think that's the case here?

Could you clarify more what the benefit of running the tests later actually is? Is this intended to be useful in conjunction with something else? On the face of it, I'd expect people to run the full testsuite to completion, and then observe the full results. Ultimately, it doesn't matter whether a test runs first or last in this order for that case.

You want long-running tests to run first and shorter ones last, because the short ones will be used to balance out the imbalance left by the long-running tests. It's just to maximize parallelism. If you have some medium-to-big tests running at the end, the risk is that you are waiting for one core to finish while all the others sit idle. If you keep lots of small tests at the end, the idle cores will pick from them and hopefully reduce the imbalance.

I get the idea of running long tests early, but from observation, the vast majority of lit tests in the LLVM test suite are very similar in length, lasting less than a second. As such, it doesn't make any significant difference whether they are run late, or in the middle (as long as the long ones are run early). I would assume the same applies as much for these Swift tests as any other equivalent ones. I could see a case if we're suggesting that this feature is for a downstream test suite that is mostly comprised of long and slow tests (i.e. ones that take several seconds each), so that it's easier to mark the ones that are quick, but I don't think that's the case here?

I don’t think late_tests is useful to every lit test suite. That’s why I gave the Swift example in the description. And yes, this is just about maximizing concurrency. In particular, lit runs tests in lexical order so the “compiler_crasher” directories sort before most other tests. I’m not at my computer but my memory is that running the ultra-fast compiler crasher tests last to fill in the otherwise idle cores makes the whole test suite about 5% faster.

Could you clarify more what the benefit of running the tests later actually is? Is this intended to be useful in conjunction with something else? On the face of it, I'd expect people to run the full testsuite to completion, and then observe the full results. Ultimately, it doesn't matter whether a test runs first or last in this order for that case.

You want long-running tests to run first and shorter ones last, because the short ones will be used to balance out the imbalance left by the long-running tests. It's just to maximize parallelism. If you have some medium-to-big tests running at the end, the risk is that you are waiting for one core to finish while all the others sit idle. If you keep lots of small tests at the end, the idle cores will pick from them and hopefully reduce the imbalance.

I get the idea of running long tests early, but from observation, the vast majority of lit tests in the LLVM test suite are very similar in length, lasting less than a second. As such, it doesn't make any significant difference whether they are run late, or in the middle (as long as the long ones are run early). I would assume the same applies as much for these Swift tests as any other equivalent ones. I could see a case if we're suggesting that this feature is for a downstream test suite that is mostly comprised of long and slow tests (i.e. ones that take several seconds each), so that it's easier to mark the ones that are quick, but I don't think that's the case here?

I don’t think late_tests is useful to every lit test suite. That’s why I gave the Swift example in the description. And yes, this is just about maximizing concurrency. In particular, lit runs tests in lexical order so the “compiler_crasher” directories sort before most other tests. I’m not at my computer but my memory is that running the ultra-fast compiler crasher tests last to fill in the otherwise idle cores makes the whole test suite about 5% faster.

So back to my earlier question on this - is it possible to simply mark the other tests as early tests, or are there too many of them for that to be practical? If it's impractical, that's fine and we can proceed. Also, what's the approximate duration of the slower tests (per test) compared to these fast tests? I'm just trying to get a picture of what is going on, that's all, before I sign off on another feature.

It is both impractical and unwise to mark all of the non-compiler-crasher tests in Swift as "early". It's impractical because that's most of the tests. It's unwise because having more than a few tests in the early stage negates the benefit of the early stage itself.

As to test timing, I don't think the specific results matter as much as the patterns that emerge. On a high-end workstation from 2017, it's not hard to be in a scenario where the "total test time" is close to the "slowest individual test time". This is true for LLVM and Swift. I haven't checked clang but I'd expect that to be true too. That's why it matters a lot to run the top five or so slowest tests early.

Now, in the specific case of Swift, we can further speed up the overall testing time by running easily identifiable fast tests "late". This works because there are a lot of "medium slow" tests (far too many to add to the "early_tests" phase) and because lit sorts the tests and "compiler_crasher" happens to sort early. For Swift, there are 6000+ fast (and low reward) tests spread across four directories: compiler_crashers, compiler_crashers_2, compiler_crashers_2_fixed, and compiler_crashers_fixed. So with one simple regular expression, we can get thousands of fast/low-reward tests to run late. This is a small performance win for fast machines and a huge user experience win for slower machines.
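The "one simple regular expression" approach described above could be sketched like this (the pattern and helper function are hypothetical illustrations, not the actual patch):

```python
import re

# Hypothetical sketch: run tests whose path matches a "late" pattern after
# everything else, preserving relative order within each group.
LATE = re.compile(r'(^|/)compiler_crashers')

def order_tests(paths, late_pattern=LATE):
    normal = [p for p in paths if not late_pattern.search(p)]
    late = [p for p in paths if late_pattern.search(p)]
    return normal + late
```

A single pattern like this covers all four Swift compiler-crasher directories, since their names share the `compiler_crashers` prefix.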

jhenderson added inline comments.Feb 24 2021, 4:16 AM
llvm/utils/lit/lit/TestingConfig.py
127–131

Why are late tests a regular expression, but early tests a list? The inconsistency seems confusing to me.

llvm/utils/lit/tests/late-tests.py
1 ↗(On Diff #324958)

You probably also want a test case showing the interaction between early and late tests (i.e. if both are set what happens, both in the case where they impact different sets, and where one test is specified in both).

davezarzycki added inline comments.Feb 24 2021, 4:38 AM
llvm/utils/lit/lit/TestingConfig.py
127–131

"late_tests" does not need to be a regular expression. It was just expedient given my goals. The "late_tests" variable could be a list of directories if we treat the tests therein as being "late".

llvm/utils/lit/tests/late-tests.py
1 ↗(On Diff #324958)

Sure. Once we settle on the regex question, I can make this change.

jhenderson added inline comments.Feb 24 2021, 5:31 AM
llvm/utils/lit/lit/TestingConfig.py
127–131

I think users will expect the two to be the same type. I don't mind which way around it is personally, and I think applying the rule (whether early or late) to all tests within a directory, when a directory is specified, would be a good idea.

yln added inline comments.Feb 24 2021, 12:14 PM
llvm/utils/lit/lit/TestingConfig.py
127–131

I agree with James here that it should be the same mechanism and would prefer a non-regex solution.

Maybe "test path prefix" is a good solution? Specify full path to name an individual test or a prefix, to name a directory and have everything inside execute late/early.
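The "test path prefix" idea could be sketched as follows (a hypothetical helper, not part of the patch under review):

```python
def matches_prefix(test_path, prefix):
    # An exact path names a single test; a directory prefix covers every
    # test underneath it (trailing slash optional). Note that a plain
    # string prefix would wrongly match sibling directories like
    # "compiler_crashers_2", so we match on whole path components.
    prefix = prefix.rstrip('/')
    return test_path == prefix or test_path.startswith(prefix + '/')
```

The component-wise check is the main design point: `"compiler_crashers"` should not accidentally cover `compiler_crashers_2/`.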

yln added a comment.Feb 24 2021, 12:48 PM

I also want to state that I am a bit concerned about feature creep and the general usefulness of this effort.

  • Does shaving off 5% of test time for one project justify having it in upstream? Think added complexity, maintenance, and opportunity cost. It is difficult to decide where to draw the line and I will not block this effort, but I wanted to see where other people stand on this.
  • Are there workarounds we have considered? An ugly one would be naming directories to influence order.
  • Annotation of fast and slow tests is manual and needs to be maintained. These things tend to rot over time and the benefit will decrease.

Is the sole goal with early/late tests to optimize execution time or are there any other semantics attached to it? Dave mentioned that the tests he has in mind are also low reward. My understanding is that this means that they "rarely fail/signal an issue that's not shown by other tests".

For quick feedback, we have --incremental (poor name), which runs modified and failed tests first, i.e., which is an approximation of "run high reward tests first".

If we want to optimize overall execution time, maybe we should track test execution times and then run tests in decreasing order the next time around. This would have the benefit that it doesn't need manual annotation (which may rot over time). Maybe it should even be the default? Not to be addressed in this review of course, but I wanted to raise my points.

In D97046#2585820, @yln wrote:

I also want to state that I am a bit concerned about feature creep and the general usefulness of this effort.

  • Does shaving off 5% of test time for one project justify having it in upstream?

My colleagues with slower machines are excited by the ability to push low value tests to the end. So it's not just about performance.

Think added complexity, maintenance, and opportunity cost. It is difficult to decide where to draw the line and I will not block this effort, but I wanted to see where other people stand on this.

"late_tests" shouldn't be any more complicated than "early_tests" and they should share implementation details. To me, deciding that one is good and the other is bad would be hard to defend.

  • Are there workarounds we have considered? An ugly one would be naming directories to influence order.

Lexical hacks were considered during a discussion with my Swift colleagues but it really doesn't scale. People want to organize the test suite into logical groupings, not by testing order.

  • Annotation of fast and slow tests is manual and needs to be maintained. These things tend to rot over time and the benefit will decrease.

Ahem. Nothing bad will happen if the lists are stale, therefore they don't "need" maintenance. And people that care about test ordering will keep the lists up to date. Please understand, I've tried to design the "early_tests" / "late_tests" feature with actual test suites in mind to keep the maintenance burden as low as possible. In my experience, the early/late tests for a given project tend to be fairly stable over time. Also (and in my experience), programmers aren't motivated to fix slow tests unless they're caught during pre-commit/post-commit review, so again, the list is very stable.

Is the sole goal with early/late tests to optimize execution time or are there any other semantics attached to it? Dave mentioned that the tests he has in mind are also low reward. My understanding is that this means that they "rarely fail/signal an issue that's not shown by other tests".

As I wrote earlier, performance is not the sole goal.

For quick feedback, we have --incremental (poor name), which runs modified and failed tests first, i.e., which is an approximation of "run high reward tests first".

I'm not sure what your point is here. The --incremental feature is only useful if one is doing incremental builds (not everybody does, for example: bots and paranoid people). Also, the --incremental feature isn't trying to improve the overall test time. Finally, the --incremental feature is poorly implemented and it doesn't work in rigorous testing environments.

If we want to optimize overall execution time, maybe we should track test execution times and then run tests in decreasing order the next time around. This would have the benefit that it doesn't need manual annotation (which may rot over time). Maybe it should even be the default? Not to be addressed in this review of course, but I wanted to raise my points.

Again, you're assuming incremental development and yes that'd be a great feature for people doing incremental development. But clean build performance matters too. I don't think asking developers to commit a sorted list of tests to a project to improve clean build performance is worth the time and effort for most project maintainers.

To repeat myself, I tried to design the "early_tests" and "late_tests" feature to be in the sweet spot of trade offs for all involved.

I feel like this should be handled by lit transparently, like recording the time it took to run each test last time, then somehow deciding on the new test run order based on that,

I feel like this should be handled by lit transparently, like recording the time it took to run each test last time, then somehow deciding on the new test run order based on that,

I agree that would be the ideal solution for incremental builds, but unless that recording is committed back to the repository (and periodically updated), it won't help clean builds (i.e. bots and paranoid people).

I feel like this should be handled by lit transparently, like recording the time it took to run each test last time, then somehow deciding on the new test run order based on that,

I agree that would be the ideal solution for incremental builds, but unless that recording is committed back to the repository (and periodically updated), it won't help clean builds (i.e. bots and paranoid people).

Yep.

I've implemented the requested prefix matching but, having done so, I'd like to request that we just replace early_tests with test_phases, which is a dictionary of paths/path-prefixes to integers, where zero is the default phase, negative numbers represent early phases, and positive numbers represent late phases. This gives users maximal flexibility.

If one wants minimal maintenance, then put just a few slow tests in the -1 phase. (And Swift can put its four compiler-crasher test prefixes in the +1 phase.)

If one doesn't mind regularly refreshing the dictionary, then by all means, commit a big list and use as many phases as you want.

Thoughts?

Ping. I'd like to fully generalize the recently introduced early_tests into test_phases. This will address earlier concerns and allow people to scale their test phasing with their maintenance goals/non-goals. For example, Swift's low-maintenance config would look like this:

config.test_phases = {
    "stdlib/CharacterPropertiesLong.swift": -1,
    "stdlib/FixedPoint.swift.gyb": -1,
    "IDE/complete_ambiguous.swift": -1,
    "stdlib/UnicodeTrieGenerator.gyb": -1,
    "IDE/complete_value_expr.swift": -1,
    "compiler_crashers": +1,
    "compiler_crashers_fixed": +1,
    "compiler_crashers_2": +1,
    "compiler_crashers_2_fixed": +1,
}
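A runner could interpret such a dictionary roughly like this (a sketch under the assumption that keys are exact paths or directory prefixes, longest match wins, and unlisted tests fall into phase 0):

```python
def phase_of(test_path, phases):
    # Longest matching path or directory prefix wins; unlisted tests
    # default to phase 0 (the "normal" phase).
    best, best_len = 0, -1
    for prefix, phase in phases.items():
        p = prefix.rstrip('/')
        if (test_path == p or test_path.startswith(p + '/')) and len(p) > best_len:
            best, best_len = phase, len(p)
    return best

def sort_by_phase(tests, phases):
    # Python's sort is stable, so the order within a phase is unchanged.
    return sorted(tests, key=lambda t: phase_of(t, phases))
```

Negative phases run first, phase 0 in the middle, positive phases last, which matches the description above.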

Thoughts? Feedback?

I really like the whole 'priority'/nice system instead of just having early/later tests. We do have some really long tests in LLDB (e.g., the ones that start a simulator on the machine) and being able to push those few tests even before the normal 'long' tests would be great.

jdenny added a comment.Mar 1 2021, 6:39 AM

While it's beyond what I need at the moment, I agree that test_phases looks better than early_tests, late_tests, etc.

It's not clear to me whether it's useful to be able to generate a test_phases config automatically based on the time it takes to run tests. If it is, that could be a separate patch.

davezarzycki retitled this revision from [lit] Add "late_tests" test suite config option to [lit] Generalize `early_tests`.
davezarzycki edited the summary of this revision. (Show Details)

This is the generalized update as previously discussed.

Herald added a project: Restricted Project. Mar 2 2021, 5:17 AM

Let's rename EARLY_TESTS_THEN_BY_NAME while we're at it.

Harbormaster completed remote builds in B91548: Diff 327416.

I feel like this should be handled by lit transparently, like recording the time it took to run each test last time, then somehow deciding on the new test run order based on that,

+1.

This seems like the right thing to do to me. The "early"/phase mechanism looks very ad hoc to me.
It also isn't clear to me how it would play with a "record time and replay" solution, which would be the desired default for me.

I agree with the idea. I haven't had a chance to look at this properly yet (I'm struggling with a health issue this week). Hopefully later in the week I will.

mlir/test/Unit/lit.cfg.py
16

I feel like this should be handled by lit transparently, like recording the time it took to run each test last time, then somehow deciding on the new test run order based on that,

+1.

This seems like the right thing to do to me. The "early"/phase mechanism looks very ad hoc to me.
It also isn't clear to me how it would play with a "record time and replay" solution, which would be the desired default for me.

I don't know if you saw my response to @lebedev.ri but I'll try to paraphrase: I agree that recording and then sorting the order would be ideal for people doing incremental builds/testing but it won't help clean builds (bots and paranoid people) unless that sorted list gets committed back to the repository. I'm not looking to propose that anybody do that or that it be the only way to get good testing performance. Some people just want a low maintenance config that they update maybe once or twice a year at most. I.e. the top five or so slowest tests (and maybe in the case of Swift, putting the compiler crasher directories towards the end).

If somebody wants to teach lit how to record a sorted list of tests times and then read that list back in again, that'd be great, but I think that's a great followup patch. Thanks!

I agree with the idea. I haven't had a chance to look at this properly yet (I'm struggling with a health issue this week). Hopefully later in the week I will.

Thanks @jhenderson — get well soon.

I feel like this should be handled by lit transparently, like recording the time it took to run each test last time, then somehow deciding on the new test run order based on that,

+1.

This seems like the right thing to do to me. The "early"/phase mechanism looks very ad hoc to me.
It also isn't clear to me how it would play with a "record time and replay" solution, which would be the desired default for me.

I don't know if you saw my response to @lebedev.ri but I'll try to paraphrase: I agree that recording and then sorting the order would be ideal for people doing incremental builds/testing but it won't help clean builds (bots and paranoid people) unless that sorted list gets committed back to the repository.

I saw that, but I don't see it as a problem :)

  1. clean build is a one-time thing, where most of the time will be spent building the compiler rather than running the tests
  2. bots can keep the timing from last run on the system if needed.
  3. I don't know what the "paranoid" concern is here. This is about having a text file with timing on the system; can you clarify? Ultimately, if someone is paranoid about this, they can pay the extra cost of not using the feature.

I'm not looking to propose that anybody do that or that it be the only way to get good testing performance. Some people just want a low maintenance config that they update maybe once or twice a year at most. I.e. the top five or so slowest tests (and maybe in the case of Swift, putting the compiler crasher directories towards the end).
If somebody wants to teach lit how to record a sorted list of tests times and then read that list back in again, that'd be great, but I think that's a great followup patch. Thanks!

Sure, I'm mostly concerned that 1) it isn't clear how this plays with reading back the test times (would they override the "start-phases"?), and 2) the "start-phases" can encourage this to spread much further than the occasional "top five" slowest tests.

mlir/test/Unit/lit.cfg.py
17

What is the motivation for MLIR Unittests to start early?
I'd expect this to be at least documented. The comment here is just repeating *what* the code is doing instead of *why* we're doing this.

davezarzycki added a comment.EditedMar 4 2021, 3:58 AM

I feel like this should be handled by lit transparently, like recording the time it took to run each test last time, then somehow deciding on the new test run order based on that,

+1.

This seems like the right thing to do to me. The "early"/phase mechanism looks very ad hoc to me.
It also isn't clear to me how it would play with a "record time and replay" solution, which would be the desired default for me.

I don't know if you saw my response to @lebedev.ri but I'll try to paraphrase: I agree that recording and then sorting the order would be ideal for people doing incremental builds/testing but it won't help clean builds (bots and paranoid people) unless that sorted list gets committed back to the repository.

I saw that, but I don't see it as a problem :)

  1. clean build is a one-time thing, where most of the time will be spent building the compiler rather than running the tests
  2. bots can keep the timing from last run on the system if needed.
  3. I don't know what the "paranoid" concern is here. This is about having a text file with timing on the system; can you clarify? Ultimately, if someone is paranoid about this, they can pay the extra cost of not using the feature.

By paranoid, I mean people that always do at least one clean build before committing/proposing a change.

I'm not looking to propose that anybody do that or that it be the only way to get good testing performance. Some people just want a low maintenance config that they update maybe once or twice a year at most. I.e. the top five or so slowest tests (and maybe in the case of Swift, putting the compiler crasher directories towards the end).
If somebody wants to teach lit how to record a sorted list of tests times and then read that list back in again, that'd be great, but I think that's a great followup patch. Thanks!

Sure, I'm mostly concerned that 1) it isn't clear how this plays with reading back the test times (would they override the "start-phases"?), and 2) the "start-phases" can encourage this to spread much further than the occasional "top five" slowest tests.

I'm not looking to design or implement the record and reorder mechanism that some want. If somebody else wants to do that, then go for it. If one is wondering how a record and reorder mechanism might interact with this change proposal, then I see three scenarios:

  1. Using the record-and-reorder feature overrides and disables start_phases.
  2. Using start_phases overrides and disables the record-and-reorder feature.
  3. Using start_phases selectively overrides the record-and-reorder feature. For example, something in the explicit -1 phase starts before all of the implicitly reordered tests.
mlir/test/Unit/lit.cfg.py
17

I don't know. Presumably some of them are slow. If you're really curious, then I'd recommend digging through the SCM history to see why the original is_early flag was added.

Was there an RFC pitch for early_tests on llvm-dev?
If not, i feel like it might be best to backout that change, and restart the process..

Was there an RFC pitch for early_tests on llvm-dev?
If not, i feel like it might be best to backout that change, and restart the process..

I tried to select a lot of people as reviewers for the original early_tests change proposal and unless I missed something, the original early_tests seemed generally well received/understood. There was not an RFC on llvm-dev. We can certainly do as you want, but what do you hope to gain by reverting and widening the discussion to all of llvm-dev? Is there something that you think the reviewers here are missing?

mehdi_amini added inline comments.Mar 4 2021, 9:51 AM
mlir/test/Unit/lit.cfg.py
17

This was just copied from LLVM as-is.

This kind of goes into my point about maintainability: is_early was considered "hacky" (per the author's words: https://reviews.llvm.org/D18089) but was coarse-grained and minimalistic. I'm not sure that the finer-grained control introduced here goes in the same spirit.

So to make it clear: I don't think this feature is super intrusive to lit and the code is fine. It is more the "how it will be used in the repo" that makes me pause here.

@mehdi_amini asked "how it will be used" in the code review comments of this change but I want to answer that question in the main thread of this change proposal:

As I've mentioned a few times in this change proposal, the goal I have is to modify Swift to have five or so of the slowest tests listed and push the fast/low-reward "compiler crasher" tests to the end. My goal is twofold: low maintenance and zero changes to infrastructure/workflow (i.e. bots / CI / CMake files / people's build environments). When I look at LLVM proper, I see only one big outlier test that ought to be pushed ahead of the rest for a decent performance win. I haven't checked clang, but I'd wager that fewer than five tests need to be listed for a decent performance win.

The way I look at the proposed start_phases feature is that it's a hinting mechanism, no more, no less. People aren't obligated to use it, and the hints are just about a better user experience, not correctness. If we need to put the word "hint" into the docs or even config variable, then I'd support that.

Finally, if people still have a strong aversion to this hinting based approach, I can look into the record-and-reorder approach. It's less than ideal for some scenarios that I care about but I can live with it. And unless I'm missing something, I don't see any precedent for lit recording anything to the test output directory.

mlir/test/Unit/lit.cfg.py
17

Maybe we should just remove/deprecate the "is_early" hack then.

As to "how it will be used", I'll answer that in the main review feedback thread.

mehdi_amini added inline comments.Mar 5 2021, 9:59 AM
mlir/test/Unit/lit.cfg.py
17

It is very possible that my "fear" is over-exaggerated and only ~5 tests would be listed in the "phase". I won't really object to your patch here; feel free to proceed (assuming no one else has strong objections).
If you can work "hint" into the name somehow, that'd be better indeed!

yln added a comment.Mar 5 2021, 1:19 PM

@lebedev.ri
I don't think we need a RFC for this as long as nothing changes for people who don't use the early_tests or new, generalized ordering feature. We shouldn't make Dave pay for past debts. (early_tests was introduced 5 years ago!)

@davezarzycki
Thanks for explaining so patiently!

I now understand that there are two separate, but related features: record-and-reorder, and manually declared test phases. Let me summarize my understanding so you can check it.

record-and-reorder:
The sole goal here is to optimize execution time. lit can do this transparently, so maybe it would even make a good default so everyone can benefit. Making it the default is maybe something that would benefit from an RFC so we can learn about people's concerns.
The one detail I want to push back on is that Dave argued that this would not apply to clean/full builds. Assume we store the order into a file. With enough motivation there is certainly a way to configure CI bots not to delete this file (or place it somewhere where it wouldn't be deleted).

manually-declared order:
Multiple goals: optimize execution time, execute high/low value tests first/last. Any other reasons to control order?
Although it is unfortunate that we have some overlap with --incremental here, I get that it doesn't accomplish all of these goals at the same time.

For me, the question is now whether the additional utility of "manually-declared order" justifies having it or if we should just go with "record-and-reorder". I think Dave is trying to make the argument that the answer here is "yes". I can see his arguments and would be happy to sign off on an implementation, especially if the underlying machinery could be re-used to implement the other.

Design-wise I would prefer a solution that directly states test path prefixes, not "test phases". I think your sketch already did this, but let me restate:

# simply a list (or dict of `path prefix -> priority`)
config.test_order = [
    'very/slow/test.c',
    'directory/of/slow-tests/',
    '<magic-value-for-all-other-tests>',  # name just for demonstration purposes ;)
    'directory/of/low-value-tests/',
]

In lit configs, prepending to this list would mean "execute first", appending would mean "execute last".
And I imagine that "record-and-reorder" could be implemented (in a separate effort) on top of this by expanding <magic-value-for_all-other-tests> with the recorded order. (This would mean that the manually declared order is always respected.)
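As a rough sketch of how a runner might consume such a list, the following orders tests by the first matching path prefix, with unmatched tests sorting at the position of the magic placeholder. This is purely illustrative; `order_key` and the placeholder name are not an existing lit API:

```python
# Hypothetical sketch: sort tests by the first matching prefix in
# config.test_order; tests that match no explicit entry sort at the
# position of the magic "all other tests" placeholder.
OTHERS = '<magic-value-for_all-other-tests>'

def order_key(test_path, test_order):
    for i, prefix in enumerate(test_order):
        if prefix != OTHERS and test_path.startswith(prefix):
            return i
    # no explicit match: run where the placeholder sits
    return test_order.index(OTHERS)

order = ['very/slow/test.c', OTHERS, 'directory/of/low-value-tests/']
tests = ['a/fast.c', 'very/slow/test.c', 'directory/of/low-value-tests/x.c']
print(sorted(tests, key=lambda t: order_key(t, order)))
# → ['very/slow/test.c', 'a/fast.c', 'directory/of/low-value-tests/x.c']
```

Because Python's sort is stable, tests sharing a prefix keep their original relative order.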

Let me know what you think

davezarzycki added a comment.EditedMar 6 2021, 5:28 AM
In D97046#2607752, @yln wrote:


Hi @yln,

My opinions and approach to this change request have evolved as the conversation has progressed. At this point, I've come around to the record-and-reorder approach. It's more work, and it's not ideal for every scenario that I care about, but it's what I want most of the time. Also, record-and-reorder seems to excite people more in the conversation so far, and the hint-based "start phase" approach conflates performance goals with test prioritization in general, which confuses people.

I'm strongly considering abandoning this change in favor of a new change proposal with the record-and-reorder approach.

And if people want to simplify lit, then I think a good case could be made to rip out all of the code for manually running tests earlier, since these approaches were arguably just performance hacks.

Thoughts?

Dave

I have a patch that implements the record-and-reorder feature. It turned out to be simpler than I feared. Now I just need to figure out how to write a test for it that makes sense and is reliable. For ninja check-llvm, the record-and-reorder mode completes the test suite 51% faster on my 48-core machine. I think it's time to abandon this change proposal.

jdenny added a comment.Mar 7 2021, 6:50 AM


Nice! Thanks for all your work.

If I understand how the record-and-reorder feature works, I assume you can test the record and reorder phases separately. The reorder phase should be easy to test with -j1. The record phase is probably harder to test because it depends on test times. You could at least check that it produces a sanely formatted record. I suppose it might be possible to also have one test sleep for a long time and then check that it is ordered before one that terminates immediately, but I'm not sure how brittle that will be on the bots, and it will slow down lit's test suite.

davezarzycki abandoned this revision.Mar 7 2021, 7:54 AM


Correct and correct. As with all tests that rely on timing, a sufficiently busy/slow machine will break whatever timing assumptions are made.

I'll finish the work and create a new Phab review tomorrow.

Wow, that's amazing! I love the outcome here :)

For people following along, here is the replacement differential: https://reviews.llvm.org/D98179