
[lit] Reliable progress indicator and ETA
ClosedPublic

Authored by lebedev.ri on Mar 22 2021, 6:07 AM.

Details

Summary

The quality of lit's progress bar and ETA has always bothered me.

For example, given ./bin/llvm-lit /repositories/llvm-project/clang/test/CodeGen* -sv
at 1%, it says it will take 10 more minutes,
at 25%, it says it will take 1.25 more minutes,
at 50%, it says it will take 30 more seconds,
and in the end it finishes with Testing Time: 39.49s. That's wildly imprecise.

Currently, it assumes that every single test will take the same amount of time to run on average.
This is a somewhat reasonable approximation overall, but it is quite clearly imprecise,
especially in the beginning.

But after D98179, we can do better! We now know how long the tests took to run last time.
So we can build a better ETA predictor by accumulating the time spent already,
the time that will be spent on the remaining tests for which we know the previous time,
and, for the tests for which we don't have a previous time, the average time
over the tests for which we do know a current or previous run time.
It would be better to use the median, but I'm wary of the cost that may incur.
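The scheme described above can be sketched roughly like this (the function and parameter names here are hypothetical, for illustration only; the actual implementation lives in llvm/utils/lit/lit/display.py):

```python
# Rough sketch of the ETA scheme described above. All names are
# hypothetical; this is not the actual lit implementation.

def predict_progress(elapsed, finished_times, remaining_known_times,
                     num_remaining_unknown):
    """Estimate remaining time and completion fraction.

    elapsed: wall time spent so far in this run.
    finished_times: current-run times of tests that already finished.
    remaining_known_times: previous-run times of remaining tests we have data for.
    num_remaining_unknown: count of remaining tests with no previous-run time.
    """
    known = finished_times + remaining_known_times
    # Average over tests with a known current or previous run time.
    avg = sum(known) / len(known) if known else 0.0
    # Remaining = known previous times, plus the average for unknown tests.
    remaining = sum(remaining_known_times) + num_remaining_unknown * avg
    total = elapsed + remaining
    percent = elapsed / total if total > 0.0 else 0.0
    return remaining, percent
```

For example, with 10s elapsed, two finished tests (1s and 3s), two remaining tests with known previous times of 2s each, and two unknown tests, the average is 2s, so the predicted remaining time is 2 + 2 + 2*2 = 8s.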

Now, on first run of ./bin/llvm-lit /repositories/llvm-project/clang/test/CodeGen* -sv
at 10%, it says it will take 30 seconds,
at 25%, it says it will take 50 more seconds,
at 50%, it says it will take 27 more seconds,
and in the end finishes with Testing Time: 41.64s. That's pretty reasonable.

And on second run of ./bin/llvm-lit /repositories/llvm-project/clang/test/CodeGen* -sv
at 1%, it says it will take 1 more minute,
at 25%, it says it will take 30 more seconds,
at 50%, it says it will take 19 more seconds,
and in the end finishes with Testing Time: 39.49s. That's amazing, I think!

I think people will love this :)

Note that I currently have not added any test coverage here.
I guess I could try, but I'm not really sure whether this is testable.

Diff Detail

Event Timeline

lebedev.ri created this revision. Mar 22 2021, 6:07 AM
lebedev.ri requested review of this revision. Mar 22 2021, 6:07 AM
davezarzycki resigned from this revision. Mar 22 2021, 6:43 AM

Interesting. Thanks for looking into this and building on D98179. I'm sure that people will appreciate this change. :-)

(I'm not personally invested in the outcome, nor am I looking to become a regular contributor to lit, so I'll resign and defer to regular lit reviewers.)

No problem. Thank you for D98179!

yln accepted this revision. Mar 22 2021, 11:01 AM

Very cool, thanks! :)

LGTM, with nits.

llvm/utils/lit/lit/display.py
4

tests is now the list of test objects, not just a count, right?

9–10

Could now use string interpolation (because we require Python 3.6).

35

The test here should be if test.previous_elapsed: (and flip the if-else) to make sure None is counted as "zero" (in case that could ever happen) and avoid floating point comparison with 0.0. Same below in update().
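A tiny illustration of the suggested pattern (the helper name is hypothetical; the attribute name `previous_elapsed` comes from the comment above):

```python
def has_previous_time(previous_elapsed):
    # Truthiness treats both None and 0.0 as "no usable previous time",
    # which avoids an exact floating-point comparison against 0.0.
    return bool(previous_elapsed)
```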

73–79
This revision is now accepted and ready to land. Mar 22 2021, 11:01 AM
lebedev.ri marked an inline comment as done. Mar 22 2021, 11:02 AM
lebedev.ri added inline comments.
llvm/utils/lit/lit/display.py
4

Yep.

lebedev.ri marked 4 inline comments as done.

@yln @mehdi_amini thank you for taking a look!
Addressing review notes.

Shall I wait for more review here, or can I commit this?

llvm/utils/lit/lit/display.py
9–10

I think I'd prefer to leave that as-is.
If wanted, an en-masse cleanup could be performed later.

jhenderson added inline comments. Mar 23 2021, 1:36 AM
llvm/utils/lit/lit/display.py
51

Is it a major bit of reworking to confirm whether median is slower than mean? I guess if it's straightforward, you could relatively quickly compare the total time for a significant set of tests before and after to see if it really is slower.

lebedev.ri marked an inline comment as done. Mar 23 2021, 1:43 AM

@jhenderson thank you for taking a look!

llvm/utils/lit/lit/display.py
51

It certainly won't be faster, for obvious reasons. The question is how much overhead it will induce.
For a small number of tests (N=1'000, maybe N=10'000) it should be fine,
but for N=100'000 I suspect it will not be good.
In the end, it would mainly help only on the first run, when we don't have previous times for many of the tests.

Unless you insist, I think that could be left as a follow-up improvement?
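For what it's worth, the overhead question is easy to measure in isolation. A throwaway sketch with synthetic data (not part of the patch), using the standard-library `statistics` and `timeit` modules:

```python
import statistics
import timeit

# Synthetic per-test times, roughly the N=100'000 case discussed above.
times = [float(i % 97) for i in range(100_000)]

# A single pass suffices for the mean; the median needs a (partial) sort,
# so it is expected to be the more expensive of the two.
mean_cost = timeit.timeit(lambda: sum(times) / len(times), number=10)
median_cost = timeit.timeit(lambda: statistics.median(times), number=10)

print(f"mean: {mean_cost:.4f}s, median: {median_cost:.4f}s for 10 iterations")
```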

I've not looked at this area enough to give a full LGTM, but am happy if others are.

llvm/utils/lit/lit/display.py
51

Fair enough. Not something I've thought about too much, and you're right that it could be a future improvement if really needed.

This revision was automatically updated to reflect the committed changes.
lebedev.ri marked an inline comment as done.

CC @goncharov

It looks like this one is causing the Windows premerge checks to time out; see e.g. the run for this particular patch: https://buildkite.com/llvm-project/premerge-checks/builds/30849

When running ninja check-lit in a separate test setup on Windows, it fails (and hangs) with the following message:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\threading.py", line 954, in _bootstrap_inner
    self.run()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\pool.py", line 592, in _handle_results
    cache[job]._set(i, obj)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\multiprocessing\pool.py", line 776, in _set
    self._callback(self._value)
  File "C:\code\llvm-project\llvm\build\bin\..\..\utils\lit\lit\display.py", line 98, in update
    percent = self.progress_predictor.update(test)
  File "C:\code\llvm-project\llvm\build\bin\..\..\utils\lit\lit\display.py", line 59, in update
    return self.time_elapsed / total_time
ZeroDivisionError: float division by zero

I went ahead and pushed a trivial fix for the division by zero. (As long as the patches the premerge bot tests are based on a version of the monorepo from before the division-by-zero fix, though, the premerge jobs will hang and time out, I guess.)
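The shape of such a guard is presumably something like this (a sketch of the idea, not the exact committed change; the function and parameter names mirror the traceback above but are otherwise hypothetical):

```python
def progress_fraction(time_elapsed, total_time):
    # Guard the division: total_time can be 0.0 if the progress callback
    # fires before any timing data has accumulated.
    if total_time <= 0.0:
        return 0.0
    return time_elapsed / total_time
```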

Thanks!
How can total_time even be 0.0 there?

Not entirely sure. For some reason it runs this print routine a couple of times right away (before any tests have completed) when I invoke it as ninja check-lit, but not when I run it as python bin\llvm-lit.py -sv ..\utils\lit\tests. (On Windows it uses SimpleProgressBar rather than the more elaborate one anyway.)