This is an archive of the discontinued LLVM Phabricator instance.

[lit] - Allow 1 test to report multiple micro-test results to provide support for microbenchmarks.
Closed, Public

Authored by homerdin on Feb 14 2018, 2:09 PM.

Details

Summary

These changes allow a Result object to have nested Result objects in order to support microbenchmarks. Currently lit is restricted to reporting one result object per test; this change adds support for a test that wants to report individual timings for individual kernels.

This revision is the result of the discussions I have seen in https://reviews.llvm.org/D32272#794759, https://reviews.llvm.org/D37421#f8003b27 and https://reviews.llvm.org/D38496. It separates out the changes proposed in https://reviews.llvm.org/D40077.

With this change I will be adding the LCALS (Livermore Compiler Analysis Loop Suite) collection of loop kernels to the LLVM test suite using the Google Benchmark library (https://reviews.llvm.org/D43319).

Previously, microbenchmarks were handled by using macros to group microbenchmarks together and build many executables while still getting a grouped timing (MultiSource/TSVC). Recently the Google Benchmark library was added to the test suite and used via a litsupport plugin. However, the one-test/one-result limitation restricted its use to passing a runtime option that runs only one microbenchmark per hand-written test (MicroBenchmarks/XRay), so the same executable is run many times with different hand-written tests. I will update the litsupport plugin to use the new functionality (https://reviews.llvm.org/D43316).

These changes allow lit to report micro-test results if desired, so that many precise timing results can be obtained from a single run of a single test executable.
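Roughly, a benchmark test format (such as the litsupport plugin in https://reviews.llvm.org/D43316) would use the new API along these lines. This is only a sketch, not the actual plugin code; it assumes the Result.addMicroResult(name, result) signature added by this patch, and the kernel names and timings are placeholders:

  import lit.Test

  def build_result(kernel_timings):
      # One top-level result for the single run of the benchmark executable.
      result = lit.Test.Result(lit.Test.PASS)
      # kernel_timings would come from parsing the benchmark library's output,
      # e.g. {'BM_INIT3_RAW/171': 0.2248, 'BM_INIT3_RAW/5001': 15.1119}.
      for kernel, seconds in sorted(kernel_timings.items()):
          micro = lit.Test.Result(lit.Test.PASS)
          micro.addMetric('exec_time', lit.Test.toMetricValue(seconds))
          result.addMicroResult(kernel, micro)
      return result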

Example Output from LCALS:

Terminal:

PASS: test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test (1 of 1)
********** TEST 'test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test' RESULTS **********
MicroBenchmarks: 12 
compile_time: 6.9610 
hash: "5075a3ae907cf9631cdc4cf8401cbfb3" 
link_time: 0.0426 
**********
*** MICRO-TEST: BM_IF_QUAD_RAW/171
    exec_time:  2.6995 
*** MICRO-TEST: BM_IF_QUAD_RAW/44217
    exec_time:  698.8880 
*** MICRO-TEST: BM_IF_QUAD_RAW/5001
    exec_time:  78.9838 
*** MICRO-TEST: BM_INIT3_RAW/171
    exec_time:  0.2248 
*** MICRO-TEST: BM_INIT3_RAW/44217
    exec_time:  168.0970 
*** MICRO-TEST: BM_INIT3_RAW/5001
    exec_time:  15.1119 
*** MICRO-TEST: BM_MULADDSUB_RAW/171
    exec_time:  0.4491 
*** MICRO-TEST: BM_MULADDSUB_RAW/44217
    exec_time:  169.6760 
*** MICRO-TEST: BM_MULADDSUB_RAW/5001
    exec_time:  16.1443 
*** MICRO-TEST: BM_TRAP_INT_RAW/171
    exec_time:  2.0922 
*** MICRO-TEST: BM_TRAP_INT_RAW/44217
    exec_time:  540.9620 
*** MICRO-TEST: BM_TRAP_INT_RAW/5001
    exec_time:  61.1846

Partial JSON output:

"tests": [
  {
    "code": "PASS",
    "elapsed": null,
    "metrics": {
      "exec_time": 540.962
    },
    "name": "test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test:BM_TRAP_INT_RAW/44217",
    "output": ""
  },
  {
    "code": "PASS",
    "elapsed": null,
    "metrics": {
      "exec_time": 2.6995
    },
    "name": "test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test:BM_IF_QUAD_RAW/171",
    "output": ""
  },
  {
    "code": "PASS",
    "elapsed": null,
    "metrics": {
      "exec_time": 169.676
    },
    "name": "test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test:BM_MULADDSUB_RAW/44217",
    "output": ""
  },
  {
    "code": "PASS",
    "elapsed": null,
    "metrics": {
      "exec_time": 168.097
    },
    "name": "test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test:BM_INIT3_RAW/44217",
    "output": ""
  },
  {
    "code": "PASS",
    "elapsed": null,
    "metrics": {
      "exec_time": 16.1443
    },
    "name": "test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test:BM_MULADDSUB_RAW/5001",
    "output": ""
  },
  {
    "code": "PASS",
    "elapsed": null,
    "metrics": {
      "exec_time": 61.1846
    },
    "name": "test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test:BM_TRAP_INT_RAW/5001",
    "output": ""
  },

Diff Detail

Event Timeline

homerdin created this revision. Feb 14 2018, 2:09 PM
homerdin edited the summary of this revision. Feb 14 2018, 2:42 PM
delcypher requested changes to this revision. (Edited) Feb 15 2018, 1:13 AM

Please add at least one test to lit's test suite for the functionality you're adding. As this patch stands right now it's not obvious how to use this new functionality. A test would demonstrate how this feature is intended to be used and would ensure that it gets noticed if someone breaks your new feature. Documentation would be great but that's probably asking too much.

This revision now requires changes to proceed. Feb 15 2018, 1:13 AM

I think this patch should be merged with D43316 as a first example, and then used in D43319 as a proper example. Once D43316 is approved, D43319 should wait until a few green builds are out on the LNT bots before going in.

Detecting LNT errors in the bots is sometimes not a trivial task, and pushing only one behavioural change at a time helps a lot.

I've also added a few comments on the Python code.

Finally, please use diff -U9999 to show the context before uploading the diff. This is hard to review as it is.

Thanks,
--renato

utils/lit/lit/main.py
86

IIRC Python 2.x and 3.x behave differently regarding .items().

Do we force Python 3 for LIT?

89

We already have the **** enclosure for the whole test; I think this is a bit too verbose.

I'd suggest something like *** MICRO-TEST: micro_test_name

91

guard with if micro_test.metrics:

93

Same as above, we don't need more *** :)

130

What's with the [:-5] ?

homerdin updated this revision to Diff 134434. Feb 15 2018, 8:23 AM
homerdin edited the summary of this revision.

I updated to a full-context diff and have made some of the requested Python changes. I changed the way the name is formed to make it clearer.

@delcypher I had misread your request as asking for tests that show what I am trying to do, which is why I opened the LCALS revision. I will add a test to lit's test suite for these changes. I'll also look into the documentation and see what I can contribute there.

I still need to look into the Python 2.x and 3.x behavior regarding .items() and whether we force Python 3 for LIT.

homerdin marked 3 inline comments as done. Feb 15 2018, 8:27 AM
homerdin added inline comments.
utils/lit/lit/main.py
130

I changed the way I build the name to be clearer. I was using [:-5] to strip ".test" so I could extend the parent test name with the micro-test name before ".test".

homerdin added inline comments. Feb 15 2018, 9:39 AM
utils/lit/lit/main.py
130

I realize this is incorrect, as it only works with ".test" files. I need a safer way to expand the test name.

rengolin added inline comments. Feb 15 2018, 2:16 PM
utils/lit/lit/main.py
130

A regular expression would do nicely.

homerdin updated this revision to Diff 134907. Feb 19 2018, 6:59 AM

I added tests for the console and JSON output of micro-tests. I also made a change to the way the test name is expanded.

rengolin added inline comments. Feb 19 2018, 7:26 AM
utils/lit/tests/test-data-micro.py
4 ↗(On Diff #134907)

Join these two lines with a pipe, otherwise .out files will be kept and can break further tests.

14 ↗(On Diff #134907)

Both the micro-tests and the micro-results are stored in dictionaries, which means their order is not guaranteed.

I assume that's the reason for the {{[0-2]}} regex here, but a similar solution would be needed for the values as well.

I recommend you look at CHECK-DAG:

  CHECK-NEXT: *** MICRO-TEST: test{{[0-2]}}
  CHECK-DAG: micro_value0: 4
  CHECK-DAG: micro_value1: 1.3

This means there are two lines after MICRO-TEST which have those two values, in any order.

It could match a line after the next MICRO-TEST; however, if you have three CHECK-NEXTs for MICRO-TEST, then the last one would fail if a CHECK-DAG matched on the wrong line, so you're covered.

Just repeat that pattern three times and it should be fine.

utils/lit/tests/test-output-micro.py
1 ↗(On Diff #134907)

Same thing here, join the lines with a pipe.

9 ↗(On Diff #134907)

Here, the CHECK-DAG trick won't work, because you have two levels: one for the result and another for the metrics, and there's no way to identify CHECK-DAG1 and CHECK-DAG2.

One way to solve this is to sort the keys before dumping the JSON, so as to get a reproducible answer (that is, of course, if Python's default sort is stable).

homerdin updated this revision to Diff 135279. (Edited) Feb 21 2018, 10:04 AM

Thanks for all the feedback! I've made some changes to the tests:

  • Updated test-data-micro test to use CHECK-DAG and piped output
  • Updated test-output-micro test to use CHECK-DAG. I think this should work since the parent result test-data-micro :: micro-tests.ini will always be appended to the end of the list after all the micro-tests. Since this is testing lit's output file, I sent stdout to /dev/null and explicitly removed the output file to clean the test up.

Looks nearly good to me. I'd just opt for a different naming of the sub-tests. Also some nitpicks of my own and comments on the other nitpicks below:

utils/lit/lit/main.py
86

Using .items() should be fine here (it will give you a full copy in Python 2 and a lazy iterator in Python 3, but apart from the higher memory use they should behave the same, and it doesn't look like we care about the memory usage of a few entries).

134

Appending the subtest name before .test feels odd to me, as there is no actual file with the name testname/microtest.test. How about testname.test:microtest instead?

utils/lit/tests/test-data-micro.py
14 ↗(On Diff #134907)

Actually I would recommend changing the printing code above to something like:

sorted_results = sorted(test.result.microResults.items())
for key, micro_test in sorted_results:

to avoid the non-deterministic output.

utils/lit/tests/test-output-micro.py
9 ↗(On Diff #134907)

lit does the JSON dumping with sort_keys=True, so the output should be deterministic and we don't need the CHECK-DAGs.
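For illustration only (a made-up dictionary, not lit's actual report code): sort_keys=True makes the key order in the dumped JSON deterministic, which is why plain CHECK/CHECK-NEXT lines are enough.

  import json

  # Keys come out alphabetically: "code", "elapsed", "exec_time".
  data = {'exec_time': 540.962, 'code': 'PASS', 'elapsed': None}
  print(json.dumps(data, sort_keys=True, indent=2))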

homerdin updated this revision to Diff 135686. Feb 23 2018, 1:22 PM
homerdin edited the summary of this revision.

Thanks for the comments!

  • Updated the micro-test name expansion to be ParentTest.test:MicroTest (see the sketch after this list)
  • Updated the console output to be deterministic and removed the CHECK-DAGs from the tests
  • Added updated output to the summary
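For clarity, the naming scheme is just a concatenation of the parent test's full name and the micro-test name. An illustration, using names copied from the example output in the summary:

  parent = 'test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test'
  micro = 'BM_TRAP_INT_RAW/44217'
  # Prints the combined name exactly as it appears in the JSON output above.
  print(parent + ':' + micro)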

Thanks @MatzeB, apart from the missing pipe in the RUN line, I'm happy when you are.

cheers,
--renato

utils/lit/tests/test-data-micro.py
14 ↗(On Diff #134907)

Indeed, better.

utils/lit/tests/test-output-micro.py
1 ↗(On Diff #134907)

Forgot the pipe here. :)

9 ↗(On Diff #134907)

Ah, perfect!

homerdin added inline comments. Feb 27 2018, 8:29 AM
utils/lit/tests/test-output-micro.py
1 ↗(On Diff #134907)

This test is checking the output file that lit writes; is there a way to pipe the output file to FileCheck?

rengolin added inline comments. Feb 27 2018, 2:06 PM
utils/lit/tests/test-output-micro.py
1 ↗(On Diff #134907)

Yup. Try --output - for stdout, or just omit --output and see if it goes automatically to stdout.

homerdin updated this revision to Diff 136162. Feb 27 2018, 2:47 PM

Tried some things and was able to get it working using /dev/stdout.

> Tried some things and was able to get it working using /dev/stdout.

That doesn't work on Windows, may not work on Mac?

>> Tried some things and was able to get it working using /dev/stdout.

> That doesn't work on Windows, may not work on Mac?

Should I change it back to rm %t.results.out after FileCheck? I'm not sure how to do this with a pipe.

MatzeB added a comment. (Edited) Mar 8 2018, 11:00 AM

>>> Tried some things and was able to get it working using /dev/stdout.

>> That doesn't work on Windows, may not work on Mac?

> Should I change it back to rm %t.results.out after FileCheck? I'm not sure how to do this with a pipe.

Yes, given that lit has nothing like -o -, writing a temporary file is fine.

And I think most tests don't clean up the temporary files as part of the test. I guess it's better for debugging in case of a failure if the file is still around, and since the temp files are part of the build directory, they will be cleaned up when the build directory is removed as a whole.

I think it should be fine if it's not in the list of patterns recognised as tests...
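A hypothetical RUN-line sketch along those lines (not the exact committed test; %{lit} and %{inputs} are the substitutions used in lit's own test suite, and the FileCheck patterns are elided):

  # Write lit's JSON results to a temporary file, then check that file;
  # the stdout of the lit invocation itself is not needed here.
  # RUN: %{lit} %{inputs}/test-data-micro --output %t.results.out > /dev/null
  # RUN: FileCheck < %t.results.out %s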

homerdin updated this revision to Diff 137637. Mar 8 2018, 12:26 PM

I believe I have made all the requested changes. Thank you for all the feedback!

rengolin accepted this revision. Mar 9 2018, 5:24 AM

LGTM with the sorting comment. Thanks!

utils/lit/lit/main.py
130

Don't you need to sort these ones, too?

homerdin added inline comments. Mar 9 2018, 6:53 AM
utils/lit/lit/main.py
130

Both the micro_test_data and micro_metrics_data dictionaries are sorted when dumped into the JSON file and are not used elsewhere.

This revision was not accepted when it landed; it landed in state Needs Review. Mar 13 2018, 9:40 AM
This revision was automatically updated to reflect the committed changes.
lebedev.ri added inline comments.
llvm/trunk/utils/lit/lit/Test.py
163–164 ↗(On Diff #138213)

What was the motivation for this restriction?
Does LNT fail to import multiple results with the same name from a single .json?

Herald added a project: Restricted Project. Jun 8 2019, 3:45 AM
homerdin marked an inline comment as done. Jun 10 2019, 7:29 AM
homerdin added inline comments.
llvm/trunk/utils/lit/lit/Test.py
163–164 ↗(On Diff #138213)

LNT accepts multiple results with the same name in a .json (from multiple runs with --exec-multisample) as a combination of the individual lit outputs. If we gave two tests the same name in the test suite, we would be combining results from two separate tests. If you want multiple samples you can use --exec-multisample; also, Google Benchmark already reports the mean of many runs.

If you are wondering whether you can add a micro-test result with the same name as a micro-test result that runs under a different executable: that works, because the names are extended with the parent's.

lebedev.ri marked an inline comment as done. Jun 10 2019, 7:42 AM
lebedev.ri added inline comments.
llvm/trunk/utils/lit/lit/Test.py
163–164 ↗(On Diff #138213)

Thanks.
I was actually thinking about using benchmark's repetitions instead of --exec-multisample, but it appears that doesn't work, since a list of measurements can't be passed into addMicroResult().