This is an archive of the discontinued LLVM Phabricator instance.

[XRay][test-suite] Benchmarks for profiling mode implementation
Needs Review · Public

Authored by dberris on Jul 3 2018, 6:38 AM.

Details

Reviewers

kpw
eizan
hans

Commits

rOLDT336970: [XRay][test-suite] Benchmarks for profiling mode implementation
rL336970: [XRay][test-suite] Benchmarks for profiling mode implementation

Summary

This patch adds microbenchmarks for the XRay Profiling Mode
implementation to the test-suite.

The benchmarks included cover:

Cost of the Profiling Mode runtime handler(s) and underlying implementation details when enabled.

Different benchmarks for different call stack traces. Initially showing deep, shallow, and wide function call stacks.

These microbenchmarks can be used to measure progress on the
optimisation work associated with the profiling mode runtime
implementation going forward. They also allow us to better qualify the
cost of the XRay runtime framework (in particular the trampolines) as we
make improvements to those in the future.

Depends on D48653.
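
The three call shapes named in the summary can be pictured with a minimal sketch. This is not the code from the patch: the function names `leaf`, `deep`, and `wide` are hypothetical, there is no XRay instrumentation here, and each function simply returns how many calls it made so the shapes can be compared. In the real benchmarks every such call would pass through the XRay trampolines and the profiling mode handler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for instrumented functions. Each returns the
// number of function calls made, including the call to itself.

// Shallow: a single leaf call.
std::uint64_t leaf() { return 1; }

// Deep: a chain of `depth` nested calls, built with recursion.
std::uint64_t deep(int depth) {
  assert(depth >= 1);
  return depth == 1 ? 1 : 1 + deep(depth - 1);
}

// Wide: one parent call that makes `width` sibling leaf calls.
std::uint64_t wide(int width) {
  assert(width >= 0);
  std::uint64_t calls = 1;  // the parent call itself
  for (int i = 0; i < width; ++i)
    calls += leaf();
  return calls;
}
```

With these definitions, `deep(64)` makes 64 calls that are all on the stack at once, while `wide(64)` makes 65 calls but never has more than two frames live. The benchmarks separate these shapes, presumably because a call-stack-based profiler could treat them differently.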

Diff Detail

Repository: rL LLVM

Event Timeline

dberris created this revision. Jul 3 2018, 6:38 AM

Herald added a subscriber: mgorny. Jul 3 2018, 6:38 AM

Some initial numbers for context:

$ ./MicroBenchmarks/XRay/ProfilingMode/deep-call-bench
Run on (48 X 3500 MHz CPU s)
2018-07-03 23:48:19
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_XRayProfilingDeepCallStack/threads:1        1038 ns       1038 ns     609434
BM_XRayProfilingDeepCallStack/threads:2         742 ns       1485 ns     473966
BM_XRayProfilingDeepCallStack/threads:4         562 ns       2247 ns     243984
BM_XRayProfilingDeepCallStack/threads:8        8798 ns      70379 ns       8000
BM_XRayProfilingDeepCallStack/threads:16      15860 ns     253766 ns       4144
BM_XRayProfilingDeepCallStack/threads:32       7404 ns     176888 ns       6080

And for the shallow calls:

$ ./MicroBenchmarks/XRay/ProfilingMode/shallow-call-bench
Run on (48 X 3500 MHz CPU s)
2018-07-03 23:49:30
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_XRayProfilingShallowStack/threads:1         143 ns        143 ns    4190065
BM_XRayProfilingShallowStack/threads:2         174 ns        348 ns    2429826
BM_XRayProfilingShallowStack/threads:4         142 ns        567 ns    1107872
BM_XRayProfilingShallowStack/threads:8         413 ns       3307 ns     172928
BM_XRayProfilingShallowStack/threads:16        557 ns       8907 ns      66992
BM_XRayProfilingShallowStack/threads:32        935 ns      29903 ns      63200

This is with the bug fixes in D48653. There's a bit more work to do to reduce the costs on the cases for deep, tightly-run call stacks.

We're missing a "wide" call stack version, and my intent is to use some recursive functions to see how we might be able to change those.

fixup: cleanup init/tear-down

Harbormaster completed remote builds in B20015: Diff 154043. Jul 3 2018, 7:57 PM

fixup: add a wide-call benchmark, recursion to simulate call depth

Harbormaster completed remote builds in B20017: Diff 154051. Jul 3 2018, 10:29 PM

dberris edited the summary of this revision. Jul 3 2018, 10:31 PM

Update to add wide call-tree benchmarks.

Harbormaster completed remote builds in B20293: Diff 155117. Jul 11 2018, 11:17 PM

Use thread real-time instead of CPU time in benchmarks and reduce depths to 64 max.

Harbormaster completed remote builds in B20294: Diff 155119. Jul 12 2018, 12:03 AM

dberris mentioned this in D49217: [XRay][compiler-rt] Simplify Allocator Implementation. Jul 12 2018, 12:06 AM

For reference, here's the benchmark run against the current state of head (deep-call-bench):

Run on (48 X 3500 MHz CPU s)
2018-07-12 17:21:44
---------------------------------------------------------------------------------------------
Benchmark                                                      Time           CPU Iterations
---------------------------------------------------------------------------------------------
BM_XRayProfilingDeepCallStack/1/real_time/threads:1          196 ns        196 ns    3548155
BM_XRayProfilingDeepCallStack/1/real_time/threads:2          148 ns        295 ns    5404008
BM_XRayProfilingDeepCallStack/1/real_time/threads:4           76 ns        305 ns    8427916
BM_XRayProfilingDeepCallStack/1/real_time/threads:8           45 ns        357 ns   11881512
BM_XRayProfilingDeepCallStack/1/real_time/threads:16         119 ns       1907 ns    7859872
BM_XRayProfilingDeepCallStack/1/real_time/threads:32        2446 ns      78267 ns     304480
BM_XRayProfilingDeepCallStack/2/real_time/threads:1          288 ns        288 ns    2294615
BM_XRayProfilingDeepCallStack/2/real_time/threads:2          196 ns        391 ns    3854258
BM_XRayProfilingDeepCallStack/2/real_time/threads:4          101 ns        405 ns    6306804
BM_XRayProfilingDeepCallStack/2/real_time/threads:8           77 ns        616 ns    9896592
BM_XRayProfilingDeepCallStack/2/real_time/threads:16          53 ns        855 ns   13341120
BM_XRayProfilingDeepCallStack/2/real_time/threads:32        3000 ns      95820 ns     241472
BM_XRayProfilingDeepCallStack/4/real_time/threads:1          479 ns        479 ns    1369221
BM_XRayProfilingDeepCallStack/4/real_time/threads:2          283 ns        566 ns    2439036
BM_XRayProfilingDeepCallStack/4/real_time/threads:4          158 ns        630 ns    4261736
BM_XRayProfilingDeepCallStack/4/real_time/threads:8          125 ns        998 ns    7218600
BM_XRayProfilingDeepCallStack/4/real_time/threads:16          74 ns       1190 ns    6930480
BM_XRayProfilingDeepCallStack/4/real_time/threads:32        3271 ns      94646 ns     165088
BM_XRayProfilingDeepCallStack/8/real_time/threads:1          847 ns        847 ns     801600
BM_XRayProfilingDeepCallStack/8/real_time/threads:2          500 ns       1000 ns    1309944
BM_XRayProfilingDeepCallStack/8/real_time/threads:4          261 ns       1043 ns    2147344
BM_XRayProfilingDeepCallStack/8/real_time/threads:8          147 ns       1174 ns    4556248
BM_XRayProfilingDeepCallStack/8/real_time/threads:16         102 ns       1634 ns    5224688
BM_XRayProfilingDeepCallStack/8/real_time/threads:32        5324 ns     155831 ns     101920
BM_XRayProfilingDeepCallStack/16/real_time/threads:1        1588 ns       1588 ns     434210
BM_XRayProfilingDeepCallStack/16/real_time/threads:2         894 ns       1787 ns     715548
BM_XRayProfilingDeepCallStack/16/real_time/threads:4         490 ns       1959 ns    1230536
BM_XRayProfilingDeepCallStack/16/real_time/threads:8         277 ns       2213 ns    2229616
BM_XRayProfilingDeepCallStack/16/real_time/threads:16        585 ns       9366 ns     923344
BM_XRayProfilingDeepCallStack/16/real_time/threads:32      16064 ns     513978 ns      59392
BM_XRayProfilingDeepCallStack/32/real_time/threads:1        3247 ns       3247 ns     215805
BM_XRayProfilingDeepCallStack/32/real_time/threads:2        1692 ns       3384 ns     409380
BM_XRayProfilingDeepCallStack/32/real_time/threads:4         925 ns       3700 ns     677716
BM_XRayProfilingDeepCallStack/32/real_time/threads:8         512 ns       4095 ns    1240360
BM_XRayProfilingDeepCallStack/32/real_time/threads:16        513 ns       8206 ns    1681168
BM_XRayProfilingDeepCallStack/32/real_time/threads:32      59676 ns    1909406 ns      15136
BM_XRayProfilingDeepCallStack/64/real_time/threads:1        6374 ns       6373 ns     105030
BM_XRayProfilingDeepCallStack/64/real_time/threads:2        3341 ns       6681 ns     174912
BM_XRayProfilingDeepCallStack/64/real_time/threads:4        1851 ns       7404 ns     278224
BM_XRayProfilingDeepCallStack/64/real_time/threads:8        1003 ns       8026 ns     590256
BM_XRayProfilingDeepCallStack/64/real_time/threads:16        796 ns      12726 ns     701920
BM_XRayProfilingDeepCallStack/64/real_time/threads:32      38176 ns    1152947 ns      16608

With the changes in D49217, we get:

Run on (48 X 3500 MHz CPU s)                                                                
2018-07-12 17:01:35                                                                         
---------------------------------------------------------------------------------------------
Benchmark                                                      Time           CPU Iterations
---------------------------------------------------------------------------------------------
BM_XRayProfilingDeepCallStack/1/real_time/threads:1          202 ns        202 ns    3477313
BM_XRayProfilingDeepCallStack/1/real_time/threads:2          179 ns        357 ns    4581178
BM_XRayProfilingDeepCallStack/1/real_time/threads:4          144 ns        577 ns    5875828
BM_XRayProfilingDeepCallStack/1/real_time/threads:8          125 ns       1003 ns    8311456                               
BM_XRayProfilingDeepCallStack/1/real_time/threads:16         174 ns       2792 ns    9522368
BM_XRayProfilingDeepCallStack/1/real_time/threads:32         146 ns       4687 ns    6358400
BM_XRayProfilingDeepCallStack/2/real_time/threads:1          295 ns        295 ns    2359216
BM_XRayProfilingDeepCallStack/2/real_time/threads:2          239 ns        478 ns    2888910
BM_XRayProfilingDeepCallStack/2/real_time/threads:4          160 ns        638 ns    5410336
BM_XRayProfilingDeepCallStack/2/real_time/threads:8          125 ns        999 ns    7721696
BM_XRayProfilingDeepCallStack/2/real_time/threads:16         103 ns       1647 ns    5126384
BM_XRayProfilingDeepCallStack/2/real_time/threads:32         131 ns       4153 ns    5427136
BM_XRayProfilingDeepCallStack/4/real_time/threads:1          490 ns        490 ns    1326060
BM_XRayProfilingDeepCallStack/4/real_time/threads:2          367 ns        734 ns    2276550                                          
BM_XRayProfilingDeepCallStack/4/real_time/threads:4          249 ns        994 ns    3981604                   
BM_XRayProfilingDeepCallStack/4/real_time/threads:8          174 ns       1394 ns    5467368
BM_XRayProfilingDeepCallStack/4/real_time/threads:16         129 ns       2057 ns    4399568
BM_XRayProfilingDeepCallStack/4/real_time/threads:32         148 ns       4718 ns    4695104
BM_XRayProfilingDeepCallStack/8/real_time/threads:1          873 ns        873 ns     788744
BM_XRayProfilingDeepCallStack/8/real_time/threads:2          535 ns       1071 ns    1177912
BM_XRayProfilingDeepCallStack/8/real_time/threads:4          339 ns       1354 ns    2235540
BM_XRayProfilingDeepCallStack/8/real_time/threads:8          256 ns       2051 ns    3818424                             
BM_XRayProfilingDeepCallStack/8/real_time/threads:16         208 ns       3323 ns    4687040
BM_XRayProfilingDeepCallStack/8/real_time/threads:32         211 ns       6751 ns    3579136
BM_XRayProfilingDeepCallStack/16/real_time/threads:1        1652 ns       1652 ns     414737
BM_XRayProfilingDeepCallStack/16/real_time/threads:2         975 ns       1950 ns     785698
BM_XRayProfilingDeepCallStack/16/real_time/threads:4         601 ns       2402 ns    1400136
BM_XRayProfilingDeepCallStack/16/real_time/threads:8         365 ns       2918 ns    2308440
BM_XRayProfilingDeepCallStack/16/real_time/threads:16        313 ns       5003 ns    1600000
BM_XRayProfilingDeepCallStack/16/real_time/threads:32        256 ns       8177 ns    3033056
BM_XRayProfilingDeepCallStack/32/real_time/threads:1        3419 ns       3418 ns     209959
BM_XRayProfilingDeepCallStack/32/real_time/threads:2        1858 ns       3716 ns     405304
BM_XRayProfilingDeepCallStack/32/real_time/threads:4        1051 ns       4204 ns     690604
BM_XRayProfilingDeepCallStack/32/real_time/threads:8         611 ns       4890 ns    1233168
BM_XRayProfilingDeepCallStack/32/real_time/threads:16        425 ns       6798 ns    1634992
BM_XRayProfilingDeepCallStack/32/real_time/threads:32        336 ns      10737 ns    1958368
BM_XRayProfilingDeepCallStack/64/real_time/threads:1        6438 ns       6438 ns     105337                    
BM_XRayProfilingDeepCallStack/64/real_time/threads:2        3432 ns       6864 ns     197488
BM_XRayProfilingDeepCallStack/64/real_time/threads:4        2477 ns       9906 ns     376460
BM_XRayProfilingDeepCallStack/64/real_time/threads:8        1069 ns       8547 ns     578224
BM_XRayProfilingDeepCallStack/64/real_time/threads:16        684 ns      10949 ns    1079040
BM_XRayProfilingDeepCallStack/64/real_time/threads:32        482 ns      15417 ns    1298176

One thing to notice here is the non-linear (super-linear?) scaling on the number of iterations and overheads. This is good evidence to support the changes in D49217, and allows us to be more confident in qualifying the costs/overheads of the profiling mode implementation.
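
To put numbers on the comparison, the sketch below takes the Time column of the `threads:32` rows from the two tables above (head first, then with D49217 applied) and computes the ratio between them. The names `Sample`, `kThreads32`, and `speedup` are made up for the illustration, and the figures come from a single run of each configuration, so treat the ratios as indicative only.

```cpp
#include <cassert>

// Time column (ns) for one call depth at threads:32, before and after D49217.
struct Sample {
  int depth;
  double before_ns;  // current head
  double after_ns;   // with D49217 applied
};

// Values copied from the threads:32 rows of the two tables above.
constexpr Sample kThreads32[] = {
    {1, 2446, 146},   {2, 3000, 131},   {4, 3271, 148},  {8, 5324, 211},
    {16, 16064, 256}, {32, 59676, 336}, {64, 38176, 482},
};

// How many times faster the "after" measurement is than the "before" one.
double speedup(const Sample &s) {
  assert(s.after_ns > 0);
  return s.before_ns / s.after_ns;
}
```

At 32 threads the ratio ranges from about 16.8x at depth 1 to about 177.6x at depth 32. The single-threaded rows barely move (196 ns to 202 ns at depth 1), and some mid-range rows get slower (45 ns to 125 ns at depth 1 with 8 threads), so the gain shows up at the highest thread count rather than across the board.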

Add some documentation on the benchmarks, make all benchmarks consistent.

eizan accepted this revision. Jul 12 2018, 9:48 PM

This revision is now accepted and ready to land. Jul 12 2018, 9:48 PM

Closed by commit rL336970: [XRay][test-suite] Benchmarks for profiling mode implementation (authored by dberris). Jul 12 2018, 9:53 PM

This revision was automatically updated to reflect the committed changes.

dberris added a child revision: D49363: [XRay][compiler-rt] Segmented Array: Simplify and Optimise. Jul 16 2018, 12:26 AM

dberris mentioned this in rL337342: [XRay][compiler-rt] Simplify Allocator Implementation. Jul 17 2018, 6:58 PM

dberris mentioned this in rCRT337342: [XRay][compiler-rt] Simplify Allocator Implementation.

These tests are flaky when run as part of the full test-suite, as happens when testing the release branch.

Any idea why these would break when run in parallel?

$ /work/llvm-release-test/branches_release_70/sandbox/bin/lit -sv MicroBenchmarks/XRay/ProfilingMode/
FAIL: test-suite :: MicroBenchmarks/XRay/ProfilingMode/deep-call-bench.test (1 of 3)
******************** TEST 'test-suite :: MicroBenchmarks/XRay/ProfilingMode/deep-call-bench.test' FAILED ********************

/work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/deep-call-bench --benchmark_format=csv > /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/Output/deep-call-bench.test.bench.csv

Run on (56 X 3500 MHz CPU s)
2018-08-02 14:24:49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
/work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/Output/deep-call-bench.test_run.script: line 1: 223587 Segmentation fault      /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/deep-call-bench --benchmark_format=csv > /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/Output/deep-call-bench.test.bench.csv

********************
Testing Time: 187.27s
********************
Failing Tests (1):
    test-suite :: MicroBenchmarks/XRay/ProfilingMode/deep-call-bench.test

  Expected Passes    : 2
  Unexpected Failures: 1

(It varies how many and which tests fail.)

I've reverted this from trunk in r338710 and merged to 7.0 in r338711 until this is fixed.

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/CMakeLists.txt

I think there needs to be a llvm_test_run() for each test executable. Otherwise lit fails like this:

UNRESOLVED: test-suite :: MicroBenchmarks/XRay/ProfilingMode/shallow-call-bench.test (341 of 912)
******************** TEST 'test-suite :: MicroBenchmarks/XRay/ProfilingMode/shallow-call-bench.test' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
  File "/work/llvm-release-test/branches_release_70/sandbox/local/lib/python2.7/site-packages/lit-0.7.0.dev0-py2.7.egg/lit/run.py", line 202, in _execute_test_impl
    result = test.config.test_format.execute(test, lit_config)
  File "/work/llvm-release-test/branches_release_70/test-suite.src/litsupport/test.py", line 49, in execute
    litsupport.testfile.parse(context, test.getSourcePath())
  File "/work/llvm-release-test/branches_release_70/test-suite.src/litsupport/testfile.py", line 50, in parse
    raise ValueError("Test has no RUN: line!")
ValueError: Test has no RUN: line!


********************
Testing: 0 .. 10.. 20.. 30.
UNRESOLVED: test-suite :: MicroBenchmarks/XRay/ProfilingMode/wide-call-bench.test (343 of 912)
******************** TEST 'test-suite :: MicroBenchmarks/XRay/ProfilingMode/wide-call-bench.test' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
  File "/work/llvm-release-test/branches_release_70/sandbox/local/lib/python2.7/site-packages/lit-0.7.0.dev0-py2.7.egg/lit/run.py", line 202, in _execute_test_impl
    result = test.config.test_format.execute(test, lit_config)
  File "/work/llvm-release-test/branches_release_70/test-suite.src/litsupport/test.py", line 49, in execute
    litsupport.testfile.parse(context, test.getSourcePath())
  File "/work/llvm-release-test/branches_release_70/test-suite.src/litsupport/testfile.py", line 50, in parse
    raise ValueError("Test has no RUN: line!")
ValueError: Test has no RUN: line!

This worked for me:

llvm_test_run()
llvm_test_executable(deep-call-bench deep-call-bench.cc)
target_link_libraries(deep-call-bench benchmark)
llvm_test_run()
llvm_test_executable(shallow-call-bench shallow-call-bench.cc)
target_link_libraries(shallow-call-bench benchmark)
llvm_test_run()
llvm_test_executable(wide-call-bench wide-call-bench.cc)
target_link_libraries(wide-call-bench benchmark)

However, the tests are flaky when run together, see below.

Herald added a subscriber: jfb. Aug 2 2018, 5:38 AM

Hi Hans,

I *think* this could be caused by the process running the tests
exhausting available RAM (if this is running in a container with
limited memory availability). I'll need to look into this deeper.

I'm OK with running them sequentially, or even just limiting the
number of threads in the benchmarks to reduce the requirements on
available system memory.

Cheers

In D48879#1186442, @dberris wrote:

Hi Hans,

I *think* this could be caused by the process running the tests
exhausting available RAM (if this is running in a container with
limited memory availability). I'll need to look into this deeper.

Yes, that would make sense. I'm also not very familiar with how test-suite runs, but I think it uses a python virtualenv, and maybe that limits the amount of RAM available?

I'm OK with running them sequentially, or even just limiting the
number of threads in the benchmarks to reduce the requirements on
available system memory.

I'm not sure if there's a way to make lit run them sequentially, unless they're turned into a single test executable. If making the benchmarks less "heavy" works, maybe that's the way to go.

In D48879#1186974, @hans wrote:

In D48879#1186442, @dberris wrote:

Hi Hans,

I *think* this could be caused by the process running the tests
exhausting available RAM (if this is running in a container with
limited memory availability). I'll need to look into this deeper.

Yes, that would make sense. I'm also not very familiar with how test-suite runs, but I think it uses a python virtualenv, and maybe that limits the amount of RAM available?

That's plausible.

I'm OK with running them sequentially, or even just limiting the
number of threads in the benchmarks to reduce the requirements on
available system memory.

I'm not sure if there's a way to make lit run them sequentially, unless they're turned into a single test executable. If making the benchmarks less "heavy" works, maybe that's the way to go.

I think doing both things makes sense. I'll merge them into a single binary and reduce the number of threads being executed.

Let me try something, and get you a patch to clean this up.

Re-opening this time with @hans for feedback on consolidation.

This revision is now accepted and ready to land. Aug 14 2018, 6:57 AM

Consolidate benchmarks into a single benchmark binary.

dberris requested review of this revision. Aug 14 2018, 6:59 AM

I patched it in (on the 7.0 branch because that's what I have handy) but it crashes:

FAIL: test-suite :: MicroBenchmarks/XRay/ProfilingMode/profiling-bench.test (442 of 910)
******************** TEST 'test-suite :: MicroBenchmarks/XRay/ProfilingMode/profiling-bench.test' FAILED ********************

/work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/profiling-bench --benchmark_format=csv > /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/Output/profiling-bench.test.bench.csv

Run on (56 X 3500 MHz CPU s)
2018-08-14 16:49:47
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
/work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/Output/profiling-bench.test_run.script: line 1: 25496 Segmentation fault      /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/profiling-bench --benchmark_format=csv > /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/Output/profiling-bench.test.bench.csv

********************
Testing Time: 63.37s
********************
Failing Tests (1):
    test-suite :: MicroBenchmarks/XRay/ProfilingMode/profiling-bench.test

  Expected Passes    : 909
  Unexpected Failures: 1

If I run it directly:

$ /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/profiling-bench --benchmark_format=csv > /work/llvm-release-test/branches_release_70/test-suite-build/MicroBenchmarks/XRay/ProfilingMode/Output/profiling-bench.test.bench.csv

it runs fine.

I'm guessing lit does something to the environment that makes the test fail.

fixup: remove deprecated option in environment variable

Harbormaster completed remote builds in B21452: Diff 160594. Aug 14 2018, 8:25 AM

I could reproduce locally, and so I had a look but this time I needed to rebuild on a clean build dir. PTAL?

In D48879#1199017, @dberris wrote:

I could reproduce locally, and so I had a look but this time I needed to rebuild on a clean build dir. PTAL?

I ran "make clean" on it and tried again, but with the same result

It does produce some output before segfaulting, if that's any help:

name,iterations,real_time,cpu_time,time_unit,bytes_per_second,items_per_second,label,error_occurred,error_message
"BM_XRayProfilingShallowStack/real_time/threads:1",5061883,144.624,134.259,ns,,,,,
"BM_XRayProfilingShallowStack/real_time/threads:2",6094380,94.531,179.154,ns,,,,,

In D48879#1199118, @hans wrote:

In D48879#1199017, @dberris wrote:

I could reproduce locally, and so I had a look but this time I needed to rebuild on a clean build dir. PTAL?

I ran "make clean" on it and tried again, but with the same result

I found that 'make clean' doesn't quite cut it unfortunately, I needed to have a full clean build -- I suspect this is because the lit config is copied/cached when building/running the tests?

Either that or the version that's in 7.0 doesn't have some recent changes to profiling mode. I have been testing with the latest from trunk.

It does produce some output before segfaulting, if that's any help:

name,iterations,real_time,cpu_time,time_unit,bytes_per_second,items_per_second,label,error_occurred,error_message
"BM_XRayProfilingShallowStack/real_time/threads:1",5061883,144.624,134.259,ns,,,,,
"BM_XRayProfilingShallowStack/real_time/threads:2",6094380,94.531,179.154,ns,,,,,

That's a little helpful, but not much -- I was using virtualenv as well locally, but after a full re-build/re-configure I couldn't reproduce the failures. :/

In D48879#1199290, @dberris wrote:

In D48879#1199118, @hans wrote:

In D48879#1199017, @dberris wrote:

I could reproduce locally, and so I had a look but this time I needed to rebuild on a clean build dir. PTAL?

I ran "make clean" on it and tried again, but with the same result

I found that 'make clean' doesn't quite cut it unfortunately, I needed to have a full clean build -- I suspect this is because the lit config is copied/cached when building/running the tests?

I tried with a new build dir like this:

$ CC=/work/llvm-7.0/build.clang6/bin/clang CXX=/work/llvm-7.0/build.clang6/bin/clang++ cmake -GNinja ../test-suite -DTEST_SUITE_LIT=/work/llvm-7.0/build.clang6/bin/llvm-lit
$ ninja check

but got the same error.

Either that or the version that's in 7.0 doesn't have some recent changes to profiling mode. I have been testing with the latest from trunk.

I tried using trunk, but got compile errors:

/work/test-suite/MicroBenchmarks/XRay/ProfilingMode/profiling-bench.cc:33:7: error: use of undeclared identifier '__xray_log_init_mode'; did you mean '__xray_log_register_mode'?
  if (__xray_log_init_mode("xray-profiling", "no_flush=true") !=
      ^~~~~~~~~~~~~~~~~~~~
      __xray_log_register_mode
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:226:23: note: '__xray_log_register_mode' declared here
XRayLogRegisterStatus __xray_log_register_mode(const char *Mode,
                      ^
/work/test-suite/MicroBenchmarks/XRay/ProfilingMode/profiling-bench.cc:33:46: error: no viable conversion from 'const char [14]' to 'XRayLogImpl'
  if (__xray_log_init_mode("xray-profiling", "no_flush=true") !=
                                             ^~~~~~~~~~~~~~~
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:155:8: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'const char [14]' to 'const XRayLogImpl &' for 1st argument
struct XRayLogImpl {
       ^
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:155:8: note: candidate constructor (the implicit move constructor) not viable: no known conversion from 'const char [14]' to 'XRayLogImpl &&' for 1st argument
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:227:60: note: passing argument to parameter 'Impl' here
                                               XRayLogImpl Impl);
                                                           ^
2 errors generated.

In D48879#1202052, @hans wrote:

In D48879#1199290, @dberris wrote:

Either that or the version that's in 7.0 doesn't have some recent changes to profiling mode. I have been testing with the latest from trunk.

I tried using trunk, but got compile errors:

/work/test-suite/MicroBenchmarks/XRay/ProfilingMode/profiling-bench.cc:33:7: error: use of undeclared identifier '__xray_log_init_mode'; did you mean '__xray_log_register_mode'?
  if (__xray_log_init_mode("xray-profiling", "no_flush=true") !=
      ^~~~~~~~~~~~~~~~~~~~
      __xray_log_register_mode
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:226:23: note: '__xray_log_register_mode' declared here
XRayLogRegisterStatus __xray_log_register_mode(const char *Mode,
                      ^
/work/test-suite/MicroBenchmarks/XRay/ProfilingMode/profiling-bench.cc:33:46: error: no viable conversion from 'const char [14]' to 'XRayLogImpl'
  if (__xray_log_init_mode("xray-profiling", "no_flush=true") !=
                                             ^~~~~~~~~~~~~~~
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:155:8: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'const char [14]' to 'const XRayLogImpl &' for 1st argument
struct XRayLogImpl {
       ^
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:155:8: note: candidate constructor (the implicit move constructor) not viable: no known conversion from 'const char [14]' to 'XRayLogImpl &&' for 1st argument
/work/llvm/build.release/lib/clang/8.0.0/include/xray/xray_log_interface.h:227:60: note: passing argument to parameter 'Impl' here
                                               XRayLogImpl Impl);
                                                           ^
2 errors generated.

That's... not right -- did you also update the compiler-rt subproject?

That's... not right -- did you also update the compiler-rt subproject?

Ah, I guess I didn't. After updating, the test builds and fails like with the 7.0 branch.

Right -- I have two other patches under review (D50831 and D50782) which should reduce/eliminate these.

At least I still can't get it to reproduce on my machine, so that's something else. :(

Both the patches have landed, do you mind sync'ing to HEAD and trying again?

In D48879#1203580, @dberris wrote:

Both the patches have landed, do you mind sync'ing to HEAD and trying again?

Sorry, no, still the same error. I tried doing a fresh checkout and build to make sure there's nothing funny in my environment, but that didn't help:

$ svn export https://llvm.org/svn/llvm-project/llvm/trunk llvm && svn export https://llvm.org/svn/llvm-project/cfe/trunk llvm/tools/clang && svn export https://llvm.org/svn/llvm-project/compiler-rt/trunk llvm/projects/compiler-rt 

$ mkdir build && cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ../llvm && ninja
$ cd ..

$ svn export https://llvm.org/svn/llvm-project/test-suite/trunk test-suite
$ cd test-suite
$ wget https://reviews.llvm.org/D48879?download=true -O /tmp/patch
$ patch -p0 < /tmp/patch
$ cd ..

$ mkdir test-suite-build && cd test-suite-build
$ CC=../build/bin/clang CXX=../build/bin/clang++ cmake -GNinja ../test-suite -DTEST_SUITE_LIT=../build/bin/llvm-lit
$ ninja check

Revision Contents

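Path                                                                          Size
test-suite/trunk/MicroBenchmarks/XRay/CMakeLists.txt                          1 line
test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/CMakeLists.txt            20 lines
test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/deep-call-bench.cc        90 lines
test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/shallow-call-bench.cc     84 lines
test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/wide-call-bench.cc        142 lines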

Diff 155319

test-suite/trunk/MicroBenchmarks/XRay/CMakeLists.txt

 add_subdirectory(ReturnReference)
 add_subdirectory(FDRMode)
+add_subdirectory(ProfilingMode)

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/CMakeLists.txt

check_cxx_compiler_flag(-fxray-instrument COMPILER_HAS_FXRAY_INSTRUMENT)
check_cxx_compiler_flag(-fxray-modes=xray-profiling
  COMPILER_HAS_FXRAY_PROFILING)
if(ARCH STREQUAL "x86"
   AND COMPILER_HAS_FXRAY_INSTRUMENT
   AND COMPILER_HAS_FXRAY_PROFILING)
  list(APPEND CPPFLAGS
    -std=c++11 -Wl,--gc-sections
    -fxray-instrument -fxray-modes=xray-profiling)
  list(APPEND LDFLAGS
    -fxray-instrument -fxray-modes=xray-profiling)
  llvm_test_run()
				llvm_test_executable(deep-call-bench deep-call-bench.cc)
				target_link_libraries(deep-call-bench benchmark)
				llvm_test_executable(shallow-call-bench shallow-call-bench.cc)
				target_link_libraries(shallow-call-bench benchmark)
				llvm_test_executable(wide-call-bench wide-call-bench.cc)
				target_link_libraries(wide-call-bench benchmark)
				endif()

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/deep-call-bench.cc

				//===- deep-call-bench.cc - XRay Profiling Mode Benchmarks ----------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// These benchmarks measure the cost of XRay profiling mode when enabled.
				//
				//===----------------------------------------------------------------------===//

				#include <atomic>
				#include <cstdlib>
				#include <iostream>
				#include <mutex>
				#include <thread>
				#include "benchmark/benchmark.h"
				#include "xray/xray_log_interface.h"

				namespace {

				std::atomic<int> some_global{1};

				std::atomic<int> some_temporary{0};

				[[clang::xray_never_instrument]] static void profiling_setup() {
				if (__xray_log_select_mode("xray-profiling") != XRAY_REGISTRATION_OK) {
				std::cerr << "Failed selecting 'xray-profiling' mode. Aborting.\n";
				std::abort();
				}

				if (__xray_log_init_mode("xray-profiling", "no_flush=true") !=
				XRAY_LOG_INITIALIZED) {
				std::cerr << "Failed initializing xray-profiling mode. Aborting.\n";
				std::abort();
				}

				__xray_patch();
				}

				[[clang::xray_never_instrument]] static void profiling_teardown() {
				if (__xray_log_finalize() != XRAY_LOG_FINALIZED) {
				std::cerr << "Failed to finalize xray-profiling mode. Aborting.\n";
				std::abort();
				}

				if (__xray_log_flushLog() != XRAY_LOG_FLUSHED) {
				std::cerr << "Failed to flush xray-profiling mode. Aborting.\n";
				std::abort();
				}
				}

				} // namespace

				[[clang::xray_always_instrument]] __attribute__((weak))
				__attribute__((noinline)) int
				deep(int depth) {
				if (depth == 0) return some_global.load(std::memory_order_acquire);
				return some_global.load(std::memory_order_acquire) + deep(depth - 1);
				}

				// This benchmark measures the cost of XRay instrumentation in deep function
				// call stacks, where every function on the stack is instrumented. The
				// recursion depth is a benchmark input. The recursive function is marked
				// no-inline, has weak symbol binding, and is always instrumented by XRay.
				// Thread 0 initializes the XRay profiling runtime before the timed loop and
				// tears it down after the loop.
				//
				// We also run the benchmark on multiple threads, to identify whether and where
				// the profiling runtime implementation has contention and scalability issues.
				[[clang::xray_never_instrument]] static void BM_XRayProfilingDeepCallStack(
				benchmark::State &state) {
				if (state.thread_index == 0) profiling_setup();

				benchmark::DoNotOptimize(some_temporary = deep(state.range(0)));

				for (auto _ : state)
				benchmark::DoNotOptimize(some_temporary = deep(state.range(0)));

				if (state.thread_index == 0) profiling_teardown();
				}
				BENCHMARK(BM_XRayProfilingDeepCallStack)
				->ThreadRange(1, 32)
				->RangeMultiplier(2)
				->Range(1, 64)
				->UseRealTime();

				BENCHMARK_MAIN();
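
As a rough illustration of what the recursive workload computes, the following standalone sketch reproduces deep() without the XRay attributes or the Google Benchmark harness. The name `deep_sketch` is hypothetical and not part of the patch. It assumes `some_global` stays at its initial value of 1, in which case a call with depth n makes n + 1 calls in total and returns n + 1.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the benchmark's global; assumed to stay at its initial value 1.
static std::atomic<int> some_global{1};

// Hypothetical copy of deep() with the XRay, weak, and noinline attributes
// removed. Each level performs one acquire load, then recurses until depth
// reaches 0.
int deep_sketch(int depth) {
  assert(depth >= 0 && "a negative depth would recurse without bound");
  if (depth == 0) return some_global.load(std::memory_order_acquire);
  return some_global.load(std::memory_order_acquire) + deep_sketch(depth - 1);
}
```

In the patch itself every one of those calls goes through the XRay entry and exit sleds, so the depth argument directly scales the number of instrumented events per benchmark iteration.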

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/shallow-call-bench.cc

				//===- shallow-call-bench.cc - XRay Profiling Mode Benchmarks -------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// These benchmarks measure the cost of XRay profiling mode when enabled.
				//
				//===----------------------------------------------------------------------===//

				#include <atomic>
				#include <cstdlib>
				#include <iostream>
				#include <mutex>
				#include <thread>
				#include "benchmark/benchmark.h"
				#include "xray/xray_log_interface.h"

				namespace {

				std::atomic<int> some_global{0};

				std::atomic<int> some_temporary{0};

				[[clang::xray_never_instrument]] static void profiling_setup() {
				if (__xray_log_select_mode("xray-profiling") != XRAY_REGISTRATION_OK) {
				std::cerr << "Failed selecting 'xray-profiling' mode. Aborting.\n";
				std::abort();
				}

				if (__xray_log_init_mode("xray-profiling", "no_flush=true") !=
				XRAY_LOG_INITIALIZED) {
				std::cerr << "Failed initializing xray-profiling mode. Aborting.\n";
				std::abort();
				}

				__xray_patch();
				}

				[[clang::xray_never_instrument]] static void profiling_teardown() {
				if (__xray_log_finalize() != XRAY_LOG_FINALIZED) {
				std::cerr << "Failed to finalize xray-profiling mode. Aborting.\n";
				std::abort();
				}

				if (__xray_log_flushLog() != XRAY_LOG_FLUSHED) {
				std::cerr << "Failed to flush xray-profiling mode. Aborting.\n";
				std::abort();
				}
				}

				} // namespace

				#define XRAY_WEAK_NOINLINE \
				[[clang::xray_always_instrument]] __attribute__((weak)) \
				__attribute__((noinline))

				XRAY_WEAK_NOINLINE int shallow() {
				return some_global.fetch_add(1, std::memory_order_acq_rel);
				}

				// This benchmark measures the cost of XRay instrumentation in a shallow
				// function call stack, where we instrument a single function call. The
				// function is marked no-inline, has weak symbol binding, and is always
				// instrumented by XRay. Thread 0 initializes the XRay profiling runtime
				// before the timed loop and tears it down after the loop.
				//
				// We also run the benchmark on multiple threads, to identify whether and where
				// the profiling runtime implementation has contention and scalability issues.
				[[clang::xray_never_instrument]] static void BM_XRayProfilingShallowStack(
				benchmark::State &state) {
				if (state.thread_index == 0) profiling_setup();

				benchmark::DoNotOptimize(some_temporary = shallow());
				for (auto _ : state) benchmark::DoNotOptimize(some_temporary = shallow());

				if (state.thread_index == 0) profiling_teardown();
				}
				BENCHMARK(BM_XRayProfilingShallowStack)->ThreadRange(1, 64)->UseRealTime();

				BENCHMARK_MAIN();
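
For readers unfamiliar with `fetch_add`, here is a minimal sketch of the shallow workload on its own, assuming no XRay runtime and no benchmark harness. The names `counter`, `shallow_sketch`, and `shallow_repeat` are hypothetical. Each call returns the value the counter held before the increment, so every call has an observable side effect.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the benchmark's some_global, starting at 0 as in the patch.
static std::atomic<int> counter{0};

// Hypothetical copy of shallow() without the XRay, weak, and noinline
// attributes. fetch_add returns the value held *before* the increment.
int shallow_sketch() {
  return counter.fetch_add(1, std::memory_order_acq_rel);
}

// Calls shallow_sketch() n times and returns the last value it produced.
int shallow_repeat(int n) {
  assert(n > 0 && "need at least one call");
  int last = 0;
  for (int i = 0; i < n; ++i) last = shallow_sketch();
  return last;
}
```

Unlike the deep and wide benchmarks, which only load the shared global, this one writes to it, so under multiple threads the measured time includes the cost of that shared write as well as the profiling runtime.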

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/wide-call-bench.cc

				//===- wide-call-bench.cc - XRay Profiling Mode Benchmarks ----------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// These benchmarks measure the cost of XRay profiling mode when enabled.
				//
				//===----------------------------------------------------------------------===//

				#include <atomic>
				#include <cstdlib>
				#include <iostream>
				#include <mutex>
				#include <thread>
				#include "benchmark/benchmark.h"
				#include "xray/xray_log_interface.h"

				namespace {

				std::atomic<int> some_global{1};

				std::atomic<int> some_temporary{0};

				[[clang::xray_never_instrument]] static void profiling_setup() {
				if (__xray_log_select_mode("xray-profiling") != XRAY_REGISTRATION_OK) {
				std::cerr << "Failed selecting 'xray-profiling' mode. Aborting.\n";
				std::abort();
				}

				if (__xray_log_init_mode("xray-profiling", "no_flush=true") !=
				XRAY_LOG_INITIALIZED) {
				std::cerr << "Failed initializing xray-profiling mode. Aborting.\n";
				std::abort();
				}

				__xray_patch();
				}

				[[clang::xray_never_instrument]] static void profiling_teardown() {
				if (__xray_log_finalize() != XRAY_LOG_FINALIZED) {
				std::cerr << "Failed to finalize xray-profiling mode. Aborting.\n";
				std::abort();
				}

				if (__xray_log_flushLog() != XRAY_LOG_FLUSHED) {
				std::cerr << "Failed to flush xray-profiling mode. Aborting.\n";
				std::abort();
				}
				}

				} // namespace

				#define XRAY_WEAK_NOINLINE \
				[[clang::xray_always_instrument]] __attribute__((weak)) \
				__attribute__((noinline))

				XRAY_WEAK_NOINLINE int wide8() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int wide7() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int wide6() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int wide5() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int wide4() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int wide3() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int wide2() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int wide1() {
				return some_global.load(std::memory_order_acquire);
				}
				XRAY_WEAK_NOINLINE int call(int depth, int width) {
				if (depth == 0) return some_global.load(std::memory_order_acquire);

				auto val = 0;
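				// The cases fall through intentionally: a width of N calls wideN() down to
				// wide1(), and any other width takes the default path and calls all eight.
				switch (width) {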
				default:
				case 8:
				val += wide8();
				case 7:
				val += wide7();
				case 6:
				val += wide6();
				case 5:
				val += wide5();
				case 4:
				val += wide4();
				case 3:
				val += wide3();
				case 2:
				val += wide2();
				case 1:
				val += wide1();
				}

				return some_global.load(std::memory_order_acquire) + val +
				call(depth - 1, width);
				}

				// This benchmark measures the cost of XRay instrumentation in wide function
				// call stacks, where every function is instrumented. The recursion depth is
				// a benchmark input, and a second input selects, through non-looping
				// branching, how many other functions each level calls (the width). The
				// recursive function is marked no-inline, has weak symbol binding, and is
				// always instrumented by XRay. Thread 0 initializes the XRay profiling
				// runtime before the timed loop and tears it down after the loop.
				//
				// We also run the benchmark on multiple threads, to identify whether and where
				// the profiling runtime implementation has contention and scalability issues.
				[[clang::xray_never_instrument]] static void BM_XRayProfilingWideCallStack(
				benchmark::State &state) {
				if (state.thread_index == 0) profiling_setup();

				benchmark::DoNotOptimize(some_temporary =
				call(state.range(0), state.range(1)));
				for (auto _ : state)
				benchmark::DoNotOptimize(some_temporary =
				call(state.range(0), state.range(1)));

				if (state.thread_index == 0) profiling_teardown();
				}
				BENCHMARK(BM_XRayProfilingWideCallStack)
				->ThreadRange(1, 32)
				->RangeMultiplier(2)
				->Ranges({{1, 64}, {1, 8}})
				->UseRealTime();

				BENCHMARK_MAIN();
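
To make the depth and width inputs concrete, the sketch below models the fan-out of call() without XRay or the benchmark harness. The patch uses eight separate functions (wide1 to wide8); the sketch collapses them into one hypothetical helper, `wide_leaf`, and adds a hypothetical `leaf_calls` counter so the number of leaf calls can be inspected. It assumes `some_global` stays at 1, in which case `call_sketch(d, w)` returns d * (1 + w) + 1 for widths 1 through 8.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the benchmark's global; assumed to stay at its initial value 1.
static std::atomic<int> some_global{1};

// Hypothetical counter of leaf calls, for inspection only.
static int leaf_calls = 0;

// Hypothetical single replacement for wide1() through wide8().
static int wide_leaf() {
  ++leaf_calls;
  return some_global.load(std::memory_order_acquire);
}

// Mirrors call(): each non-zero depth level makes `width` leaf calls through
// intentional switch fallthrough, then recurses one level deeper.
int call_sketch(int depth, int width) {
  assert(depth >= 0 && "a negative depth would recurse without bound");
  if (depth == 0) return some_global.load(std::memory_order_acquire);

  int val = 0;
  switch (width) {
    default:  // widths outside 1..8 behave like width 8
    case 8: val += wide_leaf();  // fallthrough
    case 7: val += wide_leaf();  // fallthrough
    case 6: val += wide_leaf();  // fallthrough
    case 5: val += wide_leaf();  // fallthrough
    case 4: val += wide_leaf();  // fallthrough
    case 3: val += wide_leaf();  // fallthrough
    case 2: val += wide_leaf();  // fallthrough
    case 1: val += wide_leaf();
  }

  return some_global.load(std::memory_order_acquire) + val +
         call_sketch(depth - 1, width);
}
```

For example, a depth of 2 and a width of 3 makes six leaf calls plus three calls to the recursive function itself.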

Revision Contents

Diff 155319

test-suite/trunk/MicroBenchmarks/XRay/CMakeLists.txt

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/CMakeLists.txt

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/deep-call-bench.cc

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/shallow-call-bench.cc

test-suite/trunk/MicroBenchmarks/XRay/ProfilingMode/wide-call-bench.cc
