
[test-suite] Adding HACCKernels app
ClosedPublic

Authored by homerdin on Sep 29 2017, 11:17 AM.

Details

Summary
Description:

The Hardware/Hybrid Accelerated Cosmology Code (HACC), a cosmology N-body-code
framework, is designed to run efficiently on diverse computing architectures
and to scale to millions of cores and beyond. The gravitational force is the
only significant force between particles at cosmological scales, and, in HACC,
this force is divided into two components: a long-range component and a
short-range component. The long-range component is handled using a distributed
grid-based solver, and the short-range component is handled by more-direct
particle-particle computations. On many systems, a tree-based multipole
approximation is used to further reduce the computational complexity of the
short-range force. The inner-most computation is a direct N^2 particle-particle
force calculation of the short-range part of the gravitational force. It is this
inner-most calculation that consumes most of the simulation time, is
compute-bound, and is what this benchmark represents.
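The inner N^2 kernel described above can be sketched roughly as follows. This is an illustrative placeholder, not the actual HACCKernels source: the function name, the simple 1/r^3 force law, and the cutoff parameter are all assumptions for the sake of the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a direct N^2 short-range gravitational force
// kernel: each particle accumulates force contributions from every other
// particle that falls within a cutoff radius.
void shortRangeForce(const std::vector<float> &x, const std::vector<float> &y,
                     const std::vector<float> &z, std::vector<float> &fx,
                     std::vector<float> &fy, std::vector<float> &fz,
                     float maxDist2) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    float ax = 0.0f, ay = 0.0f, az = 0.0f;
    for (std::size_t j = 0; j < n; ++j) {
      float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
      float r2 = dx * dx + dy * dy + dz * dz;
      // The conditional accumulation here is a guarded reduction: the
      // accumulators are only updated when the pair is within range.
      if (r2 > 0.0f && r2 < maxDist2) {
        float invR3 = 1.0f / (r2 * std::sqrt(r2)); // 1/r^3
        ax += dx * invR3;
        ay += dy * invR3;
        az += dz * invR3;
      }
    }
    fx[i] = ax; fy[i] = ay; fz[i] = az;
  }
}
```

For two particles separated along x, the computed forces come out equal and opposite, as expected for a pairwise force law.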

Link:

Web: https://xgitlab.cels.anl.gov/hacc/HACCKernels

When run on Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.2GHz:
compile_time: 11.6126 
exec_time: 13.3000

Diff Detail

Event Timeline

homerdin created this revision.Sep 29 2017, 11:17 AM

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less) without losing the characteristic behaviour of this benchmark?
I'd expect that adapting the input for an order-of-magnitude shorter running time will not make the measured execution time more noisy, and it would help make sure the test-suite keeps running as quickly as possible.

Thanks!

Kristof.

hfinkel edited edge metadata.Sep 30 2017, 8:21 AM

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less) without losing the characteristic behaviour of this benchmark?
I'd expect that adapting the input for an order-of-magnitude shorter running time will not make the measured execution time more noisy, and it would help make sure the test-suite keeps running as quickly as possible.

Kristof, just FYI: reducing the runtime is easy to do; however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x on an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan to improve this in the near future. Does this affect your opinion at all?
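The pattern Hal refers to (a reduction guarded by a branch in the loop body) looks like the minimal sketch below; the names are illustrative. To vectorize such a loop, the compiler must if-convert the conditional update into an unconditional select so the accumulation becomes a plain vector reduction.

```cpp
// A conditional (guarded) reduction: sum only the elements below a cutoff.
// Vectorizing this requires if-converting the branch, i.e. rewriting
// "if (cond) sum += v[i];" as "sum += cond ? v[i] : 0.0f;" so that the
// accumulation executes unconditionally on every lane.
float guardedSum(const float *v, int n, float cutoff) {
  float sum = 0.0f;
  for (int i = 0; i < n; ++i)
    if (v[i] < cutoff)
      sum += v[i];
  return sum;
}
```

If the vectorizer cannot perform that if-conversion, the whole loop stays scalar, which is why the hot loops in this benchmark were not vectorized at the time of this review.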

Thanks!

Kristof.

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less) without losing the characteristic behaviour of this benchmark?
I'd expect that adapting the input for an order-of-magnitude shorter running time will not make the measured execution time more noisy, and it would help make sure the test-suite keeps running as quickly as possible.

Kristof, just FYI: reducing the runtime is easy to do; however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x on an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan to improve this in the near future. Does this affect your opinion at all?

I mainly keep focussing on the run-time of newly added benchmarks because I do think the test-suite already takes too long to run.
For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness, as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.
This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).
Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use Linux perf to measure execution time, once a program runs for at least 0.01 seconds, running it for longer doesn't reduce noise further.
Therefore, my limited experiments suggest that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing to be done about the execution time we should aim for when adding a new benchmark. So far, I've followed a personal rule of thumb that up to 1 second is acceptable, but when it's more, there should be a reason why a longer execution time is needed.
Which is why I reacted above.
As I don't think my personal 1-second rule of thumb is any more or less defensible than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask whether there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't change my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason not to shorten this benchmark's running time.

Thanks!

Kristof

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less) without losing the characteristic behaviour of this benchmark?
I'd expect that adapting the input for an order-of-magnitude shorter running time will not make the measured execution time more noisy, and it would help make sure the test-suite keeps running as quickly as possible.

Kristof, just FYI: reducing the runtime is easy to do; however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x on an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan to improve this in the near future. Does this affect your opinion at all?

I mainly keep focussing on the run-time of newly added benchmarks because I do think the test-suite already takes too long to run.

I'm not sure that it runs too long in the abstract, but we certainly waste CPU time by having programs that run for longer than necessary.

For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness,

Agreed.

as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.

At the risk of going too far afield, this has not been universally my experience. When checking for performance on build servers with ~50 hardware threads, I often run the test suite with a level of parallelism matching the number of hardware threads. I'd run the test suite ~15 times and then use ministat (https://github.com/codahale/ministat) to compare the ~15 timings from each test to a previous run. I've found these numbers to be better than quiet-server serial runs for two reasons: First, even a quiet server is noisy and we need to run the tests multiple times (unless they really run for a long time), and second, the cores are in a more production-like state (where, for example, multiple hardware threads are being used and there's contention for the cache). I/O-related times are obviously more variable this way, but I've generally found that tests that run for a second (and as low as 0.2s on some systems) are fine for this kind of configuration. Also, so long as you have more than 30 hardware threads (or something like that, depending on the architecture), it's actually faster this way than a single serial run. Moreover, ministat gives error bars :-)

In case you're curious, there's also a Python version of ministat (https://github.com/lebinh/ministat).
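The core of what ministat computes before looking up a critical value is a two-sample t statistic over the timing runs. The sketch below is a rough illustration of that arithmetic (Welch's form), not ministat's actual code; the names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean and sample standard deviation of a set of timing measurements.
struct Stats {
  double mean;
  double sd;
};

Stats summarize(const std::vector<double> &xs) {
  double sum = 0.0;
  for (double x : xs)
    sum += x;
  const double mean = sum / xs.size();
  double var = 0.0;
  for (double x : xs)
    var += (x - mean) * (x - mean);
  var /= (xs.size() - 1); // sample variance (Bessel's correction)
  return {mean, std::sqrt(var)};
}

// Welch's t statistic for comparing two independent timing samples.
// A large |t| relative to the critical value indicates the difference
// in means is unlikely to be noise.
double welchT(const std::vector<double> &a, const std::vector<double> &b) {
  const Stats sa = summarize(a), sb = summarize(b);
  const double se =
      std::sqrt(sa.sd * sa.sd / a.size() + sb.sd * sb.sd / b.size());
  return (sa.mean - sb.mean) / se;
}
```

Running ~15 timings per configuration, as described above, gives each sample enough degrees of freedom for this comparison to be meaningful.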

This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).

We definitely need more coverage for performance. We also need *a lot* more coverage for correctness (i.e. the fact that I catch far more miscompiles from self hosting than from the test suite is a problem).

Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use Linux perf to measure execution time, once a program runs for at least 0.01 seconds, running it for longer doesn't reduce noise further.
Therefore, my limited experiments suggest that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing to be done about the execution time we should aim for when adding a new benchmark. So far, I've followed a personal rule of thumb that up to 1 second is acceptable, but when it's more, there should be a reason why a longer execution time is needed.

This is also close to my experience; aiming for about a second, maybe two, makes sense.

Which is why I reacted above.
As I don't think my personal 1-second rule of thumb is any more or less defensible than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask whether there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't change my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason not to shorten this benchmark's running time.

Okay. I propose that we shorten the current running time to around 1.5 seconds. That should leave sufficient running time once we start vectorizing the loops.

Thanks!

Kristof

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less) without losing the characteristic behaviour of this benchmark?
I'd expect that adapting the input for an order-of-magnitude shorter running time will not make the measured execution time more noisy, and it would help make sure the test-suite keeps running as quickly as possible.

Kristof, just FYI: reducing the runtime is easy to do; however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x on an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan to improve this in the near future. Does this affect your opinion at all?

I mainly keep focussing on the run-time of newly added benchmarks because I do think the test-suite already takes too long to run.

I'm not sure that it runs too long in the abstract, but we certainly waste CPU time by having programs that run for longer than necessary.

For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness,

Agreed.

as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.

At the risk of going too far afield, this has not been universally my experience. When checking for performance on build servers with ~50 hardware threads, I often run the test suite with a level of parallelism matching the number of hardware threads. I'd run the test suite ~15 times and then use ministat (https://github.com/codahale/ministat) to compare the ~15 timings from each test to a previous run. I've found these numbers to be better than quiet-server serial runs for two reasons: First, even a quiet server is noisy and we need to run the tests multiple times (unless they really run for a long time), and second, the cores are in a more production-like state (where, for example, multiple hardware threads are being used and there's contention for the cache). I/O-related times are obviously more variable this way, but I've generally found that tests that run for a second (and as low as 0.2s on some systems) are fine for this kind of configuration. Also, so long as you have more than 30 hardware threads (or something like that, depending on the architecture), it's actually faster this way than a single serial run. Moreover, ministat gives error bars :-)

In case you're curious, there's also a Python version of ministat (https://github.com/lebinh/ministat).

This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).

We definitely need more coverage for performance. We also need *a lot* more coverage for correctness (i.e. the fact that I catch far more miscompiles from self hosting than from the test suite is a problem).

Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use Linux perf to measure execution time, once a program runs for at least 0.01 seconds, running it for longer doesn't reduce noise further.
Therefore, my limited experiments suggest that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing to be done about the execution time we should aim for when adding a new benchmark. So far, I've followed a personal rule of thumb that up to 1 second is acceptable, but when it's more, there should be a reason why a longer execution time is needed.

This is also close to my experience; aiming for about a second, maybe two, makes sense.

Which is why I reacted above.
As I don't think my personal 1-second rule of thumb is any more or less defensible than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask whether there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't change my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason not to shorten this benchmark's running time.

Okay. I propose that we shorten the current running time to around 1.5 seconds. That should leave sufficient running time once we start vectorizing the loops.

Thanks Hal, SGTM!
Also, thanks for sharing your experience with running benchmarks in parallel - good to see that it shouldn't be too hard to make it beneficial on high-core-count systems.

homerdin updated this revision to Diff 117377.Oct 2 2017, 9:57 AM

Thanks for the feedback and discussion! I adjusted the RUN_OPTIONS as suggested: exec_time: 1.4786.

hfinkel added inline comments.Oct 2 2017, 4:54 PM
MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/GravityForceKernel.cpp
101

As Brian and I discussed offline, I've suggested that we update this to read:

#if _OPENMP >= 201307
#pragma omp simd reduction(+:lax,lay,laz)
#elif defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#endif

So that we'll vectorize the loop. Looking at this again today, it seems like the problem blocking loop vectorization here is just that, by default, we don't self-report a new-enough OpenMP version. I suspect that bumping that version is blocked on unrelated things.
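Applied to a conditional reduction loop, the suggested guard might look like the self-contained sketch below. The function and variable names are illustrative, not from the HACCKernels source; only the preprocessor structure mirrors the snippet suggested above.

```cpp
#include <cstddef>

// Sum per-element force contributions, with a vectorization hint.
// Prefer the portable OpenMP simd pragma when the compiler reports an
// OpenMP version of at least 4.0 (201307); otherwise fall back to the
// Clang-specific pragma, which asserts that vectorization is safe.
float accumulateForces(const float *contrib, const bool *inRange,
                       std::size_t n) {
  float total = 0.0f;
#if defined(_OPENMP) && _OPENMP >= 201307
#pragma omp simd reduction(+ : total)
#elif defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#endif
  for (std::size_t i = 0; i < n; ++i)
    if (inRange[i])
      total += contrib[i];
  return total;
}
```

Both pragmas express the same intent; on compilers that match neither branch, the loop simply compiles without a hint and remains correct.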

homerdin updated this revision to Diff 117527.Oct 3 2017, 7:47 AM

I made the change that Hal mentioned to allow for the loop to vectorize.

I made the change that Hal mentioned to allow for the loop to vectorize.

Could you add a comment to the source to explain why there's a difference in pragmas, for future reference? Ideally we want to minimize the number of clang-specific vectorization pragmas; if the vectorizer isn't working for whatever reason, then that should be noticed, especially if there's a regression somewhere in piping AA information through.

homerdin updated this revision to Diff 119925.Oct 23 2017, 12:50 PM

Added a comment to explain the difference between the two pragmas

hfinkel added inline comments.Nov 2 2017, 6:48 PM
MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/GravityForceKernel.cpp
100

I suggest wording this as follows:

// For the test suite: Clang does not report a high-enough version of OpenMP to enable the pragma below. Moreover, vectorization is desirable regardless of whether OpenMP is enabled (even if Clang's reported version were high enough), so we also use the Clang loop pragma to assume vectorization safety.

homerdin updated this revision to Diff 121470.Nov 3 2017, 6:55 AM

Made the suggested change to the comment explaining the clang-specific pragma.

hfinkel accepted this revision.Nov 3 2017, 10:21 AM

LGTM

This revision is now accepted and ready to land.Nov 3 2017, 10:21 AM
This revision was automatically updated to reflect the committed changes.
zvi added a subscriber: zvi.Nov 7 2017, 10:17 PM

Since the final commit of this patch, rL317483, the AVX2 buildbot is broken: http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/1402

Brian, will you be able to look into this failure right away? If not, please consider reverting the commit until we get this sorted out. If you believe the failure is due to a bug in LLVM, please create a bug report. Thanks.

spatel added a subscriber: spatel.Nov 8 2017, 6:12 AM
spatel added a comment.Nov 8 2017, 6:15 AM
In D38417#918942, @zvi wrote:

Since the final commit of this patch, rL317483, the AVX2 buildbot is broken: http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/1402

Brian, will you be able to look into this failure right away? If not, please consider reverting the commit until we get this sorted out. If you believe the failure is due to a bug in LLVM, please create a bug report. Thanks.

As noted in off-list email, on x86 I'm seeing this output:
$ ./317576fastbroadwell 450
Iterations: 450
Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.585 -2378.97: 0.088137 s
Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.281 -2378.1: 0.089248 s
Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1: 0.091363 s

While the HACCKernels.reference_output is:
Iterations: 450
Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.584 -2378.97
Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.281 -2378.1
Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1

If this test is being built with -ffast-math, we should add an FP tolerance to the output verification?

homerdin updated this revision to Diff 122106.Nov 8 2017, 9:16 AM

Updated to use FP_TOLERANCE since -ffast-math is being used.

spatel added inline comments.Nov 8 2017, 9:48 AM
MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile
4

That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle acceptable for this program? How high can we go before we decide the result is bogus?

You can reproduce this locally if you have at least a Sandybridge to test on. Ie, if you specify -march=nehalem (or nothing), you should see "689.584" in the output, but if you specify -march=sandybridge, you should see "689.585".

I don't know what underlying transforms cause that difference, but that seems like a reasonable error for -ffast-math.

hfinkel added inline comments.Nov 8 2017, 9:55 AM
MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile
4

That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle acceptable for this program? How high can we go before we decide the result is bogus?

Why not? It's a relative tolerance. (We have FP_ABSTOLERANCE for setting the absolute tolerance).

In any case, the changes generally come from the inverse sqrt approximation. I think we should set this to not much more than needed for observed divergence.

Brian, were you able to confirm that this tolerance is sufficient?

spatel added inline comments.Nov 8 2017, 10:07 AM
MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile
4

Ah, sorry - I missed that this was relative. In that case, it should be ok.

homerdin added inline comments.Nov 8 2017, 10:13 AM
MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile
4

Yes, I was able to reproduce the original issue and this tolerance level is sufficient. (689.585 / 689.584) - 1 = .00000145
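The relative comparison behind an FP_TOLERANCE-style check can be sketched as below. This illustrates the arithmetic in Brian's calculation above; it is not the test-suite's actual fpcmp code, and the function name is made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true when a and b agree to within a relative tolerance: the
// absolute difference is measured against the magnitude of the values
// being compared, so 689.585 vs 689.584 differ by only ~1.45e-6
// relatively, even though the absolute difference is 0.001.
bool withinRelTolerance(double a, double b, double relTol) {
  const double diff = std::fabs(a - b);
  const double scale = std::max(std::fabs(a), std::fabs(b));
  return scale == 0.0 ? diff == 0.0 : (diff / scale) <= relTol;
}
```

This is why a small-looking relative tolerance comfortably absorbs the observed -ffast-math divergence, whereas an absolute tolerance would have to be chosen per output magnitude.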

-- Testing: 1 tests, 1 threads --
PASS: test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test (1 of 1)
********** TEST 'test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test' RESULTS **********
compile_time: 0.7596 
exec_time: 0.4761 
hash: "8676d81cf540659eb9466f24e6275c19" 
link_time: 0.0196 
**********
Testing Time: 0.61s
  Expected Passes    : 1
[bhomerding@thing03 HACCKernels]$ less Output/HACCKernels.test.out
Iterations: 450
Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.585 -2378.97
Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.282 -2378.1
Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1
exit 0