This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
LICENSE.TXT
-
MultiSource/Benchmarks/DOE-ProxyApps-C++/
-
Benchmarks/
-
DOE-ProxyApps-C++/
-
CMakeLists.txt
-
HACCKernels/
-
CMakeLists.txt
-
COPYING
2
GravityForceKernel.cpp
-
HACCKernels.h
-
HACCKernels.reference_output
4
Makefile
-
README
-
main.cpp
-
Makefile

Differential D38417

[test-suite] Adding HACCKernels app
ClosedPublic

Authored by homerdin on Sep 29 2017, 11:17 AM.

Download Raw Diff

Details

Reviewers

hfinkel

Commits

rOLDT317701: Revert "[test-suite] Revert rL317483 (HACCKernels Benchmark)"
rOLDT317483: [test-suite] Adding the HACCKernels Benchmark
rOLDT317697: [test-suite] Revert rL317483 (HACCKernels Benchmark)
rL317701: Revert "[test-suite] Revert rL317483 (HACCKernels Benchmark)"
rL317697: [test-suite] Revert rL317483 (HACCKernels Benchmark)
rL317483: [test-suite] Adding the HACCKernels Benchmark

Summary

Description:

The Hardware/Hybrid Accelerated Cosmology Code (HACC), a cosmology N-body-code
framework, is designed to run efficiently on diverse computing architectures
and to scale to millions of cores and beyond. The gravitational force is the
only significant force between particles at cosmological scales, and, in HACC,
this force is divided into two components: a long-range component and a
short-range component. The long-range component is handled using a distributed
grid-based solver, and the short-range component is by more-direct
particle-particle computations. On many systems, a tree-based multipole
approximation is used to further reduce the computational complexity of the
short-range force. The inner-most computation is a direct N^2 particle-particle
force calculation of the short-range part of the gravitational force. It is this
inner-most calculation that consumes most of the simulation time, is
computationally bound, and is what is represented by this benchmark.

Link:

Web: https://xgitlab.cels.anl.gov/hacc/HACCKernels

When run on Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.2GHz:

compile_time: 11.6126 
exec_time: 13.3000

Diff Detail

Event Timeline

homerdin created this revision.Sep 29 2017, 11:17 AM

Herald added a subscriber: mgorny. · View Herald TranscriptSep 29 2017, 11:17 AM

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Thanks!

Kristof.

In D38417#885195, @kristof.beyls wrote:

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

Thanks!

Kristof.

In D38417#885226, @hfinkel wrote:

In D38417#885195, @kristof.beyls wrote:

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

I mainly keep on focussing on the run-time of newly added benchmarks to the test-suite as I do think it is a problem already that the test-suite takes too long to run.
For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness, as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.
This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).
Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use linux perf to measure execution time, as soon as the program runs for 0.01 seconds, just running the program for longer doesn't reduce noise further.
Therefore, my limited experiments suggest to me that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing that could be done on what the execution time is that we should aim for when adding a new benchmark. So far, I've followed a personal rule-of-thumb that up to 1 second is acceptable, but when its more, there should be a reason for why a longer execution time is needed.
Which is why I reacted above.
As I don't think my personal 1 second rule-of-thumb is defendable any more or less than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask the question if there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't make a change to my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason to not shorten the running time of this benchmark as is.

Thanks!

Kristof

In D38417#885236, @kristof.beyls wrote:

In D38417#885226, @hfinkel wrote:

In D38417#885195, @kristof.beyls wrote:

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

I mainly keep on focussing on the run-time of newly added benchmarks to the test-suite as I do think it is a problem already that the test-suite takes too long to run.

I'm not sure that it runs too long in the abstract, but we certainly waste CPU time by having programs that run for longer than necessary.

For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness,

Agreed.

as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.

At the risk of going too far afield, this has not been universally my experience. When checking for performance, on build servers with ~50 hardware threads, I often run with the test suite with a level of parallelism matching the number of hardware threads. I'd run the test suite ~15 times and then use ministat (https://github.com/codahale/ministat) to compare the ~15 timings from each test to a previous run. I've found these numbers to be better than quiet-server serial runs for two reasons: First, even a quiet server is noisy and we need to run the test multiple times (unless they really run for a long time), and second, the cores are in a more production-like state (where, for example, multiple hardware threads are being used and there's contention for the cache). I/O-related times are obviously more variable this way, but I've generally found that tests that run for a second (and as low of 0.2s on some systems) are fine for this kind of configuration. Also, so long as you have more than 30 hardware threads (or something like that, depending on the architecture), it's actually faster this way than a single serial run. Moreover, ministat gives error bars :-)

In case you're curious, there's also a Python version of ministat (https://github.com/lebinh/ministat).

This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).

We definitely need more coverage for performance. We also need *a lot* more coverage for correctness (i.e. the fact that I catch far more miscompiles from self hosting than from the test suite is a problem).

Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use linux perf to measure execution time, as soon as the program runs for 0.01 seconds, just running the program for longer doesn't reduce noise further.
Therefore, my limited experiments suggest to me that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing that could be done on what the execution time is that we should aim for when adding a new benchmark. So far, I've followed a personal rule-of-thumb that up to 1 second is acceptable, but when its more, there should be a reason for why a longer execution time is needed.

This is also close to my experience; aiming for about a second, maybe two, makes sense.

Which is why I reacted above.
As I don't think my personal 1 second rule-of-thumb is defendable any more or less than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask the question if there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't make a change to my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason to not shorten the running time of this benchmark as is.

Okay. I propose that we shorten the current running time to around 1.5 seconds. That should leave sufficient running time once we start vectorizing the loops.

Thanks!

Kristof

In D38417#885249, @hfinkel wrote:

In D38417#885236, @kristof.beyls wrote:

In D38417#885226, @hfinkel wrote:

In D38417#885195, @kristof.beyls wrote:

Hi Brian,

Thanks for working on this!

On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not loose the characteristic behaviour of this benchmark?
I'd expect adapting the input to have an order of magnitude shorter running time will not make the measured execution time more noisy; and it would help in trying to make sure the test-suite keeps on running as quickly as possible.

Kristof, just FYI, reducing the runtime is easy to do, however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan on working on improving this in the near future. Does this affect your opinion at all?

I mainly keep on focussing on the run-time of newly added benchmarks to the test-suite as I do think it is a problem already that the test-suite takes too long to run.

I'm not sure that it runs too long in the abstract, but we certainly waste CPU time by having programs that run for longer than necessary.

For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness,

Agreed.

as we typically:
a) have to run the test-suite in benchmark-mode multiple times to figure out if a performance change was noise or significant.
b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.

At the risk of going too far afield, this has not been universally my experience. When checking for performance, on build servers with ~50 hardware threads, I often run with the test suite with a level of parallelism matching the number of hardware threads. I'd run the test suite ~15 times and then use ministat (https://github.com/codahale/ministat) to compare the ~15 timings from each test to a previous run. I've found these numbers to be better than quiet-server serial runs for two reasons: First, even a quiet server is noisy and we need to run the test multiple times (unless they really run for a long time), and second, the cores are in a more production-like state (where, for example, multiple hardware threads are being used and there's contention for the cache). I/O-related times are obviously more variable this way, but I've generally found that tests that run for a second (and as low of 0.2s on some systems) are fine for this kind of configuration. Also, so long as you have more than 30 hardware threads (or something like that, depending on the architecture), it's actually faster this way than a single serial run. Moreover, ministat gives error bars :-)

In case you're curious, there's also a Python version of ministat (https://github.com/lebinh/ministat).

This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).

We definitely need more coverage for performance. We also need *a lot* more coverage for correctness (i.e. the fact that I catch far more miscompiles from self hosting than from the test suite is a problem).

Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.

There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
My experience on a few X86 and Arm platforms is that if you use linux perf to measure execution time, as soon as the program runs for 0.01 seconds, just running the program for longer doesn't reduce noise further.
Therefore, my limited experiments suggest to me that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
As I said, there's probably lots of arguing that could be done on what the execution time is that we should aim for when adding a new benchmark. So far, I've followed a personal rule-of-thumb that up to 1 second is acceptable, but when its more, there should be a reason for why a longer execution time is needed.

This is also close to my experience; aiming for about a second, maybe two, makes sense.

Which is why I reacted above.
As I don't think my personal 1 second rule-of-thumb is defendable any more or less than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
I just felt I had to ask the question if there was a good reason to make this benchmark run for this long.
Ultimately, vectorizing the hot loops in this benchmark won't make a change to my reasoning above.

In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason to not shorten the running time of this benchmark as is.

Okay. I propose that we shorten the current running time to around 1.5 seconds. That should leave sufficient running time once we start vectorizing the loops.

Thanks Hal, SGTM!
Also thanks for sharing your experience with running benchmarks in parallel - good to see that it shouldn't be too hard to make it beneficial on high core count systems.

Thanks for the feedback and discussion! I adjusted the RUN_OPTIONS as suggested: exec_time: 1.4786.

hfinkel added inline comments.Oct 2 2017, 4:54 PM

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/GravityForceKernel.cpp
102	As Brian and I discussed offline, I've suggested that we update this to read: #if _OPENMP >= 201307 #pragma omp simd reduction(+:lax,lay,laz) #elsif defined(clang) #pragma clang loop vectorize(assume_safety) #endif So that we'll vectorize the loop. Looking at this again today, it seems like the problem blocking loop vectorization here is just that we don't self report, by default, a new enough OpenMP version. I suspect that bumping that version is blocked on unrelated things.

I made the change that Hal mentioned to allow for the loop to vectorize.

In D38417#887050, @homerdin wrote:

I made the change that Hal mentioned to allow for the loop to vectorize.

Could you add a comment to the source to explain why there's a difference in pragmas, for future reference? Ideally we want to minimize as many clang specific vectorization pragmas, if the vectorizer isn't working for whatever reason then it should be noticed, especially if there's a regression somewhere in piping AA information through.

Added a comment to explain the difference between the two pragmas

hfinkel added inline comments.Nov 2 2017, 6:48 PM

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/GravityForceKernel.cpp

101

I suggest wording this as follows:

// For the test suite: Clang does not report a high-enough version of OpenMP to enable the pragma below. Moreover, vectorization is desirable regardless of whether OpenMP is enabled (even if Clang's reported version were high enough), so we also use the Clang loop pragma to assume vectorization safety.

Made the suggested change to the comment for the explanation of clang specific pragma.

LGTM

This revision is now accepted and ready to land.Nov 3 2017, 10:21 AM

Closed by commit rL317483: [test-suite] Adding the HACCKernels Benchmark (authored by homerdin). · Explain WhyNov 6 2017, 6:57 AM

This revision was automatically updated to reflect the committed changes.

Since the final commit of this patch, rL317483, the AVX2 buildbot is broken: http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/1402

Brian, will you be able to look into this failure right away? If not, please consider reverting the commit until we get this sorted out. If you believe the failure is due to a bug in LLVM, please create a bug report. Thanks.

spatel added a subscriber: spatel.Nov 8 2017, 6:12 AM

In D38417#918942, @zvi wrote:

Since the final commit of this patch, rL317483, the AVX2 buildbot is broken: http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/1402

Brian, will you be able to look into this failure right away? If not, please consider reverting the commit until we get this sorted out. If you believe the failure is due to a bug in LLVM, please create a bug report. Thanks.

As noted in off-list email, on x86 I'm seeing this output:
$ ./317576fastbroadwell 450
Iterations: 450
Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.585 -2378.97: 0.088137 s
Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.281 -2378.1: 0.089248 s
Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1: 0.091363 s

While the HACCKernels.reference_output is:
Iterations: 450
Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.584 -2378.97
Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.281 -2378.1
Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1

If this test is being built with -ffast-math, we should add an FP tolerance to the output verification?

Updated to use FP_TOLERANCE since -ffast-math is being used.

spatel added inline comments.Nov 8 2017, 9:48 AM

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile
4	That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle acceptable for this program? How high can we go before we decide the result is bogus? You can reproduce this locally if you have at least a Sandybridge to test on. Ie, if you specify -march=nehalem (or nothing), you should see "689.584" in the output, but if you specify -march=sandybridge, you should see "689.585". I don't know what underlying transforms cause that difference, but that seems like a reasonable error for -ffast-math.

hfinkel added inline comments.Nov 8 2017, 9:55 AM

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile
4	That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle acceptable for this program? How high can we go before we decide the result is bogus? Why not? It's a relative tolerance. (We have FP_ABSTOLERANCE for setting the absolute tolerance). In any case, the changes generally come from the inverse sqrt approximation. I think we should set this to not much more than needed for observed divergence. Brian, were you able to confirm that this tolerance is sufficient?

spatel added inline comments.Nov 8 2017, 10:07 AM

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile
4	Ah, sorry - I missed that this was relative. In that case, it should be ok.

homerdin added inline comments.Nov 8 2017, 10:13 AM

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile

Yes, I was able to reproduce the original issue and this tolerance level is sufficient. (689.585 / 689.584) - 1 = .00000145

-- Testing: 1 tests, 1 threads --
PASS: test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test (1 of 1)
********** TEST 'test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test' RESULTS **********
compile_time: 0.7596 
exec_time: 0.4761 
hash: "8676d81cf540659eb9466f24e6275c19" 
link_time: 0.0196 
**********
Testing Time: 0.61s
  Expected Passes    : 1
[bhomerding@thing03 HACCKernels]$ less Output/HACCKernels.test.out
Iterations: 450
Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.585 -2378.97
Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.282 -2378.1
Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1
exit 0

Revision Contents

Path

Size

LICENSE.TXT

1 line

MultiSource/

Benchmarks/

DOE-ProxyApps-C++/

CMakeLists.txt

1 line

HACCKernels/

CMakeLists.txt

5 lines

COPYING

50 lines

GravityForceKernel.cpp

165 lines

HACCKernels.h

93 lines

HACCKernels.reference_output

5 lines

7 lines

59 lines

193 lines

2 lines

Diff 122106

LICENSE.TXT

Context not available.
	HPCCG: llvm-test/MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG	HPCCG: llvm-test/MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG
	PENNANT: llvm-test/MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT	PENNANT: llvm-test/MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT
	miniFE: llvm-test/MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE	miniFE: llvm-test/MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE
		HACCKernels llvm-test/MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels
	Fhourstones: llvm-test/MultiSource/Benchmarks/Fhourstones	Fhourstones: llvm-test/MultiSource/Benchmarks/Fhourstones
	Fhourstones-3.1: llvm-test/MultiSource/Benchmarks/Fhourstones-3.1	Fhourstones-3.1: llvm-test/MultiSource/Benchmarks/Fhourstones-3.1
	McCat: llvm-test/MultiSource/Benchmarks/McCat	McCat: llvm-test/MultiSource/Benchmarks/McCat
Context not available.

MultiSource/Benchmarks/DOE-ProxyApps-C++/CMakeLists.txt

Context not available.
	add_subdirectory(PENNANT)	add_subdirectory(PENNANT)
	add_subdirectory(miniFE)	add_subdirectory(miniFE)
	add_subdirectory(CLAMR)	add_subdirectory(CLAMR)
		add_subdirectory(HACCKernels)
Context not available.

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/CMakeLists.txt

This file was added.

				set(PROG HACCKernels)
				set(FP_TOLERANCE 0.00001)
				list(APPEND CPPFLAGS -ffast-math -DVERIFICATION_OUTPUT_ONLY=ON)
				set(RUN_OPTIONS 450)
				llvm_multisource()

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/COPYING

This file was added.

				Copyright (C) 2017, UChicago Argonne, LLC
				All Rights Reserved

				Hardware/Hybrid Cosmology Code (HACC), Version 1.0

				Salman Habib, Adrian Pope, Hal Finkel, Nicholas Frontiere, Katrin Heitmann,
				Vitali Morozov, Jeffrey Emberson, Thomas Uram, Esteban Rangel
				(Argonne National Laboratory)

				David Daniel, Patricia Fasel, Chung-Hsing Hsu, Zarija Lukic, James Ahrens
				(Los Alamos National Laboratory)

				George Zagaris
				(Kitware)

				OPEN SOURCE LICENSE

				Redistribution and use in source and binary forms, with or without
				modification, are permitted provided that the following conditions are met:

				1. Redistributions of source code must retain the above copyright notice,
				this list of conditions and the following disclaimer. Software changes,
				modifications, or derivative works, should be noted with comments and the
				author and organization’s name.

				2. Redistributions in binary form must reproduce the above copyright notice,
				this list of conditions and the following disclaimer in the documentation
				and/or other materials provided with the distribution.

				3. Neither the names of UChicago Argonne, LLC or the Department of Energy nor
				the names of its contributors may be used to endorse or promote products
				derived from this software without specific prior written permission.

				4. The software and the end-user documentation included with the
				redistribution, if any, must include the following acknowledgment:

				"This product includes software produced by UChicago Argonne, LLC under
				Contract No. DE-AC02-06CH11357 with the Department of Energy."

				********************************************************************************
				DISCLAIMER
				THE SOFTWARE IS SUPPLIED "AS IS" WITHOUT WARRANTY OF ANY KIND. NEITHER THE
				UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT OF ENERGY, NOR
				UCHICAGO ARGONNE, LLC, NOR ANY OF THEIR EMPLOYEES, MAKES ANY WARRANTY, EXPRESS
				OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURARY,
				COMPLETENESS, OR USEFULNESS OF ANY INFORMATION, DATA, APPARATUS, PRODUCT, OR
				PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WOULD NOT INFRINGE PRIVATELY
				OWNED RIGHTS.

				********************************************************************************

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/GravityForceKernel.cpp

This file was added.

				/*
				* Copyright (C) 2017, UChicago Argonne, LLC
				* All Rights Reserved
				*
				* Hardware/Hybrid Cosmology Code (HACC), Version 1.0
				*
				* Salman Habib, Adrian Pope, Hal Finkel, Nicholas Frontiere, Katrin Heitmann,
				* Vitali Morozov, Jeffrey Emberson, Thomas Uram, Esteban Rangel
				* (Argonne National Laboratory)
				*
				* David Daniel, Patricia Fasel, Chung-Hsing Hsu, Zarija Lukic, James Ahrens
				* (Los Alamos National Laboratory)
				*
				* George Zagaris
				* (Kitware)
				*
				* OPEN SOURCE LICENSE
				*
				* Redistribution and use in source and binary forms, with or without
				* modification, are permitted provided that the following conditions are met:
				*
				* 1. Redistributions of source code must retain the above copyright notice,
				* this list of conditions and the following disclaimer. Software changes,
				* modifications, or derivative works, should be noted with comments and
				* the author and organization’s name.
				*
				* 2. Redistributions in binary form must reproduce the above copyright
				* notice, this list of conditions and the following disclaimer in the
				* documentation and/or other materials provided with the distribution.
				*
				* 3. Neither the names of UChicago Argonne, LLC or the Department of Energy
				* nor the names of its contributors may be used to endorse or promote
				* products derived from this software without specific prior written
				* permission.
				*
				* 4. The software and the end-user documentation included with the
				* redistribution, if any, must include the following acknowledgment:
				*
				* "This product includes software produced by UChicago Argonne, LLC under
				* Contract No. DE-AC02-06CH11357 with the Department of Energy."
				*
				* *****************************************************************************
				* DISCLAIMER
				* THE SOFTWARE IS SUPPLIED "AS IS" WITHOUT WARRANTY OF ANY KIND. NEITHER THE
				* UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT OF ENERGY, NOR
				* UCHICAGO ARGONNE, LLC, NOR ANY OF THEIR EMPLOYEES, MAKES ANY WARRANTY,
				* EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE
				* ACCURARY, COMPLETENESS, OR USEFULNESS OF ANY INFORMATION, DATA, APPARATUS,
				* PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WOULD NOT INFRINGE
				* PRIVATELY OWNED RIGHTS.
				*
				* *****************************************************************************
				*/

				#include "HACCKernels.h"
				#include <cmath>

				extern const float PolyCoefficients4[] = {
				0.263729f, -0.0686285f, 0.00882248f, -0.000592487f, 0.0000164622f
				};

				extern const float PolyCoefficients5[] = {
				0.269327f, -0.0750978f, 0.0114808f, -0.00109313f, 0.0000605491f,
				-0.00000147177f
				};

				extern const float PolyCoefficients6[] = {
				0.271431f, -0.0783394f, 0.0133122f, -0.00159485f, 0.000132336f,
				-0.00000663394f, 0.000000147305f
				};

				// HACC's gravity short-range-force kernel represents the part of the 1/r^2
				// gravitational force that is not computed by the long-range grid solver. This
				// kernel computes the acceleration of a target particle from all of the other
				// particles in the provided interaction lists. It is assumed that the target
				// particle has unit mass while the interaction-list can contain pseudo
				// particles with larger mass values. Beyond a distance of MaxSep, the
				// inter-particle force should be completely accounted for by the long-range
				// grid solver (and thus we filter out such interactions here). Closer than
				// MaxSep, we directly compute the inter-particle force, subtracting the
				// long-range part of the force (as fit to a polynomial of the specified
				// degree). A softening length, SofteningLen, is also used, as is standard in
				// N-body codes.

				template <int PolyOrder, const float (&PolyCoefficients)[PolyOrder+1]>
				static void GravityForceKernel(int n, float RESTRICT x, float RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az) {
				float lax = 0.0f, lay = 0.0f, laz = 0.0f;

				// As written below, the mass array is conditionally accessed (i.e. accessed
				// only if the interaction is not filtered by the distance checks). This will
				// tend to inhibit vectorization on architectures without masked vector loads.
				// With OpenMP 4+, we can explicitly inform the compiler that vectorization is
				// safe.
				//
				// For the test suite: Clang does not report a high-enough version of OpenMP
				// to enable the pragma below. Moreover, vectorization is desirable regardless
				hfinkelUnsubmitted Not Done Reply Inline Actions I suggest wording this as follows: // For the test suite: Clang does not report a high-enough version of OpenMP to enable the pragma below. Moreover, vectorization is desirable regardless of whether OpenMP is enabled (even if Clang's reported version were high enough), so we also use the Clang loop pragma to assume vectorization safety. hfinkel: I suggest wording this as follows: // For the test suite: Clang does not report a high…
				// of whether OpenMP is enabled (even if Clang's reported version were high
				hfinkelUnsubmitted Not Done Reply Inline Actions As Brian and I discussed offline, I've suggested that we update this to read: #if _OPENMP >= 201307 #pragma omp simd reduction(+:lax,lay,laz) #elsif defined(clang) #pragma clang loop vectorize(assume_safety) #endif So that we'll vectorize the loop. Looking at this again today, it seems like the problem blocking loop vectorization here is just that we don't self report, by default, a new enough OpenMP version. I suspect that bumping that version is blocked on unrelated things. hfinkel: As Brian and I discussed offline, I've suggested that we update this to read: #if _OPENMP >=…
				// enough), so we also use the Clang loop pragma to assume vectorization safety.
				#if _OPENMP >= 201307
				#pragma omp simd reduction(+:lax,lay,laz)
				#elif defined(clang)
				#pragma clang loop vectorize(assume_safety)
				#endif
				for (int i = 0; i < n; ++i) {
				float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;
				float r2 = dx * dx + dy * dy + dz * dz;

				if (r2 >= MaxSepSqrd \|\| r2 == 0.0f)
				continue;

				float r2s = r2 + SofteningLenSqrd;
				float f = PolyCoefficients[PolyOrder];
				for (int p = 1; p <= PolyOrder; ++p)
				f = PolyCoefficients[PolyOrder-p] + r2*f;

				f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];

				lax += f * dx;
				lay += f * dy;
				laz += f * dz;
				}

				ax += lax;
				ay += lay;
				az += laz;
				}

				void GravityForceKernel4(int n, float RESTRICT x, float RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az) {
				GravityForceKernel<4, PolyCoefficients4>(n, x, y, z, mass, x0, y0, z0,
				MaxSepSqrd, SofteningLenSqrd,
				ax, ay, az);
				}

				void GravityForceKernel5(int n, float RESTRICT x, float RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az) {
				GravityForceKernel<5, PolyCoefficients5>(n, x, y, z, mass, x0, y0, z0,
				MaxSepSqrd, SofteningLenSqrd,
				ax, ay, az);
				}

				void GravityForceKernel6(int n, float RESTRICT x, float RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az) {
				GravityForceKernel<6, PolyCoefficients6>(n, x, y, z, mass, x0, y0, z0,
				MaxSepSqrd, SofteningLenSqrd,
				ax, ay, az);
				}

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.h

This file was added.

				/*
				* Copyright (C) 2017, UChicago Argonne, LLC
				* All Rights Reserved
				*
				* Hardware/Hybrid Cosmology Code (HACC), Version 1.0
				*
				* Salman Habib, Adrian Pope, Hal Finkel, Nicholas Frontiere, Katrin Heitmann,
				* Vitali Morozov, Jeffrey Emberson, Thomas Uram, Esteban Rangel
				* (Argonne National Laboratory)
				*
				* David Daniel, Patricia Fasel, Chung-Hsing Hsu, Zarija Lukic, James Ahrens
				* (Los Alamos National Laboratory)
				*
				* George Zagaris
				* (Kitware)
				*
				* OPEN SOURCE LICENSE
				*
				* Redistribution and use in source and binary forms, with or without
				* modification, are permitted provided that the following conditions are met:
				*
				* 1. Redistributions of source code must retain the above copyright notice,
				* this list of conditions and the following disclaimer. Software changes,
				* modifications, or derivative works, should be noted with comments and
				* the author and organization’s name.
				*
				* 2. Redistributions in binary form must reproduce the above copyright
				* notice, this list of conditions and the following disclaimer in the
				* documentation and/or other materials provided with the distribution.
				*
				* 3. Neither the names of UChicago Argonne, LLC or the Department of Energy
				* nor the names of its contributors may be used to endorse or promote
				* products derived from this software without specific prior written
				* permission.
				*
				* 4. The software and the end-user documentation included with the
				* redistribution, if any, must include the following acknowledgment:
				*
				* "This product includes software produced by UChicago Argonne, LLC under
				* Contract No. DE-AC02-06CH11357 with the Department of Energy."
				*
				* *****************************************************************************
				* DISCLAIMER
				* THE SOFTWARE IS SUPPLIED "AS IS" WITHOUT WARRANTY OF ANY KIND. NEITHER THE
				* UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT OF ENERGY, NOR
				* UCHICAGO ARGONNE, LLC, NOR ANY OF THEIR EMPLOYEES, MAKES ANY WARRANTY,
				* EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE
				* ACCURARY, COMPLETENESS, OR USEFULNESS OF ANY INFORMATION, DATA, APPARATUS,
				* PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WOULD NOT INFRINGE
				* PRIVATELY OWNED RIGHTS.
				*
				* *****************************************************************************
				*/

				#ifndef RESTRICT
				#if defined(__GNUC__) \|\| defined(__clang__)
				#define RESTRICT __restrict__
				#elif defined(_MSC_VER)
				#define RESTRICT __restrict
				#else
				#define RESTRICT /* empty */
				#endif
				#endif

				typedef void
				(GravityForceKernelFunc)(int n, float RESTRICT x, float *RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az);

				void GravityForceKernel4(int n, float RESTRICT x, float RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az);

				void GravityForceKernel5(int n, float RESTRICT x, float RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az);

				void GravityForceKernel6(int n, float RESTRICT x, float RESTRICT y,
				float RESTRICT z, float RESTRICT mass,
				float x0, float y0, float z0,
				float MaxSepSqrd, float SofteningLenSqrd,
				float &RESTRICT ax, float &RESTRICT ay,
				float &RESTRICT az);

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.reference_output

This file was added.

				Iterations: 450
				Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.584 -2378.97
				Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.281 -2378.1
				Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1
				exit 0

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile

This file was added.

				LEVEL = ../../../..

				PROG = HACCKernels
				FP_TOLERANCE = 0.00001
				spatelUnsubmitted Not Done Reply Inline Actions That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle acceptable for this program? How high can we go before we decide the result is bogus? You can reproduce this locally if you have at least a Sandybridge to test on. Ie, if you specify -march=nehalem (or nothing), you should see "689.584" in the output, but if you specify -march=sandybridge, you should see "689.585". I don't know what underlying transforms cause that difference, but that seems like a reasonable error for -ffast-math. spatel: That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle…
				hfinkelUnsubmitted Not Done Reply Inline Actions That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle acceptable for this program? How high can we go before we decide the result is bogus? Why not? It's a relative tolerance. (We have FP_ABSTOLERANCE for setting the absolute tolerance). In any case, the changes generally come from the inverse sqrt approximation. I think we should set this to not much more than needed for observed divergence. Brian, were you able to confirm that this tolerance is sufficient? hfinkel: > That's not big enough? We're seeing a 0.001 difference in the output string. Is that wiggle…
				spatelUnsubmitted Not Done Reply Inline Actions Ah, sorry - I missed that this was relative. In that case, it should be ok. spatel: Ah, sorry - I missed that this was relative. In that case, it should be ok.
				homerdinAuthorUnsubmitted Not Done Reply Inline Actions Yes, I was able to reproduce the original issue and this tolerance level is sufficient. (689.585 / 689.584) - 1 = .00000145 -- Testing: 1 tests, 1 threads -- PASS: test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test (1 of 1) ******** TEST 'test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test' RESULTS ****** compile_time: 0.7596 exec_time: 0.4761 hash: "8676d81cf540659eb9466f24e6275c19" link_time: 0.0196 ****** Testing Time: 0.61s Expected Passes : 1 [bhomerding@thing03 HACCKernels]$ less Output/HACCKernels.test.out Iterations: 450 Gravity Short-Range-Force Kernel (4th Order): 34376.3 689.585 -2378.97 Gravity Short-Range-Force Kernel (5th Order): 34361.8 689.282 -2378.1 Gravity Short-Range-Force Kernel (6th Order): 34360.9 689.252 -2378.1 exit 0 homerdin:** Yes, I was able to reproduce the original issue and this tolerance level is sufficient. (689.
				CXXFLAGS = -ffast-math -DVERIFICATION_OUTPUT_ONLY=ON
				RUN_OPTIONS = 450
				include $(LEVEL)/MultiSource/Makefile.multisrc

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/README

This file was added.

				CCKernels: A Benchmark for HACC's Particle Force Kernels

				The Hardware/Hybrid Accelerated Cosmology Code (HACC), a cosmology N-body-code
				framework, is designed to run efficiently on diverse computing architectures
				and to scale to millions of cores and beyond. The gravitational force is the
				only significant force between particles at cosmological scales, and, in HACC,
				this force is divided into two components: a long-range component and a
				short-range component. The long-range component is handled using a distributed
				grid-based solver, and the short-range component is by more-direct
				particle-particle computations. On many systems, a tree-based multipole
				approximation is used to further reduce the computational complexity of the
				short-range force. The inner-most computation is a direct N^2 particle-particle
				force calculation of the short-range part of the gravitational force. It is this
				inner-most calculation that consumes most of the simulation time, is
				computationally bound, and is what is represented by this benchmark.

				Because this inner-most force calculation is algorithmically isolated from the
				overall scale of the problem, the parameters don't need to be adjusted to
				represent the workload on different machine scales (e.g. petascale or
				exascale).

				For more information on HACC, see:

				Salman Habib, et al. HACC: Simulating Sky Surveys on State-of-the-Art
				Supercomputing Architectures. New Astronomy Volume 42, January 2016, pp. 49-65.
				http://doi.org/10.1016/j.newast.2015.06.003
				https://arxiv.org/abs/1410.2805

				The benchmark can be compiled using cmake (or make directly using
				Makefile.simple) and then run like this:

				$ ./HACCKernels
				Maximum OpenMP Threads: 1
				Iterations: 2000
				Gravity Short-Range-Force Kernel (4th Order): 26307.2 -122.385 -1369.32: 4.45269 s
				Gravity Short-Range-Force Kernel (5th Order): 26297.5 -123.056 -1368.67: 4.51347 s
				Gravity Short-Range-Force Kernel (6th Order): 26297.6 -123.225 -1368.66: 4.8256 s

				The accumulated acceleration in each direction for all particles in the last
				iteration, which is a function of the total number of iterations, is printed as
				a diagnostic. It should be similar for all polynomial kernel orders.

				If you'd like the benchmark only to display deterministic output (i.e.
				omitting information on the number of threads, timing, and the like), then
				define the preprocessor symbol VERIFICATION_OUTPUT_ONLY when compiling.
				You can enable this option when configuring by passing
				-DVERIFICATION_OUTPUT_ONLY=ON to cmake.

				Compared to the older HACCmk procurement benchmark
				(https://asc.llnl.gov/CORAL-benchmarks/#haccmk), this benchmark:

				* More closely matches the parallelization scheme used by the production code.
				* Uses a more-realistic distribution of interaction-list lengths and
				out-of-bounds particles.
				* Includes 4th-, 5th-, and 6th-order kernels.

				For more information, contact: Hal Finkel <hfinkel@anl.gov>

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/main.cpp

This file was added.

				/*
				* Copyright (C) 2017, UChicago Argonne, LLC
				* All Rights Reserved
				*
				* Hardware/Hybrid Cosmology Code (HACC), Version 1.0
				*
				* Salman Habib, Adrian Pope, Hal Finkel, Nicholas Frontiere, Katrin Heitmann,
				* Vitali Morozov, Jeffrey Emberson, Thomas Uram, Esteban Rangel
				* (Argonne National Laboratory)
				*
				* David Daniel, Patricia Fasel, Chung-Hsing Hsu, Zarija Lukic, James Ahrens
				* (Los Alamos National Laboratory)
				*
				* George Zagaris
				* (Kitware)
				*
				* OPEN SOURCE LICENSE
				*
				* Redistribution and use in source and binary forms, with or without
				* modification, are permitted provided that the following conditions are met:
				*
				* 1. Redistributions of source code must retain the above copyright notice,
				* this list of conditions and the following disclaimer. Software changes,
				* modifications, or derivative works, should be noted with comments and
				* the author and organization’s name.
				*
				* 2. Redistributions in binary form must reproduce the above copyright
				* notice, this list of conditions and the following disclaimer in the
				* documentation and/or other materials provided with the distribution.
				*
				* 3. Neither the names of UChicago Argonne, LLC or the Department of Energy
				* nor the names of its contributors may be used to endorse or promote
				* products derived from this software without specific prior written
				* permission.
				*
				* 4. The software and the end-user documentation included with the
				* redistribution, if any, must include the following acknowledgment:
				*
				* "This product includes software produced by UChicago Argonne, LLC under
				* Contract No. DE-AC02-06CH11357 with the Department of Energy."
				*
				* *****************************************************************************
				* DISCLAIMER
				* THE SOFTWARE IS SUPPLIED "AS IS" WITHOUT WARRANTY OF ANY KIND. NEITHER THE
				* UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT OF ENERGY, NOR
				* UCHICAGO ARGONNE, LLC, NOR ANY OF THEIR EMPLOYEES, MAKES ANY WARRANTY,
				* EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE
				* ACCURARY, COMPLETENESS, OR USEFULNESS OF ANY INFORMATION, DATA, APPARATUS,
				* PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WOULD NOT INFRINGE
				* PRIVATELY OWNED RIGHTS.
				*
				* *****************************************************************************
				*/

				#include "HACCKernels.h"
				#include <ctime>
				#include <cstdlib>
				#include <limits>
				#include <vector>
				#include <iostream>

				#ifdef _OPENMP
				#include <omp.h>
				#endif

				// This number of iterations, which can be changed via the command line, is set
				// so that the benchmark will run for a few seconds per polynomial degree of
				// the force kernel on a single core of a modern CPU.
				int NumIters = 2000;

				// The interaction lists range is size between a few hundred and a few thousand.
				int IListMin = 250;
				int IListMax = 2250;

				// The number of particles to update, which represents the number of particles
				// per leaf node of the force evaluation tree in HACC, varies between tends of
				// particles to around a hundred particles depending on the platform. These
				// numbers represent the high side of the production range.
				int PMin = 75;
				int PMax = 150;

				// The softening length and maximum separation similar to those used in HACC
				// high-resolution configurations.
				float SofteningLen = 0.1;
				float MaxSep = 3.2;

				// In this benchmark we offset the positions of the particles being updated
				// from the particles in the interaction list so that some of the interactions
				// will be filtered for being out of range. 0.1 yields ~5% of interactions
				// filtered for being out of range.
				float OffsetAdjFrac = 0.1;

				// A simple random-number generator, see: https://en.wikipedia.org/wiki/Xorshift
				static unsigned int rand32(unsigned int &state) {
				unsigned int x = state;
				x ^= x << 13;
				x ^= x >> 17;
				x ^= x << 5;
				return (state = x);
				}

				static float randflt(unsigned int &state) {
				return ((float) rand32(state)) / ((float) 0xffffffff);
				}

				void run(GravityForceKernelFunc GravityForceKernel, const char *Desc) {
				#ifndef VERIFICATION_OUTPUT_ONLY
				std::clock_t Start, End;
				#endif

				std::cout << "Gravity Short-Range-Force Kernel (" << Desc << "): ";
				#ifndef VERIFICATION_OUTPUT_ONLY
				Start = std::clock();
				#endif

				float ax, ay, az;

				// We use lastprivate for (ax,ay,az) so that the reported output, which can
				// be used for validation, does not depend on the order in which parallel
				// loop iterations are executed.

				// Because each iteration has a different amount of work, dynamic or guided
				// scheduling is used here. guided gives the implementation more scheduling
				// freedom.
				#ifdef _OPENMP
				#pragma omp parallel for schedule(guided) lastprivate(ax,ay,az)
				#endif
				for (int i = 0; i < NumIters; ++i) {
				// Set the random seed used by each iteration to be a function of the
				// iteration number only. This allows information from any fixed iteration
				// (e.g. first or last) to be used for numerical validation.
				unsigned int seed = i+1;
				ax = ay = az = 0.0f;

				int ILParticleCount = IListMin + rand32(seed) % (IListMax - IListMin);
				int ParticleCount = PMin + rand32(seed) % (PMax - PMin);
				std::vector<float> px(ParticleCount), py(ParticleCount),
				pz(ParticleCount);
				std::vector<float> x(ILParticleCount), y(ILParticleCount),
				z(ILParticleCount), mass(ILParticleCount);

				// Fill the particle arrays and the interaction list. The interaction-list
				// particles are offset in the x direction based on OffsetAdjFrac.
				for (int j = 0; j < ParticleCount; ++j) {
				px[j] = randflt(seed)0.5MaxSep;
				py[j] = randflt(seed)0.5MaxSep;
				pz[j] = randflt(seed)0.5MaxSep;
				}

				for (int j = 0; j < ILParticleCount; ++j) {
				x[j] = randflt(seed)0.5MaxSep + (0.5+OffsetAdjFrac)*MaxSep;
				y[j] = randflt(seed)0.5MaxSep;
				z[j] = randflt(seed)0.5MaxSep;
				mass[j] = 1.0f + randflt(seed);
				}


				for (int j = 0; j < ParticleCount; ++j)
				GravityForceKernel(ILParticleCount, &x[0], &y[0], &z[0], &mass[0],
				px[j], py[j], pz[j], MaxSep*MaxSep,
				SofteningLen*SofteningLen, ax, ay, az);
				}

				#ifndef VERIFICATION_OUTPUT_ONLY
				End = std::clock();
				#endif

				std::cout << ax << " " << ay << " " << az;

				#ifndef VERIFICATION_OUTPUT_ONLY
				std::cout << ": ";
				std::cout << ((float)(End - Start))/CLOCKS_PER_SEC << " s\n";
				#else
				std::cout << "\n";
				#endif
				}

				int main(int argc, char *argv[]) {
				#if defined(_OPENMP) && !defined(VERIFICATION_OUTPUT_ONLY)
				std::cout << "Maximum OpenMP Threads: " << omp_get_max_threads() << "\n";
				#endif

				if (argc > 1)
				NumIters = atoi(argv[1]);
				std::cout << "Iterations: " << NumIters << "\n";

				run(GravityForceKernel4, "4th Order");
				run(GravityForceKernel5, "5th Order");
				run(GravityForceKernel6, "6th Order");

				return 0;
				}

MultiSource/Benchmarks/DOE-ProxyApps-C++/Makefile

	# MultiSource/DOE-ProxyApps-C++ Makefile: Build all subdirectories automatically			# MultiSource/DOE-ProxyApps-C++ Makefile: Build all subdirectories automatically

	LEVEL = ../../..			LEVEL = ../../..
	PARALLEL_DIRS = HPCCG PENNANT miniFE CLAMR			PARALLEL_DIRS = HPCCG PENNANT miniFE CLAMR HACCKernels

	include $(LEVEL)/Makefile.programs			include $(LEVEL)/Makefile.programs

This is an archive of the discontinued LLVM Phabricator instance.

[test-suite] Adding HACCKernels appClosedPublic

Details

Description:

Link:

When run on Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.2GHz:

Diff Detail

Event Timeline

Revision Contents

Diff 122106

LICENSE.TXT

MultiSource/Benchmarks/DOE-ProxyApps-C++/CMakeLists.txt

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/CMakeLists.txt

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/COPYING

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/GravityForceKernel.cpp

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.h

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.reference_output

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/Makefile

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/README

MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/main.cpp

MultiSource/Benchmarks/DOE-ProxyApps-C++/Makefile

[test-suite] Adding HACCKernels app
ClosedPublic