This is an archive of the discontinued LLVM Phabricator instance.

Differential D40358

Use hyperbarrier by default on all architectures
ClosedPublic

Authored by Hahnfeld on Nov 22 2017, 7:42 AM.

Details

Reviewers

AndreyChurbanov
jlpeyton
hbae
hfinkel

Commits

rGe628ab4c65aa: Use hyperbarrier by default on all architectures
rL320152: Use hyperbarrier by default on all architectures
rOMP320152: Use hyperbarrier by default on all architectures

Summary

All architectures except x86_64 used the linear barrier implementation
by default which doesn't give good performance for a larger number
of threads.

Improvements for PARALLEL overhead (EPCC) with this patch on a Power8
system (2 sockets x 10 cores x 8 threads, OMP_PLACES=cores)

 20 threads:  4.55us -> 3.49us
 40 threads:  8.84us -> 4.06us
 80 threads: 19.18us -> 4.74us
160 threads: 54.22us -> 6.73us
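
The scaling trend in these numbers is easier to see as speedup factors. The following is a minimal sketch that only re-derives ratios from the four measurements above; the `Measurement` struct and the function names are illustrative and not part of the patch or the runtime.

```cpp
#include <cassert>
#include <cstdio>

// One row of the measurements above: thread count and PARALLEL overhead
// in microseconds before the patch (linear barrier default on Power)
// and after it (hyper barrier default).
struct Measurement {
  int threads;
  double before_us;
  double after_us;
};

// Values copied from the summary; nothing here is newly measured.
const Measurement kPower8Results[] = {
    {20, 4.55, 3.49},
    {40, 8.84, 4.06},
    {80, 19.18, 4.74},
    {160, 54.22, 6.73},
};

// Ratio of old overhead to new overhead (hypothetical helper).
double speedup(const Measurement &m) {
  assert(m.after_us > 0.0);
  return m.before_us / m.after_us;
}

// Prints one line per thread count with the derived ratio.
void print_speedups() {
  for (const Measurement &m : kPower8Results)
    std::printf("%3d threads: %.2fx\n", m.threads, speedup(m));
}
```

Calling `print_speedups()` shows the ratio growing with the thread count, which is consistent with the claim that the linear barrier scales poorly for larger teams.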

Diff Detail

Repository: rOMP OpenMP

Event Timeline

Hahnfeld created this revision. Nov 22 2017, 7:42 AM

The idea was that 32-bit machines will probably have a small number of cores (2, or 4, or ...), in which case the hyper barrier can have a higher overhead. Can you check whether 2 or 4 threads run faster with the hyper barrier than with the linear one? If not, then maybe the condition could be fixed in a different way, e.g. by adding the Power arch to x86_64 and leaving the linear barrier for 32-bit archs.

BTW, the comments "hyper2: C78980" could be safely removed I think. This is some very old info that says nothing nowadays (at least to me:).

In D40358#932906, @AndreyChurbanov wrote:

> The idea was that 32-bit machines will probably have a small number of cores (2, or 4, or ...), in which case the hyper barrier can have a higher overhead. Can you check whether 2 or 4 threads run faster with the hyper barrier than with the linear one? If not, then maybe the condition could be fixed in a different way, e.g. by adding the Power arch to x86_64 and leaving the linear barrier for 32-bit archs.

I might be seeing a slightly better average with the linear barrier for 2 threads (1 percent?), but a higher standard deviation - not really sure about this.
The hyper barrier clearly wins for 4 threads by about 5 percent and naturally for all higher thread counts.
(Tested on the same Power system.)

So in theory, the hyper barrier collapses to a linear barrier for all thread counts less than 5 because we have a branch factor of 4, right? Obviously with a higher overhead because of the more complex code, but the synchronization pattern (which thread waits for which child) remains the same...
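
The reasoning above can be illustrated with a small model of a hypercube-embedded gather tree. This is a simplified sketch under stated assumptions, not the runtime's barrier code: `hyper_parent` and `hyper_levels` are hypothetical helpers, and the only input taken from this review is the default of 2 branch bits (branch factor 4).

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical model: the thread that `tid` reports to during the gather
// phase of a hypercube-style barrier, or -1 for the root (thread 0).
// A thread reports at the first level where its base-(2^branch_bits)
// digit is nonzero; its parent has that digit and all lower ones cleared.
int hyper_parent(int tid, int num_threads, int branch_bits) {
  assert(branch_bits > 0 && tid >= 0 && tid < num_threads);
  const int branch_factor = 1 << branch_bits;
  for (int level = 0, offset = 1; offset < num_threads;
       level += branch_bits, offset <<= branch_bits) {
    if (((tid >> level) & (branch_factor - 1)) != 0)
      return tid & ~((1 << (level + branch_bits)) - 1);
  }
  return -1;
}

// Number of tree levels the root has to walk through in this model.
int hyper_levels(int num_threads, int branch_bits) {
  int levels = 0;
  for (int offset = 1; offset < num_threads; offset <<= branch_bits)
    ++levels;
  return levels;
}

// Prints the parent of every thread for a given team size.
void print_gather_tree(int num_threads, int branch_bits) {
  std::printf("%d threads, %d level(s)\n", num_threads,
              hyper_levels(num_threads, branch_bits));
  for (int tid = 1; tid < num_threads; ++tid)
    std::printf("  thread %d -> thread %d\n", tid,
                hyper_parent(tid, num_threads, branch_bits));
}
```

In this model, `print_gather_tree(4, 2)` has every worker reporting directly to thread 0 in a single level, which is the same shape as a linear gather. With 5 threads a second level appears, and at 160 threads the model needs 4 levels, whereas a linear gather has one thread collecting from the other 159.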

> BTW, the comments "hyper2: C78980" could be safely removed I think. This is some very old info that says nothing nowadays (at least to me:).

Ok, will do once we have agreed on the general direction of this. (I always thought these were references to an internal bug tracker? There are more references in kmp_atomic.cpp)

LGTM

If some architecture sees a slowdown (unlikely) because of this change, we can restore the linear barrier default for it.

This revision is now accepted and ready to land. Nov 24 2017, 8:04 AM

Adding D40722 as a dependence because the improved barrier made it more likely to hit the problem fixed there.

Closed by commit rOMP320152: Use hyperbarrier by default on all architectures (authored by Hahnfeld). Dec 8 2017, 7:07 AM

This revision was automatically updated to reflect the committed changes.

I need to contact whoever introduced the KMP_REVERSE_HYPER_BAR implementation (unfortunately, it was added in a huge collective commit years ago, on Tue Oct 7 16:25:50 2014).
I have a piece of proprietary code that stopped failing on AArch64 after I reverted D40358. Digging deeper, I found out that the problem is not with hyperbarriers per se: when I undefined KMP_REVERSE_HYPER_BAR, the problem went away. I also found that the problem can be reproduced on x86_64 too, only less frequently (once per "for i in `seq 1 1000`; do" execution). Since I would have to spend some time to figure out the logic behind it and verify its validity, in the meantime, can someone familiar with this code have a look at it and confirm that the reverse implementation is sane?

Paul,

Can you share more info on the problem on x86_64? What kind of failure do you see? What are the OS and HW? Can you share a reproducer?

The code you are talking about is quite ancient. I doubt anybody can guarantee its correctness, but we haven't seen any problems in many years of testing. I cannot say anything about AArch64, but if we could reproduce the problem on x86_64, we'd be happy to investigate it.

Thanks,
Andrey

There is not much I'm able to say. The error is a malformed integer value resulting in an incorrect computation result, more often on AArch64, less often on x86_64, with both architectures running Linux. The code does something unusual, so I'm not surprised no one has faced this before. Also keep in mind that I would never have found it if hyperbarriers hadn't been introduced for all architectures.

Revision Contents

Path	Size
runtime/src/kmp_global.cpp	21 lines

Diff 126146

runtime/src/kmp_global.cpp

	 #endif
	 size_t __kmp_stkoffset = KMP_DEFAULT_STKOFFSET;
	 int __kmp_stkpadding = KMP_MIN_STKPADDING;

	 size_t __kmp_malloc_pool_incr = KMP_DEFAULT_MALLOC_POOL_INCR;

	 // Barrier method defaults, settings, and strings.
	 // branch factor = 2^branch_bits (only relevant for tree & hyper barrier types)
	-#if KMP_ARCH_X86_64
	 kmp_uint32 __kmp_barrier_gather_bb_dflt = 2;
	 /* branch_factor = 4 */ /* hyper2: C78980 */
	 kmp_uint32 __kmp_barrier_release_bb_dflt = 2;
	 /* branch_factor = 4 */ /* hyper2: C78980 */
	-#else
	-kmp_uint32 __kmp_barrier_gather_bb_dflt = 2;
	-/* branch_factor = 4 */ /* communication in core for MIC */
	-kmp_uint32 __kmp_barrier_release_bb_dflt = 2;
	-/* branch_factor = 4 */ /* communication in core for MIC */
	-#endif // KMP_ARCH_X86_64
	-#if KMP_ARCH_X86_64
	-kmp_bar_pat_e __kmp_barrier_gather_pat_dflt = bp_hyper_bar; /* hyper2: C78980 */
	-kmp_bar_pat_e __kmp_barrier_release_pat_dflt =
	-    bp_hyper_bar; /* hyper2: C78980 */
	-#else
	-kmp_bar_pat_e __kmp_barrier_gather_pat_dflt = bp_linear_bar;
	-kmp_bar_pat_e __kmp_barrier_release_pat_dflt = bp_linear_bar;
	-#endif
	+kmp_bar_pat_e __kmp_barrier_gather_pat_dflt = bp_hyper_bar;
	+/* hyper2: C78980 */
	+kmp_bar_pat_e __kmp_barrier_release_pat_dflt = bp_hyper_bar;
	+/* hyper2: C78980 */
	 kmp_uint32 __kmp_barrier_gather_branch_bits[bs_last_barrier] = {0};
	 kmp_uint32 __kmp_barrier_release_branch_bits[bs_last_barrier] = {0};
	 kmp_bar_pat_e __kmp_barrier_gather_pattern[bs_last_barrier] = {bp_linear_bar};
	 kmp_bar_pat_e __kmp_barrier_release_pattern[bs_last_barrier] = {bp_linear_bar};
	 char const *__kmp_barrier_branch_bit_env_name[bs_last_barrier] = {
	 "KMP_PLAIN_BARRIER", "KMP_FORKJOIN_BARRIER"
	 #if KMP_FAST_REDUCTION_BARRIER
	 ,