This is an archive of the discontinued LLVM Phabricator instance.

Differential D77603

[OpenMP] Sync writes to child thread's data before reduction
ClosedPublic

Authored by bryanpkc on Apr 6 2020, 3:48 PM.

Details

Reviewers

Hahnfeld
jlpeyton
jdoerfert
AndreyChurbanov

Commits

rGb86ff5f6efbe: [OpenMP] Sync writes to child thread's data before reduction

Summary

On systems with weak memory consistency, this patch fixes an intermittent crash
in the reduction function called by __kmp_hyper_barrier_gather, which suffers
from a race on a child thread's data.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bryanpkc created this revision. Apr 6 2020, 3:48 PM

Herald added a project: Restricted Project. Apr 6 2020, 3:48 PM

Herald added subscribers: openmp-commits, guansong, yaxunl.

Harbormaster failed remote builds in B52065: Diff 255523! Apr 6 2020, 4:55 PM

I don't see a paired memory barrier in the child thread between the assignment of th.th_local.reduce_data in __kmp_barrier_template() and the release of the b_arrived barrier flag, which frees the parent to go on and reduce the data. So the patch might only make the problem less probable. Should a paired KMP_MB be added after the reduce_data assignment? Or does the atomic release of the flag serve as a memory barrier? If it does, my assumption is wrong and the second MB is not needed.

In D77603#1970402, @AndreyChurbanov wrote:

I don't see a paired memory barrier in the child thread between the assignment of th.th_local.reduce_data in __kmp_barrier_template() and the release of the b_arrived barrier flag, which frees the parent to go on and reduce the data.

Thanks for the review. You are right, the atomic release does not serve as a memory barrier, and I should add a paired memory barrier after the child thread finishes writing to its local data.

Add a paired memory barrier to the child thread's path after it finishes writing to its own data and before releasing the parent thread.
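The pairing discussed above can be sketched in portable C++ as follows. This is a minimal illustration, not the runtime's code: ChildState, reduce_data, arrived and gather_from_child are hypothetical names, and std::atomic_thread_fence(std::memory_order_seq_cst) stands in for KMP_MB() on the assumption that KMP_MB() acts as a full memory barrier. The flag accesses are deliberately relaxed, so only the two fences order the payload write against the payload read.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the part of a child thread's state that the
// parent reads during the gather phase.
struct ChildState {
  int reduce_data = 0;          // plain, non-atomic payload
  std::atomic<int> arrived{0};  // stand-in for the b_arrived flag
};

// Runs one child thread that publishes `value`, then has the calling
// (parent) thread wait for the flag and read the payload.
int gather_from_child(int value) {
  ChildState child;

  std::thread child_thread([&child, value] {
    child.reduce_data = value;  // write local data
    // Child-side barrier: make the write visible before the flag is released.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    child.arrived.store(1, std::memory_order_relaxed);
  });

  // Parent: spin until the child has marked its arrival.
  while (child.arrived.load(std::memory_order_relaxed) == 0)
    std::this_thread::yield();
  // Parent-side barrier: paired with the child's fence above.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  int seen = child.reduce_data;  // ordered after the child's write

  child_thread.join();
  return seen;
}
```

Calling gather_from_child(42) returns 42. Removing either fence leaves the read of reduce_data unordered with respect to the write, which is the kind of race the review is concerned with.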

Harbormaster failed remote builds in B52924: Diff 256954! Apr 13 2020, 4:16 AM

LGTM

It may be worth waiting a day or two for others' comments; up to you.

This revision is now accepted and ready to land. Apr 13 2020, 5:08 AM

Sorry, found one more issue.

I think the first MB should be moved inside the block:

for (level...
  if (((tid...

Rationale: the current code works for a one-level tree of threads. In a multi-level tree, a thread that is a parent at level 0 can become a child at level 1, and it will then miss the MB needed to publish the reduction data it accumulated at level 0 to its "new" parent at level 1. The former parent flushed only its initial reduction data, not the newly accumulated data (at line 612).

Once the MB is moved inside the loop-if block, to just before the child thread releases its flag, it will work for children at all levels (including those that were parents at lower levels of the tree).

E.g. let's look at 8 threads, t0 - t7, with the current code.
At level 0, parent t0 has children t1, t2, t3; parent t4 has children t5, t6, t7.
At the next level, level 1, parent t0 has one child, t4, which didn't flush its reduction data after reducing the data of its last child t7 at line 612. So t0 has a chance to reduce stale data from t4.
With the suggested code, thread t4 will flush its partial data right before releasing its flag (and won't flush its initial data, which is not needed; only the final partial data matter).
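The 8-thread layout in this example can be checked with a small helper that reuses the parent computation visible in the diff below. This is only a sketch: hyper_parent is a hypothetical name, and branch_bits = 2 (a branch factor of 4) is an assumption chosen to match the example, where each parent has up to three children.

```cpp
#include <cassert>
#include <cstdint>

// Returns the tid of the thread that `tid` reports to in the hypercube
// gather tree, or -1 if `tid` is the root and never acts as a child.
int hyper_parent(std::uint32_t tid, std::uint32_t num_threads,
                 std::uint32_t branch_bits) {
  assert(branch_bits > 0);
  const std::uint32_t branch_factor = 1u << branch_bits;
  for (std::uint32_t level = 0, offset = 1; offset < num_threads;
       level += branch_bits, offset <<= branch_bits) {
    // Same test as in the diff: a nonzero digit at this level means the
    // thread is a child here; lower levels were passed as a parent.
    if (((tid >> level) & (branch_factor - 1)) != 0)
      return static_cast<int>(tid & ~((1u << (level + branch_bits)) - 1));
  }
  return -1;
}
```

With 8 threads and branch_bits = 2, hyper_parent gives 0 for t1, t2 and t3, gives 4 for t5, t6 and t7, and gives 0 for t4. That last case is the one at issue: t4 first gathers from t5 - t7 as a parent and only afterwards reports to t0 as a child.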

This revision now requires changes to proceed. Apr 13 2020, 5:46 AM

Pair the memory barriers correctly at the same nesting level within the loop.
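To illustrate what pairing the barriers inside the loop buys, here is a self-contained sketch of a multi-level gather that sums one value per thread. It is not the runtime's implementation: Slot, tree_reduce and the child-iteration scheme are assumptions made for the example, and std::atomic_thread_fence(std::memory_order_seq_cst) again stands in for KMP_MB(). The point is that the child-side fence sits immediately before the flag release at whatever level the thread becomes a child, so it also covers data accumulated while the thread was a parent at lower levels.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-thread slot: a plain payload plus an arrival flag.
struct Slot {
  long reduce_data = 0;
  std::atomic<std::uint64_t> arrived{0};
};

// Sums (tid + 1) over all threads using a tree-shaped gather.
long tree_reduce(std::uint32_t num_threads, std::uint32_t branch_bits) {
  assert(num_threads > 0 && branch_bits > 0);
  const std::uint32_t branch_factor = 1u << branch_bits;
  std::vector<Slot> slots(num_threads);

  auto worker = [&](std::uint32_t tid) {
    slots[tid].reduce_data = static_cast<long>(tid) + 1;
    for (std::uint32_t level = 0, offset = 1; offset < num_threads;
         level += branch_bits, offset <<= branch_bits) {
      if (((tid >> level) & (branch_factor - 1)) != 0) {
        // Child at this level: publish everything accumulated so far,
        // then release the parent.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        slots[tid].arrived.store(1, std::memory_order_relaxed);
        return;
      }
      // Parent at this level: gather each child in turn.
      for (std::uint32_t child = 1, child_tid = tid + (1u << level);
           child < branch_factor && child_tid < num_threads;
           ++child, child_tid += (1u << level)) {
        while (slots[child_tid].arrived.load(std::memory_order_relaxed) == 0)
          std::this_thread::yield();
        // Paired with the child's fence before its flag release.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        slots[tid].reduce_data += slots[child_tid].reduce_data;
      }
    }
  };

  std::vector<std::thread> threads;
  for (std::uint32_t tid = 1; tid < num_threads; ++tid)
    threads.emplace_back(worker, tid);
  worker(0);  // the calling thread acts as the root
  for (auto &t : threads)
    t.join();
  return slots[0].reduce_data;
}
```

For 8 threads and branch_bits = 2, thread 4 adds the contributions of threads 5 to 7 before it fences and releases its flag, and thread 0 ends up with 1 + 2 + ... + 8 = 36.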

LGTM

This revision is now accepted and ready to land. Apr 13 2020, 10:47 AM

In D77603#1977658, @AndreyChurbanov wrote:
Sorry, found one more issue.

I think the first MB should be moved inside the block:
for (level...
  if (((tid...

You are absolutely right about that. I have fixed the problem in the newest version.

Should I also add memory barriers to __kmp_linear_barrier_gather_template, __kmp_tree_barrier_gather, and __kmp_hierarchical_barrier_gather? They seem to suffer from the same problem on systems with weak memory order (but I don't have a handy test case to prove it).

Harbormaster failed remote builds in B52957: Diff 257026! Apr 13 2020, 11:53 AM

In D77603#1978289, @bryanpkc wrote:

Should I also add memory barriers to __kmp_linear_barrier_gather_template, __kmp_tree_barrier_gather, and __kmp_hierarchical_barrier_gather? They seem to suffer from the same problem on systems with weak memory order (but I don't have a handy test case to prove it).

Yes, ideally barriers of all types should have these extra synchronizations, though with lower priority, I think, because the other barrier types are not used by default; they need to be explicitly requested. So it might be done in a separate patch. Or this patch could be extended; I'm not sure which is better.

In D77603#1978909, @AndreyChurbanov wrote:

Yes, ideally barriers of all types should have these extra synchronizations, though with lower priority, I think, because the other barrier types are not used by default; they need to be explicitly requested.

Thanks! In that case I will commit this one first, and follow up with another patch.

Closed by commit rGb86ff5f6efbe: [OpenMP] Sync writes to child thread's data before reduction (authored by bryanpkc). Apr 14 2020, 11:51 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                  Size
openmp/runtime/src/kmp_barrier.cpp    2 lines

Diff 257429

openmp/runtime/src/kmp_barrier.cpp

(543 preceding lines not shown; lines added by this revision are marked with "+")

  #endif
    for (level = 0, offset = 1; offset < num_threads;
         level += branch_bits, offset <<= branch_bits) {
      kmp_uint32 child;
      kmp_uint32 child_tid;

      if (((tid >> level) & (branch_factor - 1)) != 0) {
        kmp_int32 parent_tid = tid & ~((1 << (level + branch_bits)) - 1);

+       KMP_MB(); // Synchronize parent and child threads.
        KA_TRACE(20,
                 ("__kmp_hyper_barrier_gather: T#%d(%d:%d) releasing T#%d(%d:%d) "
                  "arrived(%p): %llu => %llu\n",
                  gtid, team->t.t_id, tid, __kmp_gtid_from_tid(parent_tid, team),
                  team->t.t_id, parent_tid, &thr_bar->b_arrived,
                  thr_bar->b_arrived,
                  thr_bar->b_arrived + KMP_BARRIER_STATE_BUMP));
        // Mark arrival to parent thread

(25 lines not shown)

  #endif /* KMP_CACHE_MANAGE */
                   ("__kmp_hyper_barrier_gather: T#%d(%d:%d) wait T#%d(%d:%u) "
                    "arrived(%p) == %llu\n",
                    gtid, team->t.t_id, tid, __kmp_gtid_from_tid(child_tid, team),
                    team->t.t_id, child_tid, &child_bar->b_arrived, new_state));
          // Wait for child to arrive
          kmp_flag_64 c_flag(&child_bar->b_arrived, new_state);
          c_flag.wait(this_thr, FALSE USE_ITT_BUILD_ARG(itt_sync_obj));
          ANNOTATE_BARRIER_END(child_thr);
+         KMP_MB(); // Synchronize parent and child threads.
  #if USE_ITT_BUILD && USE_ITT_NOTIFY
          // Barrier imbalance - write min of the thread time and a child time to
          // the thread.
          if (__kmp_forkjoin_frames_mode == 2) {
            this_thr->th.th_bar_min_time = KMP_MIN(this_thr->th.th_bar_min_time,
                                                   child_thr->th.th_bar_min_time);
          }
  #endif

(1,571 following lines not shown)