Page MenuHomePhabricator

[OpenMP] Sync writes to child thread's data before reduction
ClosedPublic

Authored by bryanpkc on Apr 6 2020, 3:48 PM.

Details

Summary

On systems with weak memory consistency, this patch fixes an intermittent crash
in the reduction function called by __kmp_hyper_barrier_gather, which suffers
from a race on a child thread's data.

Diff Detail

Event Timeline

bryanpkc created this revision.Apr 6 2020, 3:48 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2020, 3:48 PM

I don't see paired memory barrier in a child thread between assigning th.th_local.reduce_data in __kmp_barrier_template() and releasing b_arrived barrier flag that frees parent to go to reduce data. So it might be that the problem could just become lesser probable. Should paired KMP_MB be added after the reduce_data assignment? Or does atomic releasing of a flag serves as a memory barrier? Then my assumption is wrong and second MB is not needed.

I don't see paired memory barrier in a child thread between assigning th.th_local.reduce_data in __kmp_barrier_template() and releasing b_arrived barrier flag that frees parent to go to reduce data.

Thanks for the review. You are right, the atomic release does not serve as a memory barrier, and I should add a paired memory barrier after the child thread finishes writing to its local data.

bryanpkc updated this revision to Diff 256954.Apr 13 2020, 3:29 AM

Add a paired memory barrier to the child thread's path after it finishes writing to its own data and before releasing the parent thread.

AndreyChurbanov accepted this revision.Apr 13 2020, 5:08 AM

LGTM

May worth waiting for a day or two for others' comments, up to you.

This revision is now accepted and ready to land.Apr 13 2020, 5:08 AM
AndreyChurbanov requested changes to this revision.Apr 13 2020, 5:46 AM

Sorry, found one more issue.

I think the first MB should be moved inside the block:

for (level...
  if (((tid...

Rational: current code will work for one level tree of threads. For multiple levels tree, the parent thread on level 0 can become child on level 1, and it will miss MB for publishing its reduction data accumulated on level 0 to be used by its "new" parent on level 1. Former parent flushed its initial reduction data only, but not newly accumulated data (at line 612).

Once MB is moved inside the loop-if block - to before child thread flag releasing, it will work for all levels children (including those were parent at lower levels of the tree).

E.g. let's look at 8 threads t0 - t7 for current code.
At level 0 parent t0 has children t1, t2, t3; parent t4 has children t5, t6, t7.
At next level 1 parent t0 has one child t4, which didn't flush its reduction data after reducing data of the last child t7 at line 612. So t0 has a chance to reduce stale data of t4.
With suggested code thread t4 will flush its partial data right before its flag releasing (and won't flash initial data which is not needed, only final partial data matter).

This revision now requires changes to proceed.Apr 13 2020, 5:46 AM
bryanpkc updated this revision to Diff 257026.Apr 13 2020, 10:41 AM

Pair the memory barriers correctly at the same nesting level within the loop.

This revision is now accepted and ready to land.Apr 13 2020, 10:47 AM

Sorry, found one more issue.

I think the first MB should be moved inside the block:

for (level...
  if (((tid...

You are absolutely right about that. I have fixed the problem in the newest version.

Should I also add memory barriers to __kmp_linear_barrier_gather_template, __kmp_tree_barrier_gather, and __kmp_hierarchical_barrier_gather? They seem to suffer from the same problem on systems with weak memory order (but I don't have a handy test case to prove it).

Should I also add memory barriers to __kmp_linear_barrier_gather_template, __kmp_tree_barrier_gather, and __kmp_hierarchical_barrier_gather? They seem to suffer from the same problem on systems with weak memory order (but I don't have a handy test case to prove it).

Yes, ideally barriers of all types should have these extra synchronizations, though with lesser priority I think. Because other barriers do not work by default, they need to be explicitly requested. So it might be done in a separate patch. Or this patch can be extended, not sure what is better.

Yes, ideally barriers of all types should have these extra synchronizations, though with lesser priority I think. Because other barriers do not work by default, they need to be explicitly requested.

Thanks! In that case I will commit this one first, and follow up with another patch.

This revision was automatically updated to reflect the committed changes.