On systems with weak memory consistency, this patch fixes an intermittent crash in the reduction function called from __kmp_hyper_barrier_gather, which races on a child thread's reduction data: the parent can start reducing before the child's writes to that data are visible.
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
I don't see a paired memory barrier in the child thread between the assignment to th.th_local.reduce_data in __kmp_barrier_template() and the release of the b_arrived barrier flag that frees the parent to go reduce the data. So the patch might only make the problem less probable. Should a paired KMP_MB be added after the reduce_data assignment? Or does the atomic release of the flag serve as a memory barrier? In that case my assumption is wrong and the second MB is not needed.
Thanks for the review. You are right, the atomic release does not serve as a memory barrier, and I should add a paired memory barrier after the child thread finishes writing to its local data.
Add a paired memory barrier to the child thread's path after it finishes writing to its own data and before releasing the parent thread.
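As a sanity check on the intended pairing, here is a minimal self-contained model of it. std::atomic_thread_fence stands in for KMP_MB() (which is a full barrier), the relaxed atomic stands in for the b_arrived flag, and reduce_data only mimics th.th_local.reduce_data; this is a sketch of the ordering argument, not the runtime's actual code:

  #include <atomic>
  #include <cassert>
  #include <thread>

  static int reduce_data;               // models th.th_local.reduce_data
  static std::atomic<int> b_arrived{0}; // models the b_arrived flag

  void child() {
    reduce_data = 42;                                    // publish local contribution
    std::atomic_thread_fence(std::memory_order_release); // the paired "KMP_MB()" in the child
    b_arrived.store(1, std::memory_order_relaxed);       // release the parent
  }

  void parent() {
    while (b_arrived.load(std::memory_order_relaxed) == 0) {
    } // wait for the child's arrival
    std::atomic_thread_fence(std::memory_order_acquire); // the "KMP_MB()" before reducing
    assert(reduce_data == 42); // safe to read the child's data now
  }

  int main() {
    std::thread c(child), p(parent);
    c.join();
    p.join();
  }

Without the two fences, the relaxed flag alone gives no ordering between the child's data write and the parent's read, which is the window the crash report describes.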
Sorry, found one more issue.
I think the first MB should be moved inside the block:
for (level... if (((tid...
Rationale: the current code works for a one-level tree of threads. For a multi-level tree, a thread that is a parent at level 0 can become a child at level 1, and it will then miss the MB that publishes the reduction data it accumulated at level 0 for its "new" parent at level 1. The former parent flushed only its initial reduction data, not the newly accumulated data (at line 612).
Once the MB is moved inside the loop-if block, to just before the child thread releases its flag, it will work for children at all levels (including those that were parents at lower levels of the tree), as sketched in the model after the example below.
E.g., let's look at 8 threads t0 - t7 with the current code.
At level 0, parent t0 has children t1, t2, t3; parent t4 has children t5, t6, t7.
At the next level, level 1, parent t0 has one child, t4, which didn't flush its reduction data after reducing the data of the last child t7 at line 612. So t0 has a chance to reduce stale data from t4.
With the suggested code, thread t4 will flush its partial data right before releasing its flag (and won't flush the initial data, which is not needed; only the final partial data matters).
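To make the level argument concrete, here is a small self-contained model of a two-level gather over four threads with branch factor 2 (t1 releases to t0 and t3 to t2 at level 0, then t2 releases to t0 at level 1). The release fence sits inside the per-level child branch, right before the flag store, so a thread that was a parent at a lower level (t2 here) publishes the partial result it accumulated there. std::atomic_thread_fence again stands in for KMP_MB(); this models the suggested placement and is not the runtime's code:

  #include <atomic>
  #include <cstdio>
  #include <thread>
  #include <vector>

  constexpr int kThreads = 4;
  int reduce_data[kThreads];            // models th.th_local.reduce_data
  std::atomic<int> b_arrived[kThreads]; // models per-thread b_arrived flags

  void gather(int tid) {
    reduce_data[tid] = tid + 1; // this thread's own contribution
    for (int level = 0, offset = 1; offset < kThreads; ++level, offset <<= 1) {
      if ((tid >> level) & 1) {
        // Child at this level: publish everything accumulated so far, then release.
        std::atomic_thread_fence(std::memory_order_release); // the moved "first MB"
        b_arrived[tid].store(level + 1, std::memory_order_relaxed);
        break;
      }
      // Parent at this level: wait for the child, then fold in its partial data.
      int child_tid = tid + offset;
      while (b_arrived[child_tid].load(std::memory_order_relaxed) != level + 1) {
      }
      std::atomic_thread_fence(std::memory_order_acquire); // pairs with the child's fence
      reduce_data[tid] += reduce_data[child_tid];
    }
  }

  int main() {
    std::vector<std::thread> pool;
    for (int tid = 0; tid < kThreads; ++tid)
      pool.emplace_back(gather, tid);
    for (auto &t : pool)
      t.join();
    std::printf("total = %d\n", reduce_data[0]); // 1+2+3+4 = 10
  }

If the release fence were issued only once, before the loop, t2's accumulation of t3's contribution would not be ordered before its b_arrived store, and t0 could fold in stale data, which is exactly the window described above.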
You are absolutely right about that. I have fixed the problem in the newest version.
Should I also add memory barriers to __kmp_linear_barrier_gather_template, __kmp_tree_barrier_gather, and __kmp_hierarchical_barrier_gather? They seem to suffer from the same problem on systems with weak memory ordering (but I don't have a handy test case to prove it).
Yes, ideally barriers of all types should get these extra synchronizations, though with lower priority I think, because the other barrier patterns are not used by default; they need to be explicitly requested. So it might be done in a separate patch, or this patch could be extended; I'm not sure which is better.
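(For context, if I recall the runtime's knobs correctly: the hyper pattern being fixed here is the default, while the other gather algorithms are only exercised when requested explicitly through environment variables such as KMP_PLAIN_BARRIER_PATTERN, KMP_FORKJOIN_BARRIER_PATTERN, or KMP_REDUCTION_BARRIER_PATTERN, e.g. set to "linear,linear" or "tree,tree" for the gather and release phases.)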