Restructured dynamic loop dispatcher code.
Fixed handling of dispatch buffers for the nonmonotonic dynamic (static_steal) schedule:
- eliminated the possibility of stealing iterations of the wrong loop when the victim thread has changed its buffer to work on another loop;
- fixed a race when the victim thread has changed its buffer to work in a nested parallel region;
- eliminated the "static" property of the schedule, that is, a single thread can now execute the whole loop.
Sometimes (not always, so it looks like a data race), when running this test on an Arm 64-bit machine with 46 cores (and on a Power9 machine with 40 cores), all the threads end up waiting here, so the test makes no further progress.
All the cases I've seen happen with KMP_DISP_NUM_BUFFERS=3 and -DMY_SCHEDULE=guided.
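For anyone trying to narrow this down, a minimal stress loop along the following lines might help. It is a sketch, not the test from the patch: the function name `run_nowait_loops` and the loop counts are hypothetical, and it assumes `MY_SCHEDULE` is a compile-time macro (as the `-D` flag suggests) and that `KMP_DISP_NUM_BUFFERS` is set in the environment before the program starts. The idea is that several consecutive `nowait` loops let fast threads run ahead to later loops, which should put pressure on the reuse of dispatch buffers.

```cpp
#include <cassert>

#ifndef MY_SCHEDULE
#define MY_SCHEDULE guided  // assumption: mirrors -DMY_SCHEDULE=guided
#endif

// Hypothetical stress helper: runs `loops` consecutive worksharing loops of
// `n` iterations each, without a barrier between them, and returns the total
// number of iterations executed. Without OpenMP enabled the pragmas are
// ignored and the code runs serially with the same result.
long run_nowait_loops(int loops, int n) {
  assert(loops >= 0 && n >= 0);
  long total = 0;
#pragma omp parallel reduction(+ : total)
  {
    for (int l = 0; l < loops; ++l) {
      // nowait lets threads move on to the next loop while others are still
      // working on this one.
#pragma omp for schedule(MY_SCHEDULE) nowait
      for (int i = 0; i < n; ++i)
        total += 1;
    }
  }
  return total;
}
```

If every iteration is executed exactly once, `run_nowait_loops(loops, n)` returns `loops * n`; a hang or a different total would point at the dispatcher rather than at the loop body.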
Any idea how I could debug this further?
A quick look at sh->buffer_index shows it is volatile and it is updated in
Given that this is not an atomic operation (although it is followed by a memory barrier), my only hypothesis is that the original load of sh->buffer_index might have read an old value. But that would suggest KMP_MB() is not effective on these targets, so I am at a loss here.
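To make the hypothesis concrete, here is a minimal sketch of the ordering a waiter would need, written with standard C++ atomics rather than the runtime's own primitives. Everything in it is an assumption for illustration: `SharedBuf`, `payload`, and `publish_and_wait` are hypothetical names and not the runtime's data layout or code. It only shows the release/acquire pairing that guarantees a thread spinning on an index also sees the data written before the index was advanced, which is the property a volatile variable plus a barrier would have to provide on weakly ordered targets.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a shared dispatch buffer; not the runtime's layout.
struct SharedBuf {
  std::atomic<unsigned> buffer_index{0};
  int payload = 0;  // data that must be visible once the index advances
};

// Spawns a waiter that spins until buffer_index reaches `target`, publishes
// `value` from the calling thread, and returns the payload the waiter saw.
int publish_and_wait(unsigned target, int value) {
  assert(target != 0);  // the index starts at 0, so 0 would not make the waiter wait
  SharedBuf sh;
  int seen = -1;
  std::thread waiter([&] {
    // The acquire load pairs with the release store below.
    while (sh.buffer_index.load(std::memory_order_acquire) != target)
      std::this_thread::yield();
    seen = sh.payload;  // ordered after the payload write by the pairing
  });
  sh.payload = value;
  sh.buffer_index.store(target, std::memory_order_release);
  waiter.join();
  return seen;
}
```

With this pairing, `publish_and_wait(3, 42)` returns 42 on any conforming implementation. If replacing the load and the update of the index with explicit atomics of this kind made the hang disappear, that would support the stale-read hypothesis; if not, the cause is more likely elsewhere.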
Thanks!