Moved initial checks of num_threads vs. 1 from kmp_reserve_threads(), which is called under the forkjoin lock, to the upper level kmp_fork_call().
Thus most cases of serialized parallels do not interfere with the forkjoin lock (e.g. nested parallels when omp nested is disabled).
Details
- Reviewers
hfinkel
Diff Detail
- Repository
- rL LLVM
Event Timeline
runtime/src/kmp_runtime.c | ||
---|---|---|
1740 | Why do we only release the lock when nthreads == 1? Does __kmp_reserve_threads release it otherwise? (I realize that you've only moved this line from down below, but this seems non-obvious) |
runtime/src/kmp_runtime.c | ||
---|---|---|
1740 | No, the __kmp_reserve_threads does not release the lock. Let me detail the rational of the change: Old code: get lock always at the beginning, then release lock for nthreads==1 on line 1756, for nthreads>1 on line 2168 when a lot of multithread-sensitive actions have completed. New code: lock skipped for simple 1-thread cases, but still got lock for other 1-thread cases (e.g. when serial execution caused by dynamic threads adjustment inside __kmp_reserve_threads). As a result the lock releasing for 1-thread moved here, because it now cannot be done for all 1-thread cases. Multi-thread case releases the lock in the same place as earlier. Performance result - 10x or more speedup of the code like <long loop> #pragma omp parallel #pragma omp parallel where inner parallel region are serialized by default because OMP nesting is disabled, and number of threads in outer region is big (e.g. 60 threads on Xeon PHI to keep all cores busy). |
runtime/src/kmp_runtime.c | ||
---|---|---|
1740 | Okay, sounds good. Please add a comment here explaining that in the multi-thread case the lock is released later on in the function. Otherwise, LGTM. |
Why do we only release the lock when nthreads == 1? Does __kmp_reserve_threads release it otherwise?
(I realize that you've only moved this line from down below, but this seems non-obvious)