parallel_51 increments a shared variable array and then decrements it. This change deletes that, leaving the comments in place. The code was introduced in D95976, which notes the implementation is a stopgap. Even if this is not code that is expected to work, I'd like to drop the accesses to the parallel level, as that simplifies reasoning about its other uses.
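For orientation, here is a hedged sketch of the shape being removed; the helper GetWarpId and the exact arguments are stand-ins for illustration, not a verbatim copy of parallel.cu:

```cpp
// Illustrative sketch of the stopgap __kmpc_parallel_51 path (not the exact source).
// Stand-in declarations so the shape is self-contained:
extern unsigned parallelLevel[];                       // per-warp level array
extern void __kmpc_barrier_simple_spmd(void *loc, int tid);
extern unsigned GetWarpId();                           // hypothetical helper

void parallel_51_sketch(void *loc, int tid) {
  parallelLevel[GetWarpId()] += 1;       // increment removed by this diff
  __kmpc_barrier_simple_spmd(loc, tid);  // release the workers
  __kmpc_barrier_simple_spmd(loc, tid);  // wait for the workers to finish
  parallelLevel[GetWarpId()] -= 1;       // decrement removed by this diff
}
```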
Event Timeline
openmp/libomptarget/deviceRTLs/common/src/parallel.cu:327
> Checked the implementation of this, it does not access parallelLevel.
We don't support nested parallelism, so the parallel level doesn't have much effect, EXCEPT for queries of the parallel level from the OpenMP APIs. I'm wondering if removing them will cause those APIs to return a wrong value.
- omp_get_level (and friends) will no longer know we are in a parallel region.
- if there is a nested parallel region, it will not know we are in a parallel region (see my inline comment) and we will most likely deadlock.
(Also, I don't think "optimizing" the old runtime is worth it at this point.)
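For reference, a minimal offload example of the kind of query being discussed; per the concern above, if the level is no longer tracked these calls could report 0 inside the region (illustrative, not a measured result):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp target
  #pragma omp parallel num_threads(4)
  {
    if (omp_get_thread_num() == 0)
      // Expected inside the parallel region: level=1 active=1. The worry
      // raised above is that dropping the level tracking makes these
      // queries return the wrong value.
      printf("level=%d active=%d\n", omp_get_level(), omp_get_active_level());
  }
  return 0;
}
```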
openmp/libomptarget/deviceRTLs/common/src/parallel.cu:297
> ^^^
The only code that executes between the increment and the following decrement is two calls to barrier_simple_spmd, which do not read the parallel level. No user code can execute between the two parts deleted here, and from checking the IR before and after this change, opt already deletes it anyway. Dropping this dead code makes the source clearer (and compilation fractionally faster) at zero cost.
I'm interested in reducing overhead in the current runtime because codegen for the simple SPMD case looks close enough to CUDA that I'm hopeful the gap can be narrowed, which would be a big deal for benchmarks until the new runtime comes online.
Could you please watch the webinar (https://www.openmp.org/events/webinar-a-compilers-view-of-the-openmp-api/) or read the TRegion paper; both explain what is happening here. The two barriers activate the workers and then wait for them to finish. Workers can and do read the parallel level.
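A rough worker-side view of the handshake being described (a sketch under the same stand-in declarations as above; the real worker state machine differs):

```cpp
// Hedged sketch of the worker side: between the master's two barriers the
// workers run the outlined region, and queries inside it read parallelLevel,
// so the level must already be incremented when the first barrier releases them.
extern unsigned parallelLevel[];
extern void __kmpc_barrier_simple_spmd(void *loc, int tid);
extern void run_outlined_region();       // hypothetical stand-in for user code

void worker_sketch(void *loc, int tid) {
  __kmpc_barrier_simple_spmd(loc, tid);  // released by the master's first barrier
  run_outlined_region();                 // may call omp_get_level() etc.
  __kmpc_barrier_simple_spmd(loc, tid);  // rejoin at the master's second barrier
}
```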
OK, I can accept that the code is not dead. I attempted to reproduce opt deleting it and failed; it looks like that was an artefact of looking at SPMD kernels - the library itself does not drop the accesses under optimisation.
Given that it's not dead, and that this is a way of passing information to worker threads, surely the parallel level needs to be incremented before whatever the workers are waiting on? In this case that looks like before __kmpc_begin_sharing_variables (or possibly before prepare_parallel); otherwise the worker threads are fairly likely to run to completion before the parallel level has been incremented.
Workers wait on __kmpc_barrier_simple_spmd. Setup has to happen before the first barrier, teardown after the second.
OK, caught up. No dead code here after all. Thank you. I'm still having a bad time with the control flow in this library.