This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget][nfc] Drop dead code in parallel_51
AbandonedPublic

Authored by JonChesterfield on Jul 9 2021, 6:01 AM.

Details

Summary

parallel_51 increments an entry of a shared array (the parallel level) and then
decrements it. This change deletes that, leaving the comments in place. The code
was introduced in D95976, which notes the implementation is a stopgap. Even if this
is not code which is expected to work, I'd like to drop the accesses to the parallel
level, as that simplifies reasoning about its other uses.

Diff Detail

Event Timeline

JonChesterfield requested review of this revision.Jul 9 2021, 6:01 AM
JonChesterfield created this revision.
Herald added a project: Restricted Project. Jul 9 2021, 6:01 AM
JonChesterfield added inline comments.Jul 9 2021, 6:04 AM
openmp/libomptarget/deviceRTLs/common/src/parallel.cu
327

Checked the implementation of this, it does not access parallelLevel.

JonChesterfield edited the summary of this revision. (Show Details)Jul 9 2021, 6:17 AM

We don't support nested parallelism, so the parallel level doesn't have much effect, EXCEPT for queries of the parallel level via the OpenMP APIs. I'm wondering if removing these accesses will cause those APIs to return a wrong value.

jdoerfert requested changes to this revision.Jul 9 2021, 7:51 AM
  1. omp_get_level (and similar friends) will not know we are in a parallel region now.
  2. if there is a nested parallel region it will not know we are in a parallel region (see my comment) and we will most likely deadlock.

(Also, I don't think "optimizing" the old runtime is worth it at this point.)

openmp/libomptarget/deviceRTLs/common/src/parallel.cu
297

^^^

This revision now requires changes to proceed.Jul 9 2021, 7:51 AM

The only code that executes between the increment and the following decrement is two calls to barrier_simple_spmd, which do not read the parallel level. There can be no user code executing between the two parts deleted here, and from checking the IR before and after this change, opt already deletes it anyway. Dropping this dead code makes the source clearer (and compilation fractionally faster) at zero cost.

I'm interested in reducing overhead in the current runtime because codegen for the simple spmd case looks close enough to cuda that I'm hopeful the gap can be narrowed, which would be a big deal for benchmarks until the new runtime comes online.

JonChesterfield added inline comments.Jul 9 2021, 9:16 AM
openmp/libomptarget/deviceRTLs/common/src/parallel.cu
297

This is the branch that can be folded after D105699, letting this whole function call turn into a tail call to invoke_microtask. It is unaffected by the code deleted in this patch.

> The only code that executes between the increment and the following decrement is two calls to barrier_simple_spmd, which do not read the parallel level. There can be no user code executing between the two parts deleted here, and from checking the IR before and after this change, opt already deletes it anyway. Dropping this dead code makes the source clearer (and compilation fractionally faster) at zero cost.

Could you please watch the webinar (https://www.openmp.org/events/webinar-a-compilers-view-of-the-openmp-api/) or read the TRegion paper; both explain what is happening here. The two barriers activate the workers and wait for them to finish. Workers can and do read the parallel level.

> I'm interested in reducing overhead in the current runtime because codegen for the simple spmd case looks close enough to cuda that I'm hopeful the gap can be narrowed, which would be a big deal for benchmarks until the new runtime comes online.

OK, I can accept that the code is not dead. I attempted to reproduce opt deleting it and failed; it looks like that was an artefact of looking at spmd kernels - the library itself does not drop the accesses under optimisation.

Given that it's not dead, and this is a way of passing information to worker threads, surely the parallel level needs to be incremented before whatever the workers are waiting on - which in this case looks like before __kmpc_begin_sharing_variables (or possibly before prepare_parallel)? Otherwise the worker threads are fairly likely to run to completion before the parallel level has been incremented.

> OK, I can accept that the code is not dead. I attempted to reproduce opt deleting it and failed; it looks like that was an artefact of looking at spmd kernels - the library itself does not drop the accesses under optimisation.
>
> Given that it's not dead, and this is a way of passing information to worker threads, surely the parallel level needs to be incremented before whatever the workers are waiting on - which in this case looks like before __kmpc_begin_sharing_variables (or possibly before prepare_parallel)? Otherwise the worker threads are fairly likely to run to completion before the parallel level has been incremented.

Workers wait in __kmpc_barrier_simple_spmd. Setup has to happen before the first barrier, teardown after the second.

JonChesterfield abandoned this revision.Jul 9 2021, 10:07 AM

OK, caught up. No dead code here after all. Thank you. I'm still having a bad time with the control flow in this library.