parallel_51 increments a shared variable array and then decrements it. This change deletes that, leaving the comments in place. The code was introduced in D95976, which notes the implementation is a stopgap. Even if this is not code that is expected to work, I'd like to drop the accesses to the parallel level, as that simplifies reasoning about its other uses.
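For orientation, here is a hedged sketch of the shape being removed; the helper GetWarpId and the exact arguments are stand-ins for illustration, not a verbatim copy of parallel.cu:

```cpp
// Illustrative sketch of the stopgap __kmpc_parallel_51 path (not the exact source).
// Stand-in declarations so the shape is self-contained:
extern unsigned parallelLevel[];                       // per-warp level array
extern void __kmpc_barrier_simple_spmd(void *loc, int tid);
extern unsigned GetWarpId();                           // hypothetical helper

void parallel_51_sketch(void *loc, int tid) {
  parallelLevel[GetWarpId()] += 1;       // increment removed by this diff
  __kmpc_barrier_simple_spmd(loc, tid);  // release the workers
  __kmpc_barrier_simple_spmd(loc, tid);  // wait for the workers to finish
  parallelLevel[GetWarpId()] -= 1;       // decrement removed by this diff
}
```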
Event Timeline
openmp/libomptarget/deviceRTLs/common/src/parallel.cu:327
> Checked the implementation of this, it does not access parallelLevel.
We don't support nested parallelism, so the parallel level doesn't have much effect, EXCEPT for queries of the parallel level from the OpenMP APIs. I'm wondering if removing them will cause those APIs to return a wrong value.
- omp_get_level (and friends) will no longer know we are in a parallel region.
- if there is a nested parallel region, it will not know we are in a parallel region (see my inline comment) and we will most likely deadlock.
(Also, I don't think "optimizing" the old runtime is worth it at this point.)
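For reference, a minimal offload example of the kind of query being discussed; per the concern above, if the level is no longer tracked these calls could report 0 inside the region (illustrative, not a measured result):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp target
  #pragma omp parallel num_threads(4)
  {
    if (omp_get_thread_num() == 0)
      // Expected inside the parallel region: level=1 active=1. The worry
      // raised above is that dropping the level tracking makes these
      // queries return the wrong value.
      printf("level=%d active=%d\n", omp_get_level(), omp_get_active_level());
  }
  return 0;
}
```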
openmp/libomptarget/deviceRTLs/common/src/parallel.cu:297
> ^^^
The only code that executes between the increment and the following decrement is two calls to barrier_simple_spmd, which do not read the parallel level. No user code can execute between the two parts deleted here, and from checking the IR before and after this change, opt already deletes it anyway. Dropping this dead code makes the source clearer (and compilation fractionally faster) at zero cost.
I'm interested in reducing overhead in the current runtime because codegen for the simple SPMD case looks close enough to CUDA that I'm hopeful the gap can be narrowed, which would be a big deal for benchmarks until the new runtime comes online.
Could you please watch the webinar (https://www.openmp.org/events/webinar-a-compilers-view-of-the-openmp-api/) or read the TRegion paper; both explain what is happening here. The two barriers activate the workers and then wait for them to finish. Workers can and do read the parallel level.
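A rough worker-side view of the handshake being described (a sketch under the same stand-in declarations as above; the real worker state machine differs):

```cpp
// Hedged sketch of the worker side: between the master's two barriers the
// workers run the outlined region, and queries inside it read parallelLevel,
// so the level must already be incremented when the first barrier releases them.
extern unsigned parallelLevel[];
extern void __kmpc_barrier_simple_spmd(void *loc, int tid);
extern void run_outlined_region();       // hypothetical stand-in for user code

void worker_sketch(void *loc, int tid) {
  __kmpc_barrier_simple_spmd(loc, tid);  // released by the master's first barrier
  run_outlined_region();                 // may call omp_get_level() etc.
  __kmpc_barrier_simple_spmd(loc, tid);  // rejoin at the master's second barrier
}
```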
OK, I can accept that the code is not dead. I attempted to reproduce opt deleting it and failed; it looks like that was an artefact of looking at SPMD kernels - the library itself does not drop the accesses under optimisation.
Given that it's not dead, and that this is a way of passing information to worker threads, surely the parallel level needs to be incremented before whatever the workers are waiting on? In this case that looks like before __kmpc_begin_sharing_variables (or possibly before prepare_parallel); otherwise the worker threads are fairly likely to run to completion before the parallel level has been incremented.
Workers wait on __kmpc_barrier_simple_spmd. Setup has to happen before the first barrier, teardown after the second.
OK, caught up. No dead code here after all. Thank you. I'm still having a bad time with the control flow in this library.