This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget][devicertl] Remove branches around setting parallelLevel
ClosedPublic

Authored by JonChesterfield on Jul 9 2021, 7:25 AM.

Details

Reviewers

jdoerfert
ABataev
grokos
tianshilei1992
ye-luo
ronlieb
carlo.bertolli
pdhaliwal
ggeorgakoudis
Meinersbur

Commits

rGb6b53ffef441: [libomptarget][devicertl] Remove branches around setting parallelLevel

Summary

Simplifies control flow to allow store/load forwarding

This change folds two basic blocks into one, leaving a single store to parallelLevel.
This is a step towards SPMD kernels in which sufficiently aggressive inlining folds
the loads from parallelLevel, so that the nested parallel handling can be discarded
when it is unused.

Transform:

int threadId = GetThreadIdInBlock();
if (threadId == 0) {
  parallelLevel[0] = expr;
} else if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// because
unsigned GetLaneId() { return GetThreadIdInBlock() & (WARPSIZE - 1);}
// so whenever threadId == 0, GetLaneId() is also 0.

That replaces stores in two distinct basic blocks with a single store.
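
The equivalence claimed above can be checked exhaustively on the host. The following is a minimal sketch, not device code: it assumes a power-of-two WARPSIZE of 32 and a hypothetical limit of 1024 threads per block, and the helper names (laneIdOf, warpIdOf, storeBranchy, storeFolded, formsAgreeForAllBlockSizes) are invented for illustration. A sequential loop over thread ids stands in for the threads of a block.

```cpp
#include <cassert>
#include <vector>

// Assumptions for this sketch: a power-of-two warp size and a hypothetical
// upper bound on threads per block. Neither value comes from the patch.
constexpr unsigned WARPSIZE = 32;
constexpr unsigned MAX_THREADS = 1024;

unsigned laneIdOf(unsigned tid) { return tid & (WARPSIZE - 1); }
unsigned warpIdOf(unsigned tid) { return tid / WARPSIZE; }

// Original form: thread 0 and the other lane-0 threads store in two branches.
std::vector<int> storeBranchy(unsigned numThreads, int expr) {
  assert(numThreads <= MAX_THREADS);
  std::vector<int> parallelLevel(MAX_THREADS / WARPSIZE, 0);
  for (unsigned tid = 0; tid < numThreads; ++tid) {
    if (tid == 0) {
      parallelLevel[0] = expr;
    } else if (laneIdOf(tid) == 0) {
      parallelLevel[warpIdOf(tid)] = expr;
    }
  }
  return parallelLevel;
}

// Folded form: a single branch on the lane id.
std::vector<int> storeFolded(unsigned numThreads, int expr) {
  assert(numThreads <= MAX_THREADS);
  std::vector<int> parallelLevel(MAX_THREADS / WARPSIZE, 0);
  for (unsigned tid = 0; tid < numThreads; ++tid) {
    if (laneIdOf(tid) == 0) {
      parallelLevel[warpIdOf(tid)] = expr;
    }
  }
  return parallelLevel;
}

// Compare both forms for every block size up to the assumed limit.
bool formsAgreeForAllBlockSizes() {
  for (unsigned n = 1; n <= MAX_THREADS; ++n) {
    if (storeBranchy(n, 2) != storeFolded(n, 2)) {
      return false;
    }
  }
  return true;
}
```

Both forms write the same slots because thread 0 has lane id 0 and warp id 0, so the first branch of the original is a special case of the second.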

A more aggressive follow up is possible if the threads in the warp/wave
race to write the same value to the same address. This is not done as
part of this change.

if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
parallelLevel[GetWarpId()] = expr;
// because
unsigned GetWarpId() { return GetThreadIdInBlock() / WARPSIZE; }
// so GetWarpId will index the same element for every thread in the warp
// and, because expr is lane-invariant in this case, every lane stores the
// same value to this unique address

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

JonChesterfield requested review of this revision. Jul 9 2021, 7:25 AM

JonChesterfield created this revision.

Herald added a project: Restricted Project. Jul 9 2021, 7:25 AM

Herald added a subscriber: openmp-commits.

Harbormaster completed remote builds in B113194: Diff 357502. Jul 9 2021, 8:19 AM

Herald added a subscriber: sstefan1. Jul 9 2021, 8:19 AM

JonChesterfield mentioned this in D105697: [libomptarget][nfc] Drop dead code in parallel_51. Jul 9 2021, 9:16 AM

Introducing write races is something I'd prefer to avoid. Is there any measurable improvement through this change?

It cleans up the IR a lot but performance change is in the noise on amdgpu. I suspect our main bottlenecks are elsewhere. I'd be interested to hear how it changes a recent nvptx card.

My working theory is that once all the branches that are introduced by the openmp runtime have been optimised out, the IR will look much like it does under cuda and perform much the same, except for overheads in the host runtime.

Two options to avoid the race are to take only the first transform, which doesn't currently fold the loads but might do after some other optimisations improve, or to use a relaxed atomic store to make the race well defined (and probably emit exactly the same ISA).

I'll split this into the definitely good and racy subsections, see if I can get opt to drop parallelLevel for spmd without the latter.
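
The relaxed atomic store option mentioned above can be sketched in host-side C++. This is an illustration under stated assumptions rather than device code: std::thread objects stand in for the lanes of a warp, std::atomic<int> stands in for one parallelLevel slot, and the function name storeFromAllLanes is hypothetical. It says nothing about the ISA a GPU backend would emit.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Every "lane" stores the same value to the same slot. Because the stores are
// atomic, the concurrent writes are well defined; relaxed ordering requests
// no additional synchronisation beyond atomicity.
int storeFromAllLanes(unsigned lanes, int value) {
  std::atomic<int> slot{0};
  std::vector<std::thread> threads;
  threads.reserve(lanes);
  for (unsigned lane = 0; lane < lanes; ++lane) {
    threads.emplace_back([&slot, value] {
      slot.store(value, std::memory_order_relaxed);
    });
  }
  for (std::thread &t : threads) {
    t.join(); // joining orders each thread's store before the load below
  }
  return slot.load(std::memory_order_relaxed);
}
```

Calling storeFromAllLanes(32, 3) returns 3 regardless of which thread stores last, since every store writes the same value.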

drop second transform

JonChesterfield edited the summary of this revision. Jul 9 2021, 10:14 AM

JonChesterfield edited the summary of this revision.

Harbormaster completed remote builds in B113235: Diff 357556. Jul 9 2021, 11:17 AM

• post.kadirselcuk added a child revision: D34362: [LNT] Support for different DataSet usage in Polybench for "lnt runtest nt". Jul 10 2021, 5:55 PM

• post.kadirselcuk added a parent revision: D105762: [X86] Teach X86FloatingPoint's handleCall to only erase the FP stack if there is a regmask operand that clobbers the FP stack. Jul 10 2021, 8:06 PM

craig.topper removed a parent revision: D105762: [X86] Teach X86FloatingPoint's handleCall to only erase the FP stack if there is a regmask operand that clobbers the FP stack. Jul 10 2021, 9:47 PM

Ping. Trivial change, makes codegen better on both targets, step on the path to eliminating spmd overhead. Can we have this?

LG.

This revision is now accepted and ready to land. Jul 12 2021, 5:08 PM

Closed by commit rGb6b53ffef441: [libomptarget][devicertl] Remove branches around setting parallelLevel (authored by JonChesterfield). Jul 13 2021, 4:07 AM

This revision was automatically updated to reflect the committed changes.

JonChesterfield added a commit: rGb6b53ffef441: [libomptarget][devicertl] Remove branches around setting parallelLevel.

efriedma removed a child revision: D34362: [LNT] Support for different DataSet usage in Polybench for "lnt runtest nt". Jul 17 2021, 3:02 PM

Revision Contents

Path: openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
Size: 7 lines

Diff 358227

openmp/libomptarget/deviceRTLs/common/src/omptarget.cu

 static void __kmpc_spmd_kernel_init(bool RequiresFullRuntime) {
   PRINT0(LD_IO, "call to __kmpc_spmd_kernel_init\n");

   setExecutionParameters(Spmd, RequiresFullRuntime ? RuntimeInitialized
                                                    : RuntimeUninitialized);
   int threadId = GetThreadIdInBlock();
   if (threadId == 0) {
     usedSlotIdx = __kmpc_impl_smid() % MAX_SM;
-    parallelLevel[0] =
-        1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
-  } else if (GetLaneId() == 0) {
+  }
+  if (GetLaneId() == 0) {
     parallelLevel[GetWarpId()] =
         1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
   }

   __kmpc_data_sharing_init_stack();
   if (!RequiresFullRuntime)
     return;

   //
   // Team Context Initialization.
   //
   // In SPMD mode there is no master thread so use any cuda thread for team