
[OpenMP] Unified entry point for SPMD & generic kernels in the device RTL
Accepted · Public

Authored by jdoerfert on May 6 2021, 12:00 AM.

Details

Summary

In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used so the logic can be reused in Flang. A custom state machine will be
generated in the follow-up patch.

The "surplus" threads of the "master warp" will no longer exit early, so
we need to use non-aligned barriers. The new runtime will not have an
extra warp but will also require these non-aligned barriers.

[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11

This was in parts extracted from D59319.

Diff Detail

Event Timeline

jdoerfert created this revision.May 6 2021, 12:00 AM
jdoerfert requested review of this revision.May 6 2021, 12:00 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project.May 6 2021, 12:00 AM
NOTE: not all tests have been updated, only *codegen.cpp ones.
jdoerfert updated this revision to Diff 343301.May 6 2021, 12:02 AM

Remove Itanium mangling change

ABataev added inline comments.May 6 2021, 4:37 AM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
65–68

Why not __syncthreads? It is safer to use __syncthreads as it is convergent. It would also be good to mark this code as convergent somehow, to avoid incorrect optimizations.

jdoerfert added inline comments.May 6 2021, 7:51 AM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
65–68

The problem is that __syncthreads is basically a bar.sync, which is a barrier.sync.aligned, if I understood everything properly. This worked so far because the "main thread" (lane 0, last warp) was alone in its warp and all other threads had been terminated. Now we simplify the control flow (and later get rid of the last warp) such that the threads of the last warp and the main thread will hit different barriers: the former hit the one in the state machine while the latter will be in parallel_51. The .aligned version doesn't allow that. Does that make sense?

I'm not concerned about convergent, though; we solved that wholesale: we mark all functions that clang compiles for the GPU via OpenMP target offloading as convergent (IIRC). The entire device runtime is certainly convergent.
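For readers unfamiliar with the distinction, here is a minimal sketch of the two barrier flavors. This is illustrative only, not the deviceRTL code; the function names and the barrier id are invented:

```cuda
// __syncthreads() lowers to "bar.sync 0", i.e. barrier.sync.aligned:
// every thread of the CTA must reach this exact instruction.
__device__ void aligned_barrier() { __syncthreads(); }

// The non-aligned form only requires that the participating threads use
// the same barrier id; they may reach textually different instructions.
__device__ void unaligned_barrier(unsigned Id) {
  asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
}
```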

ABataev accepted this revision.May 6 2021, 8:26 AM

LG

openmp/libomptarget/deviceRTLs/interface.h
421

Formatting

This revision is now accepted and ready to land.May 6 2021, 8:26 AM
JonChesterfield requested changes to this revision.May 6 2021, 8:34 AM

What are the required semantics of the barrier operations? Amdgcn builds them on shared memory, so probably needs a change to the corresponding target_impl to match

This revision now requires changes to proceed.May 6 2021, 8:34 AM

> What are the required semantics of the barrier operations? Amdgcn builds them on shared memory, so probably needs a change to the corresponding target_impl to match

I have *not* tested AMDGCN but I was not expecting a problem. The semantics I need here are:
- warp N, thread 0 hits a barrier instruction I0
- warp N, threads 1-31 hit a barrier instruction I1
- the entire warp synchronizes and moves on.
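A hedged sketch of that requirement (the function name and barrier id are invented for illustration): thread 0 and threads 1-31 sit at textually different barrier instructions that share an id, and all of them proceed together:

```cuda
// Invented example: two different barrier instructions, same id.
__device__ void state_machine_step(unsigned Id) {
  if (threadIdx.x % 32 == 0) {
    // warp N, thread 0: barrier instruction I0
    asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
  } else {
    // warp N, threads 1-31: barrier instruction I1
    asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
  }
  // All threads naming barrier Id have now synchronized, even though
  // they arrived at different instructions; the aligned form (bar.sync)
  // does not permit this.
}
```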

One hazard is the amdgpu devicertl only has one barrier. D102016 makes it simpler to add a second. I'd guess we want named_sync to call one barrier and syncthreads to call a different one, so we should probably rename those functions. The LDS barrier implementation needs to know how many threads to wait for, we may be OK passing 'all the threads' down from the __syncthreads entry point.

The other is the single instruction pointer per wavefront, like pre-volta nvidia cards (which I believe we also expect to work). I'm not sure whether totally independent barriers will work, or whether we'll need to arrange for thread 0 and thread 1-31 to call the two different barriers at the same point in control flow.
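If a second, software-built barrier were needed, one classic construction is a sense-reversing counting barrier in shared memory. The sketch below is a CUDA-flavored illustration of the idea, not the amdgcn devicertl implementation; all names are invented, and Count/Sense are assumed to be zero-initialized shared-memory words:

```cuda
// Hypothetical counting barrier: NumThreads threads call this, each
// waits until all of them have arrived, then all proceed.
__device__ void counting_barrier(unsigned *Count, unsigned *Sense,
                                 unsigned NumThreads) {
  __threadfence_block();                    // order writes before the barrier
  unsigned MySense = *Sense;                // snapshot the current phase
  if (atomicAdd(Count, 1u) == NumThreads - 1) {
    *Count = 0;                             // last arriver resets the count
    atomicExch(Sense, MySense ^ 1u);        // ... and flips the phase
  } else {
    while (atomicAdd(Sense, 0u) == MySense)
      ;                                     // spin until the phase flips
  }
  __threadfence_block();                    // order reads after the barrier
}
```

Note that this ties into the single-instruction-pointer concern above: on lockstep hardware, threads of one wavefront spinning here while their siblings sit at a different barrier may never make progress.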

So what do you want me to change for this patch now?

> So what do you want me to change for this patch now?

Equivalent change to amdgpu target_impl to the nvptx target_impl, which looks like syncthreads should call a new barrier.

IIUC this has run successfully even without that change, so hopefully that's sufficient evidence that we won't regress on amdgpu. I'd like to get miniqmc running locally to verify as well, but we may not be able to wait for that.

openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
195

why are these weak?

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
65–68

amdgcn presumably needs the same change. Add a barrier and call it from __kmpc_impl_syncthreads.

I think barrier.sync defaults to all threads when the second argument is omitted, so we can use the corresponding kmpc call to get the num_threads argument for it.
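In PTX, `barrier.sync` indeed takes an optional second operand: with only an id it waits for all threads of the CTA, while with a count it waits for that many threads (the count must be a multiple of the warp size). A sketch of the two entry points, with illustrative names:

```cuda
// Count omitted: all threads of the CTA participate.
__device__ void impl_syncthreads(unsigned Id) {
  asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
}

// Only NumThreads threads participate; per the PTX ISA, NumThreads
// must be a multiple of the warp size (32).
__device__ void impl_named_sync(unsigned Id, unsigned NumThreads) {
  asm volatile("barrier.sync %0, %1;" ::"r"(Id), "r"(NumThreads) : "memory");
}
```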

So we don't need changes? I'm not sure what the problem here is.

openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
195

So we do not perform IPO but will inline them. If we performed IPO we would specialize the arguments, even though we still potentially want to change the mode from non-SPMD to SPMD.

JonChesterfield accepted this revision.May 12 2021, 12:16 PM

I'm not certain what this 'aligned' limitation for nvptx syncthreads is, but I can't think of a corresponding one for amdgcn. So we may not need the LDS barrier construction, and it'll be much faster if we don't.

This was reported working on amdgpu by a third party against an earlier trunk build, but sadly the current trunk seems to have regressed (debugging offline). So I have no reason to believe this doesn't work, and some reason to believe it will. Objection withdrawn.

The code itself always looked fine; I was only nervous about the changes to concurrency primitives in nvptx.

This revision is now accepted and ready to land.May 12 2021, 12:16 PM
openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
195

as discussed offline, weak_odr or drop the weak

Update tests

jdoerfert updated this revision to Diff 346621.May 19 2021, 7:59 PM

Drop the weak attribute, will solve the problem differently