This is an archive of the discontinued LLVM Phabricator instance.

Paths

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (6/8)
llvm/include/llvm/Frontend/OpenMP/OMPKinds.def
openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
openmp/libomptarget/deviceRTLs/common/src/parallel.cu (5/5)
openmp/libomptarget/deviceRTLs/common/src/support.cu (1/1)
openmp/libomptarget/deviceRTLs/common/support.h
openmp/libomptarget/deviceRTLs/interface.h
Differential D95976

[OpenMP] Simplify offloading parallel call codegen
Closed, Public

Authored by ggeorgakoudis on Feb 3 2021, 1:51 PM.


Details

Reviewers

jdoerfert
Meinersbur

Commits

rGa2dbfb6b72db: [OpenMP] Simplify offloading parallel call codegen

Summary

This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading, with corresponding changes in libomptarget. SPMD and non-SPMD parallel calls are unified under a single runtime entry point for parallel regions, __kmpc_parallel_51 (which will be commonized between target and host-side parallel regions), and data sharing is internalized to the runtime. Tests have been auto-generated using update_cc_test_checks.py. The revision also contains changes to OpenMPOpt for remark creation on target offloading regions.
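To make the idea of a single entry point concrete, here is a minimal host-only sketch. It is not the libomptarget implementation: `MockRuntime`, `OutlinedFn`, and `incrementCounter` are hypothetical names, and the serialization rule (serialize when the `if` condition is false or a parallel region is already active) is an assumption drawn from the discussion in this review. The point it illustrates is that the caller packs its captures into an array of `void *` and always makes the same call, while the decision between serialized and parallel execution moves into the runtime.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical outlined-region signature: receives the packed capture array.
using OutlinedFn = void (*)(void **Args, std::size_t NumArgs);

// Hypothetical host-side mock of a device runtime with one parallel entry.
struct MockRuntime {
  int ParallelLevel = 0;   // nesting depth tracked by the mock runtime
  int SerializedCalls = 0; // regions the runtime chose to serialize
  int ParallelCalls = 0;   // regions the runtime chose to run in parallel

  // Single entry point: the caller never picks a code path itself.
  void parallel(bool IfCond, OutlinedFn Fn, void **Args, std::size_t NumArgs) {
    // Assumed rule: serialize on a false if-clause or when already nested.
    bool Serialize = !IfCond || ParallelLevel > 0;
    if (Serialize)
      ++SerializedCalls;
    else
      ++ParallelCalls;
    ++ParallelLevel;
    Fn(Args, NumArgs); // the mock simply runs the region on the caller
    --ParallelLevel;
  }
};

// Example region with one captured variable, passed by address.
static void incrementCounter(void **Args, std::size_t NumArgs) {
  assert(NumArgs == 1 && "expected exactly one captured variable");
  ++*static_cast<int *>(Args[0]);
}
```

A caller would build `void *Args[] = {&Counter};` and call `RT.parallel(Cond, incrementCounter, Args, 1)` regardless of execution mode.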

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	190 ms	x64 windows > Clang.OpenMP::nvptx_allocate_codegen.cpp
	230 ms	x64 windows > Clang.OpenMP::nvptx_data_sharing.cpp
	250 ms	x64 windows > Clang.OpenMP::nvptx_distribute_parallel_generic_mode_codegen.cpp
	420 ms	x64 windows > Clang.OpenMP::nvptx_lambda_capturing.cpp
	400 ms	x64 windows > Clang.OpenMP::nvptx_parallel_codegen.cpp
		View Full Test Results (19 Failed)

Event Timeline

ggeorgakoudis created this revision. Feb 3 2021, 1:51 PM

Herald added subscribers: jfb, guansong, yaxunl. Feb 3 2021, 1:51 PM

ggeorgakoudis requested review of this revision. Feb 3 2021, 1:51 PM

Herald added a reviewer: jdoerfert. Feb 3 2021, 1:51 PM

Herald added projects: Restricted Project, Restricted Project, Restricted Project.

Herald added subscribers: llvm-commits, openmp-commits, cfe-commits, sstefan1.

ggeorgakoudis edited the summary of this revision. Feb 3 2021, 1:58 PM

Harbormaster completed remote builds in B87775: Diff 321219. Feb 3 2021, 3:37 PM

Fix type for IfCond, formatting

Harbormaster completed remote builds in B87869: Diff 321375. Feb 4 2021, 5:08 AM

Add tests, update OpenMPOpt, rebase to main

Herald added a subscriber: hiraditya. Apr 13 2021, 7:15 AM

ggeorgakoudis edited the summary of this revision. Apr 13 2021, 7:27 AM

Harbormaster completed remote builds in B98481: Diff 337141. Apr 13 2021, 8:02 AM

Add aux-triple to one test, check unit test builder on windows

Harbormaster completed remote builds in B98504: Diff 337183. Apr 13 2021, 11:33 AM

Fix llvm test

Harbormaster completed remote builds in B98763: Diff 337556. Apr 14 2021, 3:32 PM

ggeorgakoudis added a reviewer: Meinersbur. Apr 15 2021, 7:30 AM

Hi @Meinersbur (got word you are a Windows user), @jdoerfert, could I ask for your help in working out why the clang tests on Windows are failing? I'm spotting two failures: one is that calls to llvm.nvvm intrinsics seem transposed (https://reviews.llvm.org/harbormaster/unit/view/552591/), and the other is that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17). Maybe there is something else I'm missing, and I'd appreciate the extra eyeballing on the problem.

I have only minor remarks, but I'd like you to check whether my hunch is correct and the proposed modifications will fix PR49777 *and* PR49779.
Also, the number of arguments needs to be increased; let's go big and automatic here.

Other than that I think this looks good.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
2199	Can we remove SeqGen while we are here please. We need to check in the runtime anyway. That check is later folded, no need to make things more complicated here.
openmp/libomptarget/deviceRTLs/common/src/parallel.cu
295	This should allow us to remove the `SeqGen` in the Clang CodeGen and fix PR49777 and fix PR49779, a win-win-win situation.
369	FWIW, The implementation here is a stopgap until we move to the new runtime. The codegen and interface are the important parts.
openmp/libomptarget/deviceRTLs/common/src/support.cu
370	Not a return but a `__builtin_trap()`, please. We also need this for more than 16 unfortunately, I've seen 20 in miniqmc. We might want to create a script to print the cases, and then generate 128 or something like that in a file we include. The script can be in the utils folder too.
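The dispatch being discussed can be sketched as a switch over the number of captured arguments. This is only an illustration under assumptions: `GenericFn`, `invokeMicrotask`, and `addInto` are hypothetical names, only three cases are written out by hand (a generator script, as suggested above, would emit many more), and the default case traps rather than returning, as requested in the comment.

```cpp
#include <cstddef>

// Hypothetical type-erased function pointer for an outlined region.
using GenericFn = void (*)();

// Calls Fn with the two thread-id pointers followed by NumArgs captured
// pointers. Each case casts Fn back to the signature with that arity.
inline void invokeMicrotask(GenericFn Fn, int *GlobalTid, int *BoundTid,
                            void **Args, std::size_t NumArgs) {
  switch (NumArgs) {
  case 0:
    reinterpret_cast<void (*)(int *, int *)>(Fn)(GlobalTid, BoundTid);
    break;
  case 1:
    reinterpret_cast<void (*)(int *, int *, void *)>(Fn)(GlobalTid, BoundTid,
                                                         Args[0]);
    break;
  case 2:
    reinterpret_cast<void (*)(int *, int *, void *, void *)>(Fn)(
        GlobalTid, BoundTid, Args[0], Args[1]);
    break;
  default:
    // Unsupported arity: trap instead of silently returning.
    __builtin_trap();
  }
}

// Example outlined function with two captures: adds *A into *B.
static void addInto(int *, int *, void *A, void *B) {
  *static_cast<int *>(B) += *static_cast<int *>(A);
}
```

The hand-written cases make the cost visible: every supported arity needs its own cast and call, which is why generating the cases from a script is attractive.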

The transposition problem arises from:

static llvm::Value *getThreadLimit(CodeGenFunction &CGF,
                                   bool IsInSPMDExecutionMode = false) {
  CGBuilderTy &Bld = CGF.Builder;
  auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
  return IsInSPMDExecutionMode
             ? RT.getGPUNumThreads(CGF)
             : Bld.CreateNUWSub(RT.getGPUNumThreads(CGF),
                                RT.getGPUWarpSize(CGF), "thread_limit");
}

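The order in which the two arguments of CreateNUWSub, RT.getGPUNumThreads(CGF) and RT.getGPUWarpSize(CGF), are evaluated is unspecified in C++; both only have to be evaluated before the call itself. Since each of these calls emits instructions as a side effect, the order of the emitted intrinsic calls depends on the compiler that built clang. The idea is that it would depend on the order in which the function arguments are put on the stack.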

Turns out, clang/gcc evaluate the left argument first, msvc starts with the right one.
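A small standalone sketch of the problem and the usual fix, assuming a hypothetical `MockBuilder` that records each emit call in a log (it stands in for the IR builder and is not part of Clang). Hoisting each emitting call into its own named local makes the order deterministic on every compiler.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an IR builder: every emit call records its name,
// so the log mirrors the order in which instructions would be created.
struct MockBuilder {
  std::vector<std::string> Log;

  int emit(const std::string &Name, int Value) {
    Log.push_back(Name);
    return Value;
  }
};

// Unspecified order: the compiler may evaluate either argument first, so
// Log can end up as {num_threads, warp_size} or {warp_size, num_threads}.
inline int threadLimitUnordered(MockBuilder &B) {
  auto Sub = [](int L, int R) { return L - R; };
  return Sub(B.emit("num_threads", 128), B.emit("warp_size", 32));
}

// Fixed order: each emitting call is its own full expression, so the two
// initializations are sequenced one after the other on every compiler.
inline int threadLimitOrdered(MockBuilder &B) {
  int NumThreads = B.emit("num_threads", 128);
  int WarpSize = B.emit("warp_size", 32);
  return NumThreads - WarpSize;
}
```

Both functions return the same value; only the order of the log entries, and hence of the emitted instructions in the real codegen, can differ for the unordered variant.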

Fix for getThreadLimit

Harbormaster completed remote builds in B99167: Diff 338102. Apr 16 2021, 8:15 AM

Meinersbur requested changes to this revision. Apr 16 2021, 11:35 AM

Meinersbur added inline comments.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
572–573	getGPUNumThreads and getGPUWarpSize still have undefined call order.

This revision now requires changes to proceed. Apr 16 2021, 11:35 AM

Update for comments, fix for windows fix

ggeorgakoudis marked 4 inline comments as done. Apr 16 2021, 2:45 PM

ggeorgakoudis added inline comments.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
2199	Done
openmp/libomptarget/deviceRTLs/common/src/parallel.cu
295	Please check

With the nit to add the two reproducers, LGTM. (Please make sure to run FAROS or some benchmarks we have before committing.)

openmp/libomptarget/deviceRTLs/common/src/parallel.cu
295	Check? Can we add the two reproducers as tests, please. One should be a clang test, the other maybe a runtime test, though clang test might suffice.
openmp/libomptarget/utils/generate_microtask_cases.py
31 ↗	(On Diff #338246)	Great. The output is not pretty but that was not the objective ;)

I have not looked at the other mentioned problem yet:

another that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17)

Which might still be there.

I would like to wait for Harbormaster to complete the pre-merge check.

Harbormaster completed remote builds in B99272: Diff 338246. Apr 16 2021, 6:00 PM

Meinersbur added inline comments. Apr 16 2021, 7:41 PM

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1135	There seem to be more unordered codegen calls, such as this one.
1144	There seem to be more unordered codegen calls, such as this one.

Update for comments, fixes

ggeorgakoudis marked 4 inline comments as done. Apr 19 2021, 12:55 AM

ggeorgakoudis added inline comments.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1144	Some previously emitted values can be reused, e.g., GPUThreadID in line 1150 can reuse the value from line 1140 instead of being re-emitted. I've kept emitting them as was done previously. What is the preferred way to handle those?
openmp/libomptarget/deviceRTLs/common/src/parallel.cu
295	Ack, will do

Harbormaster completed remote builds in B99430: Diff 338441. Apr 19 2021, 1:51 AM

Meinersbur added inline comments. Apr 19 2021, 8:58 AM

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
585–586	This is another undefined codegen call order which causes the current pre-merge checks to fail.

Fix

Harbormaster completed remote builds in B99508: Diff 338554. Apr 19 2021, 11:31 AM

Add tests, reduce microtask cases to avoid stack problems

ggeorgakoudis marked an inline comment as done. Apr 21 2021, 9:07 AM

Harbormaster completed remote builds in B100008: Diff 339265. Apr 21 2021, 10:31 AM

The tests seem to pass on Windows now. Please still fix the clang-format remarks, such as lines going over 80 characters.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1144	`getGPUThreadID`/`getMasterThreadID` could cache the value if it is used multiple times, but that would also require emitting it in the entry block so that it is available anywhere in the function. Otherwise, make a best effort to minimize overhead, even if the optimizer cannot unify them or in debug builds.

This revision is now accepted and ready to land. Apr 21 2021, 10:54 AM

Fix clang-format

Harbormaster completed remote builds in B100061: Diff 339334. Apr 21 2021, 1:50 PM

This revision was landed with ongoing or failed builds. Apr 21 2021, 6:46 PM

Closed by commit rGa2dbfb6b72db: [OpenMP] Simplify offloading parallel call codegen (authored by ggeorgakoudis).

This revision was automatically updated to reflect the committed changes.

ggeorgakoudis added a commit: rGa2dbfb6b72db: [OpenMP] Simplify offloading parallel call codegen.

JonChesterfield mentioned this in D101123: [OpenMP] Avoid reading uninitialized parallel level values. Apr 22 2021, 6:55 PM

jdoerfert mentioned this in D99762: [OPENMP]Fix PR49777: Clang should not try to specialize orphaned directives in device codegen. Apr 27 2021, 9:42 PM

Please update the test with a NFC commit.

openmp/libomptarget/test/offloading/bug49779.cpp
1–5 ↗	(On Diff #339441)	See D101326
29–36 ↗	(On Diff #339441)	Since the output goes to FileCheck anyway, I think we should avoid asserts and let FileCheck test for the expected results. With this approach, the output of a failing test carries more information.
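As a rough illustration of that suggestion (not the actual bug49779.cpp test), the sketch below prints observed values and leaves the verdict to a `CHECK` line. `resultLine` and `reportResults` are hypothetical helpers, the expected value 42 is made up for the example, and the RUN lines a real lit test needs are omitted.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: one line of output per observed value.
std::string resultLine(const std::string &Name, int Value) {
  return Name + ": " + std::to_string(Value);
}

// Prints the observation for FileCheck to match and always returns 0, so
// the verdict comes from the CHECK line rather than from assert().
int reportResults(int Observed) {
  // CHECK: result: 42
  std::puts(resultLine("result", Observed).c_str());
  return 0;
}
```

When the observed value is wrong, FileCheck reports the mismatching output line, whereas a failed assert would only show the asserted expression.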

In D95976#2725027, @protze.joachim wrote:

Please update the test with a NFC commit.

Thanks, @protze.joachim. The changes look good. I'll get that NFC commit in soon-ish, unless you would like to take over.

JonChesterfield added a subscriber: JonChesterfield. May 12 2021, 6:48 AM

JonChesterfield added inline comments. May 12 2021, 6:55 AM

openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen
1 ↗	(On Diff #339441)	This is not very pretty. Why do we need runtime dispatch to a function pointer?

jdoerfert added inline comments. May 12 2021, 8:06 AM

openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen
1 ↗	(On Diff #339441)	Because we have variadic functions right now. A patch to remove this is already underway: https://reviews.llvm.org/D102107

JonChesterfield mentioned this in D105697: [libomptarget][nfc] Drop dead code in parallel_51. Jul 9 2021, 6:01 AM

JonChesterfield mentioned this in D102107: [OpenMP] Codegen aggregate for outlined function captures. Jun 6 2023, 6:24 AM

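Revision Contents

Path	Size
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp	256 lines
llvm/include/llvm/Frontend/OpenMP/OMPKinds.def	2 lines
openmp/libomptarget/deviceRTLs/common/src/omptarget.cu	5 lines
openmp/libomptarget/deviceRTLs/common/src/parallel.cu	108 lines
openmp/libomptarget/deviceRTLs/common/src/support.cu	106 lines
openmp/libomptarget/deviceRTLs/common/support.h	5 lines
openmp/libomptarget/deviceRTLs/interface.h	17 lines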

Diff 321375

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp

[563 lines not shown]	static llvm::Value *getThreadLimit(CodeGenFunction &CGF,
auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());		auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
return IsInSPMDExecutionMode		return IsInSPMDExecutionMode
? RT.getGPUNumThreads(CGF)		? RT.getGPUNumThreads(CGF)
: Bld.CreateNUWSub(RT.getGPUNumThreads(CGF),		: Bld.CreateNUWSub(RT.getGPUNumThreads(CGF),
RT.getGPUWarpSize(CGF), "thread_limit");		RT.getGPUWarpSize(CGF), "thread_limit");
}		}

/// Get the thread id of the OMP master thread.		/// Get the thread id of the OMP master thread.
/// The master thread id is the first thread (lane) of the last warp in the		/// The master thread id is the first thread (lane) of the last warp in the
/// GPU block. Warp size is assumed to be some power of 2.		/// GPU block. Warp size is assumed to be some power of 2.
		Meinersbur [Done]: getGPUNumThreads and getGPUWarpSize still have undefined call order.
/// Thread id is 0 indexed.		/// Thread id is 0 indexed.
/// E.g: If NumThreads is 33, master id is 32.		/// E.g: If NumThreads is 33, master id is 32.
/// If NumThreads is 64, master id is 32.		/// If NumThreads is 64, master id is 32.
/// If NumThreads is 1024, master id is 992.		/// If NumThreads is 1024, master id is 992.
static llvm::Value *getMasterThreadID(CodeGenFunction &CGF) {		static llvm::Value *getMasterThreadID(CodeGenFunction &CGF) {
CGBuilderTy &Bld = CGF.Builder;		CGBuilderTy &Bld = CGF.Builder;
auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());		auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
llvm::Value *NumThreads = RT.getGPUNumThreads(CGF);		llvm::Value *NumThreads = RT.getGPUNumThreads(CGF);
// We assume that the warp size is a power of 2.		// We assume that the warp size is a power of 2.
llvm::Value *Mask = Bld.CreateNUWSub(RT.getGPUWarpSize(CGF), Bld.getInt32(1));		llvm::Value *Mask = Bld.CreateNUWSub(RT.getGPUWarpSize(CGF), Bld.getInt32(1));

return Bld.CreateAnd(Bld.CreateNUWSub(NumThreads, Bld.getInt32(1)),		return Bld.CreateAnd(Bld.CreateNUWSub(NumThreads, Bld.getInt32(1)),
Bld.CreateNot(Mask), "master_tid");		Bld.CreateNot(Mask), "master_tid");
		Meinersbur [Done]: This is another undefined codegen call order which causes the current pre-merge checks to fail.
}		}

CGOpenMPRuntimeGPU::WorkerFunctionState::WorkerFunctionState(		CGOpenMPRuntimeGPU::WorkerFunctionState::WorkerFunctionState(
CodeGenModule &CGM, SourceLocation Loc)		CodeGenModule &CGM, SourceLocation Loc)
: WorkerFn(nullptr), CGFI(CGM.getTypes().arrangeNullaryFunction()),		: WorkerFn(nullptr), CGFI(CGM.getTypes().arrangeNullaryFunction()),
Loc(Loc) {		Loc(Loc) {
createWorkerFunction(CGM);		createWorkerFunction(CGM);
}		}
[532 lines not shown]	void CGOpenMPRuntimeGPU::emitNonSPMDEntryHeader(CodeGenFunction &CGF,

llvm::BasicBlock *WorkerBB = CGF.createBasicBlock(".worker");		llvm::BasicBlock *WorkerBB = CGF.createBasicBlock(".worker");
llvm::BasicBlock *MasterCheckBB = CGF.createBasicBlock(".mastercheck");		llvm::BasicBlock *MasterCheckBB = CGF.createBasicBlock(".mastercheck");
llvm::BasicBlock *MasterBB = CGF.createBasicBlock(".master");		llvm::BasicBlock *MasterBB = CGF.createBasicBlock(".master");
EST.ExitBB = CGF.createBasicBlock(".exit");		EST.ExitBB = CGF.createBasicBlock(".exit");

auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());		auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
llvm::Value *IsWorker =		llvm::Value *IsWorker =
Bld.CreateICmpULT(RT.getGPUThreadID(CGF), getThreadLimit(CGF));		Bld.CreateICmpULT(RT.getGPUThreadID(CGF), getThreadLimit(CGF));
		Meinersbur [Done]: There seem to be more unordered codegen calls, such as this one.
Bld.CreateCondBr(IsWorker, WorkerBB, MasterCheckBB);		Bld.CreateCondBr(IsWorker, WorkerBB, MasterCheckBB);

CGF.EmitBlock(WorkerBB);		CGF.EmitBlock(WorkerBB);
emitCall(CGF, WST.Loc, WST.WorkerFn);		emitCall(CGF, WST.Loc, WST.WorkerFn);
CGF.EmitBranch(EST.ExitBB);		CGF.EmitBranch(EST.ExitBB);

CGF.EmitBlock(MasterCheckBB);		CGF.EmitBlock(MasterCheckBB);
llvm::Value *IsMaster =		llvm::Value *IsMaster =
Bld.CreateICmpEQ(RT.getGPUThreadID(CGF), getMasterThreadID(CGF));		Bld.CreateICmpEQ(RT.getGPUThreadID(CGF), getMasterThreadID(CGF));
		Meinersbur [Done]: There seem to be more unordered codegen calls, such as this one.
		ggeorgakoudis (author) [Not Done]: Some previously emitted values can be reused, e.g., GPUThreadID in line 1150 can reuse the value from line 1140 instead of being re-emitted. I've kept emitting them as was done previously. What is the preferred way to handle those?
		Meinersbur [Not Done]: `getGPUThreadID`/`getMasterThreadID` could cache the value if it is used multiple times, but that would also require emitting it in the entry block so that it is available anywhere in the function. Otherwise, make a best effort to minimize overhead, even if the optimizer cannot unify them or in debug builds.
Bld.CreateCondBr(IsMaster, MasterBB, EST.ExitBB);		Bld.CreateCondBr(IsMaster, MasterBB, EST.ExitBB);

CGF.EmitBlock(MasterBB);		CGF.EmitBlock(MasterBB);
IsInTargetMasterThreadRegion = true;		IsInTargetMasterThreadRegion = true;
// SEQUENTIAL (MASTER) REGION START		// SEQUENTIAL (MASTER) REGION START
// First action in sequential region:		// First action in sequential region:
// Initialize the state of the OpenMP runtime library on the GPU.		// Initialize the state of the OpenMP runtime library on the GPU.
// TODO: Optimize runtime initialization and pass in correct value.		// TODO: Optimize runtime initialization and pass in correct value.
[914 lines not shown]	void CGOpenMPRuntimeGPU::emitTeamsCall(CodeGenFunction &CGF,
CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0));		CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0));
llvm::SmallVector<llvm::Value *, 16> OutlinedFnArgs;		llvm::SmallVector<llvm::Value *, 16> OutlinedFnArgs;
OutlinedFnArgs.push_back(emitThreadIDAddress(CGF, Loc).getPointer());		OutlinedFnArgs.push_back(emitThreadIDAddress(CGF, Loc).getPointer());
OutlinedFnArgs.push_back(ZeroAddr.getPointer());		OutlinedFnArgs.push_back(ZeroAddr.getPointer());
OutlinedFnArgs.append(CapturedVars.begin(), CapturedVars.end());		OutlinedFnArgs.append(CapturedVars.begin(), CapturedVars.end());
emitOutlinedFunctionCall(CGF, Loc, OutlinedFn, OutlinedFnArgs);		emitOutlinedFunctionCall(CGF, Loc, OutlinedFn, OutlinedFnArgs);
}		}

void CGOpenMPRuntimeGPU::emitParallelCall(		void CGOpenMPRuntimeGPU::emitParallelCall(CodeGenFunction &CGF,
CodeGenFunction &CGF, SourceLocation Loc, llvm::Function *OutlinedFn,		SourceLocation Loc,
ArrayRef<llvm::Value *> CapturedVars, const Expr *IfCond) {		llvm::Function *OutlinedFn,
		ArrayRef<llvm::Value *> CapturedVars,
		const Expr *IfCond) {
if (!CGF.HaveInsertPoint())		if (!CGF.HaveInsertPoint())
return;		return;

if (getExecutionMode() == CGOpenMPRuntimeGPU::EM_SPMD)		auto &&CodeGen = [this, OutlinedFn, CapturedVars,
emitSPMDParallelCall(CGF, Loc, OutlinedFn, CapturedVars, IfCond);		Loc](CodeGenFunction &CGF, PrePostActionTy &Action) {
else		Action.Enter(CGF);
emitNonSPMDParallelCall(CGF, Loc, OutlinedFn, CapturedVars, IfCond);
}

void CGOpenMPRuntimeGPU::emitNonSPMDParallelCall(
CodeGenFunction &CGF, SourceLocation Loc, llvm::Value *OutlinedFn,
ArrayRef<llvm::Value *> CapturedVars, const Expr *IfCond) {
llvm::Function *Fn = cast<llvm::Function>(OutlinedFn);		llvm::Function *Fn = cast<llvm::Function>(OutlinedFn);

// Force inline this outlined function at its call site.		// Force inline this outlined function at its call site.
Fn->setLinkage(llvm::GlobalValue::InternalLinkage);		Fn->setLinkage(llvm::GlobalValue::InternalLinkage);

// Ensure we do not inline the function. This is trivially true for the ones		// Ensure we do not inline the function. This is trivially true for the ones
// passed to __kmpc_fork_call but the ones calles in serialized regions		// passed to __kmpc_fork_call but the ones calles in serialized regions
// could be inlined. This is not a perfect but it is closer to the invariant		// could be inlined. This is not a perfect but it is closer to the invariant
// we want, namely, every data environment starts with a new function.		// we want, namely, every data environment starts with a new function.
// TODO: We should pass the if condition to the runtime function and do the		// TODO: We should pass the if condition to the runtime function and do the
// handling there. Much cleaner code.		// handling there. Much cleaner code.
cast<llvm::Function>(OutlinedFn)->addFnAttr(llvm::Attribute::NoInline);		cast<llvm::Function>(OutlinedFn)->addFnAttr(llvm::Attribute::NoInline);

Address ZeroAddr = CGF.CreateDefaultAlignTempAlloca(CGF.Int32Ty,		Address ZeroAddr = CGF.CreateDefaultAlignTempAlloca(CGF.Int32Ty,
/*Name=*/".zero.addr");		/*Name=*/".zero.addr");
CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0));		CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0));
// ThreadId for serialized parallels is 0.		// ThreadId for serialized parallels is 0.
Address ThreadIDAddr = ZeroAddr;		Address ThreadIDAddr = ZeroAddr;
auto &&CodeGen = [this, Fn, CapturedVars, Loc, &ThreadIDAddr](
CodeGenFunction &CGF, PrePostActionTy &Action) {
Action.Enter(CGF);

Address ZeroAddr =
CGF.CreateDefaultAlignTempAlloca(CGF.Int32Ty,
/*Name=*/".bound.zero.addr");
CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0));
llvm::SmallVector<llvm::Value *, 16> OutlinedFnArgs;		llvm::SmallVector<llvm::Value *, 16> OutlinedFnArgs;
OutlinedFnArgs.push_back(ThreadIDAddr.getPointer());		OutlinedFnArgs.push_back(ThreadIDAddr.getPointer());
OutlinedFnArgs.push_back(ZeroAddr.getPointer());		OutlinedFnArgs.push_back(ZeroAddr.getPointer());
OutlinedFnArgs.append(CapturedVars.begin(), CapturedVars.end());		OutlinedFnArgs.append(CapturedVars.begin(), CapturedVars.end());
emitOutlinedFunctionCall(CGF, Loc, Fn, OutlinedFnArgs);		emitOutlinedFunctionCall(CGF, Loc, Fn, OutlinedFnArgs);
};		};

auto &&SeqGen = [this, &CodeGen, Loc](CodeGenFunction &CGF,		auto &&SeqGen = [this, &CodeGen, Loc](CodeGenFunction &CGF,
PrePostActionTy &) {		PrePostActionTy &) {

RegionCodeGenTy RCG(CodeGen);		RegionCodeGenTy RCG(CodeGen);
llvm::Value *RTLoc = emitUpdateLocation(CGF, Loc);		llvm::Value *RTLoc = emitUpdateLocation(CGF, Loc);
llvm::Value *ThreadID = getThreadID(CGF, Loc);		llvm::Value *ThreadID = getThreadID(CGF, Loc);
llvm::Value *Args[] = {RTLoc, ThreadID};		llvm::Value *Args[] = {RTLoc, ThreadID};

NVPTXActionTy Action(		NVPTXActionTy Action(
OMPBuilder.getOrCreateRuntimeFunction(		OMPBuilder.getOrCreateRuntimeFunction(
CGM.getModule(), OMPRTL___kmpc_serialized_parallel),		CGM.getModule(), OMPRTL___kmpc_serialized_parallel),
Args,		Args,
OMPBuilder.getOrCreateRuntimeFunction(		OMPBuilder.getOrCreateRuntimeFunction(
CGM.getModule(), OMPRTL___kmpc_end_serialized_parallel),		CGM.getModule(), OMPRTL___kmpc_end_serialized_parallel),
Args);		Args);
RCG.setAction(Action);		RCG.setAction(Action);
RCG(CGF);		RCG(CGF);
};		};

auto &&L0ParallelGen = [this, CapturedVars, Fn](CodeGenFunction &CGF,		auto &&ParallelGen = [this, Loc, OutlinedFn, CapturedVars,
PrePostActionTy &Action) {		IfCond](CodeGenFunction &CGF, PrePostActionTy &Action) {
CGBuilderTy &Bld = CGF.Builder;		CGBuilderTy &Bld = CGF.Builder;
llvm::Function *WFn = WrapperFunctionsMap[Fn];		llvm::Function *WFn = WrapperFunctionsMap[OutlinedFn];
assert(WFn && "Wrapper function does not exist!");		llvm::Value *ID = llvm::ConstantPointerNull::get(CGM.Int8PtrTy);
llvm::Value *ID = Bld.CreateBitOrPointerCast(WFn, CGM.Int8PtrTy);		if (WFn) {
		ID = Bld.CreateBitOrPointerCast(WFn, CGM.Int8PtrTy);
// Prepare for parallel region. Indicate the outlined function.		// Remember for post-processing in worker loop.
llvm::Value *Args[] = {ID};		Work.emplace_back(WFn);
CGF.EmitRuntimeCall(		}
OMPBuilder.getOrCreateRuntimeFunction(		llvm::Value *FnPtr = Bld.CreateBitOrPointerCast(OutlinedFn, CGM.Int8PtrTy);
CGM.getModule(), OMPRTL___kmpc_kernel_prepare_parallel),
Args);

// Create a private scope that will globalize the arguments		// Create a private scope that will globalize the arguments
// passed from the outside of the target region.		// passed from the outside of the target region.
		// TODO: Is that needed?
CodeGenFunction::OMPPrivateScope PrivateArgScope(CGF);		CodeGenFunction::OMPPrivateScope PrivateArgScope(CGF);

		Address CapturedVarsAddrs = CGF.CreateDefaultAlignTempAlloca(
		llvm::ArrayType::get(CGM.VoidPtrTy, CapturedVars.size()),
		"captured_vars_addrs");
// There's something to share.		// There's something to share.
if (!CapturedVars.empty()) {		if (!CapturedVars.empty()) {
// Prepare for parallel region. Indicate the outlined function.		// Prepare for parallel region. Indicate the outlined function.
Address SharedArgs =
CGF.CreateDefaultAlignTempAlloca(CGF.VoidPtrPtrTy, "shared_arg_refs");
llvm::Value *SharedArgsPtr = SharedArgs.getPointer();

llvm::Value *DataSharingArgs[] = {
SharedArgsPtr,
llvm::ConstantInt::get(CGM.SizeTy, CapturedVars.size())};
CGF.EmitRuntimeCall(
OMPBuilder.getOrCreateRuntimeFunction(
CGM.getModule(), OMPRTL___kmpc_begin_sharing_variables),
DataSharingArgs);

// Store variable address in a list of references to pass to workers.
unsigned Idx = 0;
ASTContext &Ctx = CGF.getContext();		ASTContext &Ctx = CGF.getContext();
Address SharedArgListAddress = CGF.EmitLoadOfPointer(		unsigned Idx = 0;
SharedArgs, Ctx.getPointerType(Ctx.getPointerType(Ctx.VoidPtrTy))
.castAs<PointerType>());
for (llvm::Value *V : CapturedVars) {		for (llvm::Value *V : CapturedVars) {
Address Dst = Bld.CreateConstInBoundsGEP(SharedArgListAddress, Idx);		Address Dst = Bld.CreateConstArrayGEP(CapturedVarsAddrs, Idx);
llvm::Value *PtrV;		llvm::Value *PtrV;
if (V->getType()->isIntegerTy())		if (V->getType()->isIntegerTy())
PtrV = Bld.CreateIntToPtr(V, CGF.VoidPtrTy);		PtrV = Bld.CreateIntToPtr(V, CGF.VoidPtrTy);
else		else
PtrV = Bld.CreatePointerBitCastOrAddrSpaceCast(V, CGF.VoidPtrTy);		PtrV = Bld.CreatePointerBitCastOrAddrSpaceCast(V, CGF.VoidPtrTy);
CGF.EmitStoreOfScalar(PtrV, Dst, /*Volatile=*/false,		CGF.EmitStoreOfScalar(PtrV, Dst, /*Volatile=*/false,
Ctx.getPointerType(Ctx.VoidPtrTy));		Ctx.getPointerType(Ctx.VoidPtrTy));
++Idx;		++Idx;
}		}
}		}

// Activate workers. This barrier is used by the master to signal		llvm::Value *IfCondVal = nullptr;
// work for the workers.		if (IfCond)
syncCTAThreads(CGF);		IfCondVal = Bld.CreateIntCast(CGF.EvaluateExprAsBool(IfCond), CGF.Int32Ty,
		/* isSigned */ false);
// OpenMP [2.5, Parallel Construct, p.49]		else
// There is an implied barrier at the end of a parallel region. After the		IfCondVal = llvm::ConstantInt::get(CGF.Int32Ty, 1);
// end of a parallel region, only the master thread of the team resumes
// execution of the enclosing task region.
//
// The master waits at this barrier until all workers are done.
syncCTAThreads(CGF);

if (!CapturedVars.empty())
CGF.EmitRuntimeCall(OMPBuilder.getOrCreateRuntimeFunction(
CGM.getModule(), OMPRTL___kmpc_end_sharing_variables));

// Remember for post-processing in worker loop.
Work.emplace_back(WFn);
};

auto &&LNParallelGen = [this, Loc, &SeqGen, &L0ParallelGen](
CodeGenFunction &CGF, PrePostActionTy &Action) {
if (IsInParallelRegion) {
SeqGen(CGF, Action);
} else if (IsInTargetMasterThreadRegion) {
L0ParallelGen(CGF, Action);
} else {
// Check for master and then parallelism:
// if (__kmpc_is_spmd_exec_mode() || __kmpc_parallel_level(loc, gtid)) {
// Serialized execution.
// } else {
// Worker call.
// }
CGBuilderTy &Bld = CGF.Builder;
llvm::BasicBlock *ExitBB = CGF.createBasicBlock(".exit");
llvm::BasicBlock *SeqBB = CGF.createBasicBlock(".sequential");
llvm::BasicBlock *ParallelCheckBB = CGF.createBasicBlock(".parcheck");
llvm::BasicBlock *MasterBB = CGF.createBasicBlock(".master");
llvm::Value *IsSPMD = Bld.CreateIsNotNull(
CGF.EmitNounwindRuntimeCall(OMPBuilder.getOrCreateRuntimeFunction(
CGM.getModule(), OMPRTL___kmpc_is_spmd_exec_mode)));
Bld.CreateCondBr(IsSPMD, SeqBB, ParallelCheckBB);
// There is no need to emit line number for unconditional branch.
(void)ApplyDebugLocation::CreateEmpty(CGF);
CGF.EmitBlock(ParallelCheckBB);
llvm::Value *RTLoc = emitUpdateLocation(CGF, Loc);
llvm::Value *ThreadID = getThreadID(CGF, Loc);
llvm::Value *PL = CGF.EmitRuntimeCall(
OMPBuilder.getOrCreateRuntimeFunction(CGM.getModule(),
OMPRTL___kmpc_parallel_level),
{RTLoc, ThreadID});
llvm::Value *Res = Bld.CreateIsNotNull(PL);
Bld.CreateCondBr(Res, SeqBB, MasterBB);
CGF.EmitBlock(SeqBB);
SeqGen(CGF, Action);
CGF.EmitBranch(ExitBB);
// There is no need to emit line number for unconditional branch.
(void)ApplyDebugLocation::CreateEmpty(CGF);
CGF.EmitBlock(MasterBB);
L0ParallelGen(CGF, Action);
CGF.EmitBranch(ExitBB);
// There is no need to emit line number for unconditional branch.
(void)ApplyDebugLocation::CreateEmpty(CGF);
// Emit the continuation block for code after the if.
CGF.EmitBlock(ExitBB, /*IsFinished=*/true);
}
};

if (IfCond) {
emitIfClause(CGF, IfCond, LNParallelGen, SeqGen);
} else {
CodeGenFunction::RunCleanupsScope Scope(CGF);
RegionCodeGenTy ThenRCG(LNParallelGen);
ThenRCG(CGF);
}
}

void CGOpenMPRuntimeGPU::emitSPMDParallelCall(
CodeGenFunction &CGF, SourceLocation Loc, llvm::Function *OutlinedFn,
ArrayRef<llvm::Value *> CapturedVars, const Expr *IfCond) {
// Just call the outlined function to execute the parallel region.
// OutlinedFn(&GTid, &zero, CapturedStruct);
//
llvm::SmallVector<llvm::Value *, 16> OutlinedFnArgs;

Address ZeroAddr = CGF.CreateDefaultAlignTempAlloca(CGF.Int32Ty,
/*Name=*/".zero.addr");
CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0));
// ThreadId for serialized parallels is 0.
Address ThreadIDAddr = ZeroAddr;
auto &&CodeGen = [this, OutlinedFn, CapturedVars, Loc, &ThreadIDAddr](
CodeGenFunction &CGF, PrePostActionTy &Action) {
Action.Enter(CGF);

Address ZeroAddr =
CGF.CreateDefaultAlignTempAlloca(CGF.Int32Ty,
/*Name=*/".bound.zero.addr");
CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0));
llvm::SmallVector<llvm::Value *, 16> OutlinedFnArgs;
OutlinedFnArgs.push_back(ThreadIDAddr.getPointer());
OutlinedFnArgs.push_back(ZeroAddr.getPointer());
OutlinedFnArgs.append(CapturedVars.begin(), CapturedVars.end());
emitOutlinedFunctionCall(CGF, Loc, OutlinedFn, OutlinedFnArgs);
};
auto &&SeqGen = [this, &CodeGen, Loc](CodeGenFunction &CGF,
PrePostActionTy &) {

RegionCodeGenTy RCG(CodeGen);		assert(IfCondVal && "Expected a value");
llvm::Value *RTLoc = emitUpdateLocation(CGF, Loc);		llvm::Value *RTLoc = emitUpdateLocation(CGF, Loc);
llvm::Value *ThreadID = getThreadID(CGF, Loc);		llvm::Value *Args[] = {
llvm::Value *Args[] = {RTLoc, ThreadID};		RTLoc,
		getThreadID(CGF, Loc),
NVPTXActionTy Action(		IfCondVal,
OMPBuilder.getOrCreateRuntimeFunction(		llvm::ConstantInt::get(CGF.Int32Ty, -1),
CGM.getModule(), OMPRTL___kmpc_serialized_parallel),		llvm::ConstantInt::get(CGF.Int32Ty, -1),
Args,		FnPtr,
OMPBuilder.getOrCreateRuntimeFunction(		ID,
CGM.getModule(), OMPRTL___kmpc_end_serialized_parallel),		Bld.CreateBitOrPointerCast(CapturedVarsAddrs.getPointer(),
		CGF.VoidPtrPtrTy),
		llvm::ConstantInt::get(CGM.SizeTy, CapturedVars.size())};
		CGF.EmitRuntimeCall(OMPBuilder.getOrCreateRuntimeFunction(
		CGM.getModule(), OMPRTL___kmpc_parallel_51),
Args);		Args);
RCG.setAction(Action);
RCG(CGF);
};		};

if (IsInTargetMasterThreadRegion) {		if (IsInParallelRegion) {
// In the worker need to use the real thread id.		RegionCodeGenTy RCG(SeqGen);
ThreadIDAddr = emitThreadIDAddress(CGF, Loc);
RegionCodeGenTy RCG(CodeGen);
RCG(CGF);		RCG(CGF);
} else {		} else {
// If we are not in the target region, it is definitely L2 parallelism or		RegionCodeGenTy RCG(ParallelGen);
// more, because for SPMD mode we always has L1 parallel level, sowe don't
// need to check for orphaned directives.
RegionCodeGenTy RCG(SeqGen);
RCG(CGF);		RCG(CGF);
}		}
		jdoerfert (inline comment): Can we remove SeqGen while we are here please. We need to check in the runtime anyway. That check is later folded, no need to make things more complicated here.
		ggeorgakoudis (author): Done
}		}

void CGOpenMPRuntimeGPU::syncCTAThreads(CodeGenFunction &CGF) {		void CGOpenMPRuntimeGPU::syncCTAThreads(CodeGenFunction &CGF) {
// Always emit simple barriers!		// Always emit simple barriers!
if (!CGF.HaveInsertPoint())		if (!CGF.HaveInsertPoint())
return;		return;
// Build call __kmpc_barrier_simple_spmd(nullptr, 0);		// Build call __kmpc_barrier_simple_spmd(nullptr, 0);
// This function does not use parameters, so we can emit just default values.		// This function does not use parameters, so we can emit just default values.
[... 2,529 lines not shown ...]

llvm/include/llvm/Frontend/OpenMP/OMPKinds.def

[... 406 lines not shown ...]
__OMP_RTL(__kmpc_task_allow_completion_event, false, VoidPtr, IdentPtr,		__OMP_RTL(__kmpc_task_allow_completion_event, false, VoidPtr, IdentPtr,
/* Int */ Int32, /* kmp_task_t */ VoidPtr)		/* Int */ Int32, /* kmp_task_t */ VoidPtr)

/// OpenMP Device runtime functions		/// OpenMP Device runtime functions
__OMP_RTL(__kmpc_kernel_init, false, Void, Int32, Int16)		__OMP_RTL(__kmpc_kernel_init, false, Void, Int32, Int16)
__OMP_RTL(__kmpc_kernel_deinit, false, Void, Int16)		__OMP_RTL(__kmpc_kernel_deinit, false, Void, Int16)
__OMP_RTL(__kmpc_spmd_kernel_init, false, Void, Int32, Int16)		__OMP_RTL(__kmpc_spmd_kernel_init, false, Void, Int32, Int16)
__OMP_RTL(__kmpc_spmd_kernel_deinit_v2, false, Void, Int16)		__OMP_RTL(__kmpc_spmd_kernel_deinit_v2, false, Void, Int16)
__OMP_RTL(__kmpc_kernel_prepare_parallel, false, Void, VoidPtr)		__OMP_RTL(__kmpc_kernel_prepare_parallel, false, Void, VoidPtr)
		__OMP_RTL(__kmpc_parallel_51, false, Void, IdentPtr, Int32, Int32, Int32, Int32,
		VoidPtr, VoidPtr, VoidPtrPtr, SizeTy)
__OMP_RTL(__kmpc_kernel_parallel, false, Int1, VoidPtrPtr)		__OMP_RTL(__kmpc_kernel_parallel, false, Int1, VoidPtrPtr)
__OMP_RTL(__kmpc_kernel_end_parallel, false, Void, )		__OMP_RTL(__kmpc_kernel_end_parallel, false, Void, )
__OMP_RTL(__kmpc_serialized_parallel, false, Void, IdentPtr, Int32)		__OMP_RTL(__kmpc_serialized_parallel, false, Void, IdentPtr, Int32)
__OMP_RTL(__kmpc_end_serialized_parallel, false, Void, IdentPtr, Int32)		__OMP_RTL(__kmpc_end_serialized_parallel, false, Void, IdentPtr, Int32)
__OMP_RTL(__kmpc_shuffle_int32, false, Int32, Int32, Int16, Int16)		__OMP_RTL(__kmpc_shuffle_int32, false, Int32, Int32, Int16, Int16)
__OMP_RTL(__kmpc_nvptx_parallel_reduce_nowait_v2, false, Int32, IdentPtr, Int32,		__OMP_RTL(__kmpc_nvptx_parallel_reduce_nowait_v2, false, Int32, IdentPtr, Int32,
Int32, SizeTy, VoidPtr, ShuffleReducePtr, InterWarpCopyPtr)		Int32, SizeTy, VoidPtr, ShuffleReducePtr, InterWarpCopyPtr)
__OMP_RTL(__kmpc_nvptx_end_reduce_nowait, false, Void, Int32)		__OMP_RTL(__kmpc_nvptx_end_reduce_nowait, false, Void, Int32)
[... 719 lines not shown ...]

openmp/libomptarget/deviceRTLs/common/src/omptarget.cu

	[... 81 lines not shown ...]
	EXTERN void __kmpc_spmd_kernel_init(int ThreadLimit, int16_t RequiresOMPRuntime) {			EXTERN void __kmpc_spmd_kernel_init(int ThreadLimit, int16_t RequiresOMPRuntime) {
	PRINT0(LD_IO, "call to __kmpc_spmd_kernel_init\n");			PRINT0(LD_IO, "call to __kmpc_spmd_kernel_init\n");

	setExecutionParameters(Spmd, RequiresOMPRuntime ? RuntimeInitialized			setExecutionParameters(Spmd, RequiresOMPRuntime ? RuntimeInitialized
	: RuntimeUninitialized);			: RuntimeUninitialized);
	int threadId = GetThreadIdInBlock();			int threadId = GetThreadIdInBlock();
	if (threadId == 0) {			if (threadId == 0) {
	usedSlotIdx = __kmpc_impl_smid() % MAX_SM;			usedSlotIdx = __kmpc_impl_smid() % MAX_SM;
	parallelLevel[0] =
	1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
	} else if (GetLaneId() == 0) {
	parallelLevel[GetWarpId()] =
	1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
	}			}
	if (!RequiresOMPRuntime) {			if (!RequiresOMPRuntime) {
	// Runtime is not required - exit.			// Runtime is not required - exit.
	__kmpc_impl_syncthreads();			__kmpc_impl_syncthreads();
	return;			return;
	}			}

	//			//
	[... 60 lines not shown ...]

openmp/libomptarget/deviceRTLs/common/src/parallel.cu

[... 148 lines not shown ...]

omptarget_nvptx_threadPrivateContext->SetTopLevelTaskDescr(threadId, omptarget_nvptx_threadPrivateContext->SetTopLevelTaskDescr(threadId,

newTaskDescr); newTaskDescr);

// init private from int value // init private from int value

PRINT(LD_PAR, PRINT(LD_PAR,

"thread will execute parallel region with id %d in a team of " "thread will execute parallel region with id %d in a team of "

"%d threads\n", "%d threads\n",

(int)newTaskDescr->ThreadId(), (int)nThreads); (int)newTaskDescr->ThreadId(), (int)nThreads);

isActive = true; isActive = true;

// Reconverge the threads at the end of the parallel region to correctly

// handle parallel levels.

// In Cuda9+ in non-SPMD mode we have either 1 worker thread or the whole

// warp. If only 1 thread is active, not need to reconverge the threads.

// If we have the whole warp, reconverge all the threads in the warp before

// actually trying to change the parallel level. Otherwise, parallel level

// can be changed incorrectly because of threads divergence.

bool IsActiveParallelRegion = threadsInTeam != 1;

IncParallelLevel(IsActiveParallelRegion,

IsActiveParallelRegion ? __kmpc_impl_all_lanes : 1u);

} }

return isActive; return isActive;

} }

EXTERN void __kmpc_kernel_end_parallel() { EXTERN void __kmpc_kernel_end_parallel() {

// pop stack // pop stack

PRINT0(LD_IO | LD_PAR, "call to __kmpc_kernel_end_parallel\n"); PRINT0(LD_IO | LD_PAR, "call to __kmpc_kernel_end_parallel\n");

ASSERT0(LT_FUSSY, isRuntimeInitialized(), "Expected initialized runtime."); ASSERT0(LT_FUSSY, isRuntimeInitialized(), "Expected initialized runtime.");

// Only the worker threads call this routine and the master warp // Only the worker threads call this routine and the master warp

// never arrives here. Therefore, use the nvptx thread id. // never arrives here. Therefore, use the nvptx thread id.

int threadId = GetThreadIdInBlock(); int threadId = GetThreadIdInBlock();

omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(threadId); omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(threadId);

omptarget_nvptx_threadPrivateContext->SetTopLevelTaskDescr( omptarget_nvptx_threadPrivateContext->SetTopLevelTaskDescr(

threadId, currTaskDescr->GetPrevTaskDescr()); threadId, currTaskDescr->GetPrevTaskDescr());

// Reconverge the threads at the end of the parallel region to correctly

// handle parallel levels.

// In Cuda9+ in non-SPMD mode we have either 1 worker thread or the whole

// warp. If only 1 thread is active, not need to reconverge the threads.

// If we have the whole warp, reconverge all the threads in the warp before

// actually trying to change the parallel level. Otherwise, parallel level can

// be changed incorrectly because of threads divergence.

bool IsActiveParallelRegion = threadsInTeam != 1;

DecParallelLevel(IsActiveParallelRegion,

IsActiveParallelRegion ? __kmpc_impl_all_lanes : 1u);

} }

//////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////////

// support for parallel that goes sequential // support for parallel that goes sequential

//////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////////

EXTERN void __kmpc_serialized_parallel(kmp_Ident *loc, uint32_t global_tid) { EXTERN void __kmpc_serialized_parallel(kmp_Ident *loc, uint32_t global_tid) {

PRINT0(LD_IO, "call to __kmpc_serialized_parallel\n"); PRINT0(LD_IO, "call to __kmpc_serialized_parallel\n");

[... 95 lines not shown ...]

ASSERT0(LT_FUSSY, 0, ASSERT0(LT_FUSSY, 0,

"should never have anything with new teams on device"); "should never have anything with new teams on device");

} }

EXTERN void __kmpc_push_proc_bind(kmp_Ident *loc, uint32_t tid, EXTERN void __kmpc_push_proc_bind(kmp_Ident *loc, uint32_t tid,

int proc_bind) { int proc_bind) {

PRINT(LD_IO, "call kmpc_push_proc_bind %d\n", (int)proc_bind); PRINT(LD_IO, "call kmpc_push_proc_bind %d\n", (int)proc_bind);

} }

////////////////////////////////////////////////////////////////////////////////

// parallel interface

////////////////////////////////////////////////////////////////////////////////

EXTERN void __kmpc_parallel_51(kmp_Ident *ident, kmp_int32 global_tid,

kmp_int32 if_expr, kmp_int32 num_threads,

int proc_bind, void *fn, void *wrapper_fn,

void **args, size_t nargs) {

// Handle the serialized case first, same for SPMD/non-SPMD.

// TODO: Add UNLIKELY to optimize?

if (!if_expr) {

jdoerfert (inline comment):

// TODO: Add UNLIKELY to optimize?

- if (!if_expr) {

+ if (!if_expr || currTaskDescr->InParallelRegion()) {

__kmpc_serialized_parallel(ident, global_tid);

This should allow us to remove the `SeqGen` in the Clang CodeGen *and* fix PR49777 *and* fix PR49779, a win-win-win situation.

ggeorgakoudis (author): Please check

jdoerfert: Check? Can we add the two reproducers as tests, please. One should be a clang test, the other maybe a runtime test, though clang test might suffice.

ggeorgakoudis (author): Ack, will do

__kmpc_serialized_parallel(ident, global_tid);

__kmp_invoke_microtask(global_tid, 0, fn, args, nargs);

__kmpc_end_serialized_parallel(ident, global_tid);

return;

}

if (__kmpc_is_spmd_exec_mode()) {

// Increment parallel level for SPMD warps.

if (GetThreadIdInBlock() == 0)

parallelLevel[0] =

1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);

else if (GetLaneId() == 0)

parallelLevel[GetWarpId()] =

1 + (GetNumberOfThreadsInBlock() > 1 ? OMP_ACTIVE_PARALLEL_LEVEL : 0);

// TODO: Is that synchronization correct/needed? Can only using a memory

// fence ensure consistency?

__kmpc_impl_syncthreads();

__kmp_invoke_microtask(global_tid, 0, fn, args, nargs);

// TODO: is decrementing parallel level needed? parallelLevel will reset to

// the next SPMD/non-SPMD parallel region execution, existing implementation

// does not decrement?

// parallelLevel[GetWarpId()] = 0;

return;

}

// Handle the num_threads clause.

if (num_threads != -1)

__kmpc_push_num_threads(ident, global_tid, num_threads);

__kmpc_kernel_prepare_parallel((void *)wrapper_fn);

if (nargs) {

void **GlobalArgs;

__kmpc_begin_sharing_variables(&GlobalArgs, nargs);

// TODO: faster memcpy?

for (int I = 0; I < nargs; I++)

GlobalArgs[I] = args[I];

}

// TODO: what if that's a parallel region with a single thread? this is considered

// not active in the existing implementation.

Lint: Pre-merge checks: clang-format: please reformat the code

-  // TODO: what if that's a parallel region with a single thread? this is considered
-  // not active in the existing implementation.
+  // TODO: what if that's a parallel region with a single thread? this is
+  // considered not active in the existing implementation.

bool IsActiveParallelRegion = threadsInTeam != 1;

// Increment parallel level for non-SPMD warps.

for (int I = 0; I < threadsInTeam / WARPSIZE; ++I)

parallelLevel[I] +=

(1 + (IsActiveParallelRegion ? OMP_ACTIVE_PARALLEL_LEVEL : 0));

// Master signals work to activate workers.

__kmpc_barrier_simple_spmd(nullptr, 0);

// OpenMP [2.5, Parallel Construct, p.49]

// There is an implied barrier at the end of a parallel region. After the

// end of a parallel region, only the master thread of the team resumes

// execution of the enclosing task region.

// The master waits at this barrier until all workers are done.

__kmpc_barrier_simple_spmd(nullptr, 0);

// Decrement parallel level for non-SPMD warps.

for (int I = 0; I < threadsInTeam / WARPSIZE; ++I)

parallelLevel[I] -=

(1 + (IsActiveParallelRegion ? OMP_ACTIVE_PARALLEL_LEVEL : 0));

// TODO: Is synchronization needed since out of parallel execution?

if (nargs)

__kmpc_end_sharing_variables();

// TODO: proc_bind is a noop?

// if (proc_bind != proc_bind_default)

// __kmpc_push_proc_bind(ident, global_tid, proc_bind);

}

jdoerfert (inline comment): FWIW, the implementation here is a stopgap until we move to the new runtime. The codegen and interface are the important parts.
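The control flow of `__kmpc_parallel_51` shown above, together with the reviewer's suggested `InParallelRegion()` check, can be modeled on the host as a small decision function. This is only a sketch: `ParallelPath` and `selectParallelPath` are hypothetical names that do not exist in the runtime, and the real function additionally handles data sharing, barriers, and parallel-level bookkeeping.

```cpp
#include <cstdint>

// Hypothetical host-side model of the three paths taken by
// __kmpc_parallel_51. Not part of the device runtime.
enum class ParallelPath { Serialized, Spmd, Generic };

// if_expr:     value of the if clause (1 when no clause is given).
// in_parallel: whether the caller is already inside a parallel region
//              (models the suggested currTaskDescr->InParallelRegion()).
// is_spmd:     whether the kernel executes in SPMD mode.
inline ParallelPath selectParallelPath(std::int32_t if_expr, bool in_parallel,
                                       bool is_spmd) {
  // The serialized case is handled first and is the same for SPMD/non-SPMD.
  if (!if_expr || in_parallel)
    return ParallelPath::Serialized;
  // SPMD: every thread calls the outlined function directly.
  if (is_spmd)
    return ParallelPath::Spmd;
  // Generic (non-SPMD): the master thread signals the workers.
  return ParallelPath::Generic;
}
```

With this model, a nested parallel region is serialized regardless of the execution mode, which is why the reviewer expects the separate `SeqGen` path in Clang to become unnecessary.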

#pragma omp end declare target #pragma omp end declare target

openmp/libomptarget/deviceRTLs/common/src/support.cu

	[... 259 lines not shown ...]
	DEVICE unsigned int *GetTeamsReductionTimestamp() {			DEVICE unsigned int *GetTeamsReductionTimestamp() {
	return static_cast<unsigned int *>(ReductionScratchpadPtr);			return static_cast<unsigned int *>(ReductionScratchpadPtr);
	}			}

	DEVICE char *GetTeamsReductionScratchpad() {			DEVICE char *GetTeamsReductionScratchpad() {
	return static_cast<char *>(ReductionScratchpadPtr) + 256;			return static_cast<char *>(ReductionScratchpadPtr) + 256;
	}			}

				// Invoke an outlined parallel function unwrapping arguments (up
				// to 16).
				DEVICE void __kmp_invoke_microtask(kmp_int32 global_tid, kmp_int32 bound_tid,
				void *fn, void **args, size_t nargs) {
				switch (nargs) {
				case 0:
				((void (*)(kmp_int32 *, kmp_int32 *))fn)(&global_tid, &bound_tid);
				break;
				case 1:
				((void (*)(kmp_int32 *, kmp_int32 *, void *))fn)(&global_tid, &bound_tid,
				args[0]);
				break;
				case 2:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1]);
				break;
				case 3:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1], args[2]);
				break;
				case 4:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1], args[2], args[3]);
				break;
				case 5:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *,
				void *))fn)(&global_tid, &bound_tid, args[0], args[1], args[2],
				args[3], args[4]);
				break;
				case 6:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *))fn)(&global_tid, &bound_tid, args[0], args[1], args[2],
				args[3], args[4], args[5]);
				break;
				case 7:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *))fn)(&global_tid, &bound_tid, args[0], args[1],
				args[2], args[3], args[4], args[5], args[6]);
				break;
				case 8:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *))fn)(&global_tid, &bound_tid, args[0],
				args[1], args[2], args[3], args[4],
				args[5], args[6], args[7]);
				break;
				case 9:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1], args[2], args[3], args[4],
				args[5], args[6], args[7], args[8]);
				break;
				case 10:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1], args[2], args[3], args[4],
				args[5], args[6], args[7], args[8], args[9]);
				break;
				case 11:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1], args[2], args[3], args[4],
				args[5], args[6], args[7], args[8], args[9], args[10]);
				break;
				case 12:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *, void *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1], args[2], args[3], args[4],
				args[5], args[6], args[7], args[8], args[9], args[10], args[11]);
				break;
				case 13:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *, void *, void *, void *,
				void *))fn)(&global_tid, &bound_tid, args[0], args[1], args[2],
				args[3], args[4], args[5], args[6], args[7], args[8],
				args[9], args[10], args[11], args[12]);
				break;
				case 14:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *, void *, void *, void *, void *,
				void *))fn)(&global_tid, &bound_tid, args[0], args[1], args[2],
				args[3], args[4], args[5], args[6], args[7], args[8],
				args[9], args[10], args[11], args[12], args[13]);
				break;
				case 15:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *, void *, void *, void *, void *,
				void *, void *))fn)(&global_tid, &bound_tid, args[0], args[1],
				args[2], args[3], args[4], args[5], args[6],
				args[7], args[8], args[9], args[10],
				args[11], args[12], args[13], args[14]);
				break;
				case 16:
				((void (*)(kmp_int32 *, kmp_int32 *, void *, void *, void *, void *, void *,
				void *, void *, void *, void *, void *, void *, void *, void *,
				void *, void *, void *))fn)(
				&global_tid, &bound_tid, args[0], args[1], args[2], args[3], args[4],
				args[5], args[6], args[7], args[8], args[9], args[10], args[11],
				args[12], args[13], args[14], args[15]);
				break;
				default:
				// TODO: assert
				printf("Too many arguments in kmp_invoke_microtask, aborting execution.\n");
				return;
				jdoerfert (inline comment): Not a return but a `__builtin_trap()`, please. We also need this for more than 16 unfortunately, I've seen 20 in miniqmc. We might want to create a script to print the cases, and then generate 128 or something like that in a file we include. The script can be in the utils folder too.
				}
				}
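The reviewer's suggestion to generate the `switch` cases instead of writing them by hand could look roughly like the following. This is a minimal sketch, written in C++ to match the surrounding code rather than as the script the reviewer has in mind; `makeMicrotaskCase` and `makeMicrotaskCases` are hypothetical helper names, and the emitted text simply mirrors the hand-written cases above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical generator: returns the text of one `case` of the switch in
// __kmp_invoke_microtask for an outlined function taking `nargs` arguments.
inline std::string makeMicrotaskCase(std::size_t nargs) {
  std::ostringstream os;
  os << "case " << nargs << ":\n";
  // Function-pointer type: two thread-id pointers, then nargs void pointers.
  os << "  ((void (*)(kmp_int32 *, kmp_int32 *";
  for (std::size_t i = 0; i < nargs; ++i)
    os << ", void *";
  // Call arguments: the two thread ids, then args[0] .. args[nargs - 1].
  os << "))fn)(&global_tid, &bound_tid";
  for (std::size_t i = 0; i < nargs; ++i)
    os << ", args[" << i << "]";
  os << ");\n";
  os << "  break;\n";
  return os.str();
}

// Concatenates cases 0..max_args, e.g. to write them into an included file.
inline std::string makeMicrotaskCases(std::size_t max_args) {
  std::string out;
  for (std::size_t n = 0; n <= max_args; ++n)
    out += makeMicrotaskCase(n);
  return out;
}
```

Calling `makeMicrotaskCases(128)` would produce the 129 cases the reviewer mentions; the output is unformatted and would still need clang-format before being checked in.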

	#pragma omp end declare target			#pragma omp end declare target

openmp/libomptarget/deviceRTLs/common/support.h

//===--------- support.h - OpenMP GPU support functions ---------- CUDA -*-===//		//===--------- support.h - OpenMP GPU support functions ---------- CUDA -*-===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// Wrapper to some functions natively supported by the GPU.		// Wrapper to some functions natively supported by the GPU.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef OMPTARGET_SUPPORT_H		#ifndef OMPTARGET_SUPPORT_H
#define OMPTARGET_SUPPORT_H		#define OMPTARGET_SUPPORT_H

#include "interface.h"		#include "interface.h"
		Lint: Pre-merge checks: clang-tidy: error: 'interface.h' file not found [clang-diagnostic-error]
#include "target_impl.h"		#include "target_impl.h"

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// Execution Parameters		// Execution Parameters
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
enum ExecutionMode {		enum ExecutionMode {
Spmd = 0x00u,		Spmd = 0x00u,
Generic = 0x01u,		Generic = 0x01u,
[... 65 lines not shown ...]
#define SUB_BYTES(_addr, _bytes) \		#define SUB_BYTES(_addr, _bytes) \
((void *)((char *)((void *)(_addr)) - (_bytes)))		((void *)((char *)((void *)(_addr)) - (_bytes)))

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// Teams Reduction Scratchpad Helpers		// Teams Reduction Scratchpad Helpers
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
DEVICE unsigned int *GetTeamsReductionTimestamp();		DEVICE unsigned int *GetTeamsReductionTimestamp();
DEVICE char *GetTeamsReductionScratchpad();		DEVICE char *GetTeamsReductionScratchpad();

		// Invoke an outlined parallel function unwrapping global, shared arguments (up
		// to 16).
		DEVICE void __kmp_invoke_microtask(kmp_int32 global_tid, kmp_int32 bound_tid,
		void *fn, void **args, size_t nargs);
		Lint: Pre-merge checks: clang-tidy: warning: invalid case style for function '__kmp_invoke_microtask' [readability-identifier-naming]

#endif		#endif

openmp/libomptarget/deviceRTLs/interface.h

[... 41 lines not shown ...]
typedef enum omp_proc_bind_t {		typedef enum omp_proc_bind_t {
omp_proc_bind_false = 0,		omp_proc_bind_false = 0,
omp_proc_bind_true = 1,		omp_proc_bind_true = 1,
omp_proc_bind_master = 2,		omp_proc_bind_master = 2,
omp_proc_bind_close = 3,		omp_proc_bind_close = 3,
omp_proc_bind_spread = 4		omp_proc_bind_spread = 4
} omp_proc_bind_t;		} omp_proc_bind_t;

EXTERN double omp_get_wtick(void);		EXTERN double omp_get_wtick(void);
		Lint: Pre-merge checks: clang-tidy: error: unknown type name 'EXTERN' [clang-diagnostic-error]
EXTERN double omp_get_wtime(void);		EXTERN double omp_get_wtime(void);

EXTERN void omp_set_num_threads(int num);		EXTERN void omp_set_num_threads(int num);
EXTERN int omp_get_num_threads(void);		EXTERN int omp_get_num_threads(void);
EXTERN int omp_get_max_threads(void);		EXTERN int omp_get_max_threads(void);
EXTERN int omp_get_thread_limit(void);		EXTERN int omp_get_thread_limit(void);
EXTERN int omp_get_thread_num(void);		EXTERN int omp_get_thread_num(void);
EXTERN int omp_get_num_procs(void);		EXTERN int omp_get_num_procs(void);
EXTERN int omp_in_parallel(void);		EXTERN int omp_in_parallel(void);
EXTERN int omp_in_final(void);		EXTERN int omp_in_final(void);
EXTERN void omp_set_dynamic(int flag);		EXTERN void omp_set_dynamic(int flag);
EXTERN int omp_get_dynamic(void);		EXTERN int omp_get_dynamic(void);
EXTERN void omp_set_nested(int flag);		EXTERN void omp_set_nested(int flag);
EXTERN int omp_get_nested(void);		EXTERN int omp_get_nested(void);
EXTERN void omp_set_max_active_levels(int level);		EXTERN void omp_set_max_active_levels(int level);
EXTERN int omp_get_max_active_levels(void);		EXTERN int omp_get_max_active_levels(void);
EXTERN int omp_get_level(void);		EXTERN int omp_get_level(void);
EXTERN int omp_get_active_level(void);		EXTERN int omp_get_active_level(void);
EXTERN int omp_get_ancestor_thread_num(int level);		EXTERN int omp_get_ancestor_thread_num(int level);
EXTERN int omp_get_team_size(int level);		EXTERN int omp_get_team_size(int level);

EXTERN void omp_init_lock(omp_lock_t *lock);		EXTERN void omp_init_lock(omp_lock_t *lock);
EXTERN void omp_init_nest_lock(omp_nest_lock_t *lock);		EXTERN void omp_init_nest_lock(omp_nest_lock_t *lock);
EXTERN void omp_destroy_lock(omp_lock_t *lock);		EXTERN void omp_destroy_lock(omp_lock_t *lock);
EXTERN void omp_destroy_nest_lock(omp_nest_lock_t *lock);		EXTERN void omp_destroy_nest_lock(omp_nest_lock_t *lock);
EXTERN void omp_set_lock(omp_lock_t *lock);		EXTERN void omp_set_lock(omp_lock_t *lock);
EXTERN void omp_set_nest_lock(omp_nest_lock_t *lock);		EXTERN void omp_set_nest_lock(omp_nest_lock_t *lock);
[... 94 lines not shown ...]
enum {		enum {
KMP_IDENT_SIMPLE_RT_MODE = 0x02,		KMP_IDENT_SIMPLE_RT_MODE = 0x02,
};		};

/*!		/*!
* The ident structure that describes a source location.		* The ident structure that describes a source location.
* The struct is identical to the one in the kmp.h file.		* The struct is identical to the one in the kmp.h file.
* We maintain the same data structure for compatibility.		* We maintain the same data structure for compatibility.
*/		*/
		typedef short kmp_int16;
typedef int kmp_int32;		typedef int kmp_int32;
typedef struct ident {		typedef struct ident {
kmp_int32 reserved_1; /**< might be used in Fortran; see above */		kmp_int32 reserved_1; /**< might be used in Fortran; see above */
kmp_int32 flags; /**< also f.flags; KMP_IDENT_xxx flags; KMP_IDENT_KMPC		kmp_int32 flags; /**< also f.flags; KMP_IDENT_xxx flags; KMP_IDENT_KMPC
identifies this union member */		identifies this union member */
kmp_int32 reserved_2; /**< not really used in Fortran any more; see above */		kmp_int32 reserved_2; /**< not really used in Fortran any more; see above */
kmp_int32 reserved_3; /**< source[4] in Fortran, do not use for C++ */		kmp_int32 reserved_3; /**< source[4] in Fortran, do not use for C++ */
char const *psource; /**< String describing the source location.		char const *psource; /**< String describing the source location.
[... 244 lines not shown ...]
EXTERN void *__kmpc_data_sharing_coalesced_push_stack(size_t size,		EXTERN void *__kmpc_data_sharing_coalesced_push_stack(size_t size,
int16_t UseSharedMemory);		int16_t UseSharedMemory);
EXTERN void *__kmpc_data_sharing_push_stack(size_t size, int16_t UseSharedMemory);		EXTERN void *__kmpc_data_sharing_push_stack(size_t size, int16_t UseSharedMemory);
EXTERN void __kmpc_data_sharing_pop_stack(void *a);		EXTERN void __kmpc_data_sharing_pop_stack(void *a);
EXTERN void __kmpc_begin_sharing_variables(void ***GlobalArgs, size_t nArgs);		EXTERN void __kmpc_begin_sharing_variables(void ***GlobalArgs, size_t nArgs);
EXTERN void __kmpc_end_sharing_variables();		EXTERN void __kmpc_end_sharing_variables();
EXTERN void __kmpc_get_shared_variables(void ***GlobalArgs);		EXTERN void __kmpc_get_shared_variables(void ***GlobalArgs);

		/// Entry point to start a new parallel region.
		///
		/// \param ident The source identifier.
		/// \param global_tid The global thread ID.
		/// \param if_expr The if(expr), or 1 if none given.
		/// \param num_threads The num_threads(expr), or -1 if none given.
		/// \param proc_bind The proc_bind, or `proc_bind_default` if none given.
		/// \param fn The outlined parallel region function.
		/// \param wrapper_fn The worker wrapper function of fn.
		/// \param args The pointer array of arguments to fn.
		/// \param nargs The number of arguments to fn.
		EXTERN void __kmpc_parallel_51(ident_t *ident, kmp_int32 global_tid,
		kmp_int32 if_expr, kmp_int32 num_threads,
		int proc_bind, void *fn, void *wrapper_fn,
		void **args, size_t nargs);
		Lint: Pre-merge checks: clang-tidy: warning: invalid case style for function '__kmpc_parallel_51' [readability-identifier-naming]
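To illustrate the `fn`, `args`, and `nargs` parameters documented above, the following host-only sketch packs two captured variables into a pointer array and unwraps them again when calling an outlined function. All names here (`outlinedRegion`, `invokeTwoArgs`, `runExample`) are hypothetical and exist only for illustration; the real runtime performs the unwrapping in `__kmp_invoke_microtask` on the device.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::int32_t kmp_int32;

// Hypothetical outlined parallel region with two captured variables, which
// follow the global and bound thread-id pointers in the argument list.
static void outlinedRegion(kmp_int32 *global_tid, kmp_int32 *bound_tid,
                           void *a, void *b) {
  (void)bound_tid; // unused in this example
  *static_cast<int *>(b) = *static_cast<int *>(a) + *global_tid;
}

// Host-side stand-in for the unwrapping step (two-argument case only).
static void invokeTwoArgs(kmp_int32 global_tid, kmp_int32 bound_tid, void *fn,
                          void **args, std::size_t nargs) {
  if (nargs != 2)
    return;
  ((void (*)(kmp_int32 *, kmp_int32 *, void *, void *))fn)(
      &global_tid, &bound_tid, args[0], args[1]);
}

// Packs two captured variables by address, as a caller of the interface
// would, and returns the value written by the outlined function.
static int runExample(int input, kmp_int32 tid) {
  int result = 0;
  void *args[] = {&input, &result};
  invokeTwoArgs(tid, 0, reinterpret_cast<void *>(&outlinedRegion), args, 2);
  return result;
}
```

Passing every captured variable as an untyped pointer is what lets a single runtime entry point serve outlined functions with different signatures, at the cost of one `switch` case per argument count.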

// SPMD execution mode interrogation function.		// SPMD execution mode interrogation function.
EXTERN int8_t __kmpc_is_spmd_exec_mode();		EXTERN int8_t __kmpc_is_spmd_exec_mode();

EXTERN void __kmpc_get_team_static_memory(int16_t isSPMDExecutionMode,		EXTERN void __kmpc_get_team_static_memory(int16_t isSPMDExecutionMode,
const void *buf, size_t size,		const void *buf, size_t size,
int16_t is_shared, const void **res);		int16_t is_shared, const void **res);

EXTERN void __kmpc_restore_team_static_memory(int16_t isSPMDExecutionMode,		EXTERN void __kmpc_restore_team_static_memory(int16_t isSPMDExecutionMode,
int16_t is_shared);		int16_t is_shared);

#endif		#endif

Diff 321375