This is an archive of the discontinued LLVM Phabricator instance.

Differential D113359

[Libomptarget][WIP] Introduce VGPU Plugin
AcceptedPublic

Authored by atmnpatel on Nov 6 2021, 7:41 PM.

Details

Reviewers

jdoerfert
tianshilei1992
JonChesterfield

Summary

This patch introduces a virtual GPU (x86) plugin, which emulates the
GPU environment on the host. It reuses the same execution model,
compilation paths, and runtimes as a physical GPU. The number of
threads, warps, and CTAs are set through the environment variables
VGPU_{NUM_THREADS,NUM_WARPS,WARPS_PER_CTA}, respectively.
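
As a rough illustration of how a plugin could pick up these settings, the C++ sketch below reads the three variables named in the summary. The helper readEnvOr, the VGPUConfig struct, and the fallback value of 1 are assumptions made for this example only; the patch's actual parsing code and defaults are not shown on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not from the patch): parse a positive decimal value
// from the environment, falling back to Default when the variable is unset,
// empty, malformed, zero, or too large for 32 bits.
inline uint32_t readEnvOr(const char *Name, uint32_t Default) {
  assert(Name && "environment variable name must not be null");
  const char *Raw = std::getenv(Name);
  if (!Raw || *Raw == '\0')
    return Default;
  char *End = nullptr;
  unsigned long Value = std::strtoul(Raw, &End, 10);
  if (*End != '\0' || Value == 0 || Value > UINT32_MAX)
    return Default;
  return static_cast<uint32_t>(Value);
}

// Hypothetical container for the three settings named in the summary.
struct VGPUConfig {
  uint32_t NumThreads;
  uint32_t NumWarps;
  uint32_t WarpsPerCTA;
};

inline VGPUConfig loadVGPUConfig() {
  // The fallback of 1 is a placeholder, not a default taken from the patch.
  return {readEnvOr("VGPU_NUM_THREADS", 1), readEnvOr("VGPU_NUM_WARPS", 1),
          readEnvOr("VGPU_WARPS_PER_CTA", 1)};
}
```

With a shell setting such as VGPU_NUM_THREADS=64, loadVGPUConfig() would report 64 threads and keep the placeholder value for the two variables left unset.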

Known Bugs:

There is undefined behavior (UB) somewhere in the DeviceRTL that occasionally sets the stride to zero, which crashes the program with a floating-point exception (FPE).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

atmnpatel created this revision. Nov 6 2021, 7:41 PM

Herald added subscribers: ormris, dexonsmith, pengfei and 2 others. Nov 6 2021, 7:41 PM

atmnpatel requested review of this revision. Nov 6 2021, 7:41 PM

Herald added projects: Restricted Project, Restricted Project, Restricted Project. Nov 6 2021, 7:41 PM

Herald added subscribers: llvm-commits, openmp-commits, cfe-commits, sstefan1.

Harbormaster completed remote builds in B132881: Diff 385318. Nov 6 2021, 10:12 PM

jdoerfert added inline comments. Nov 8 2021, 7:59 AM

clang/lib/CodeGen/CGOpenMPRuntimeVirtualGPU.cpp
54 ↗	(On Diff #385318)	We should be able to get rid of this file (and the cuda/hip version). Might be the right time now as a precommit.
llvm/include/llvm/ADT/Triple.h
168	Let's call it OpenMP_VGPU or something like that to make it clear.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
2177 ↗	(On Diff #385318)	@tianshilei1992 This needs a test.
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
110	I don't think we should do this. Instead, the plugin should signal as threads finish the kernel.
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
240	We probably should use kind(CPU) or something instead. Nothing x86 specific about it I think.
openmp/libomptarget/include/DeviceEnvironment.h
83 ↗	(On Diff #385318)	This should go into a new file (ThreadEnvironment)

I removed the shared var opt - might be best to keep this in a separate patch @tianshilei1992. Also addressed comments.

small nit fix

Harbormaster completed remote builds in B133662: Diff 386426. Nov 10 2021, 11:49 PM

tianshilei1992 added inline comments. Nov 11 2021, 9:04 AM

clang/lib/Driver/ToolChains/Gnu.cpp
3082	Maybe `"x86_64-openmp_vpu"` now?
llvm/lib/Support/Triple.cpp
189	`"openmp_vpu"`?

I can't see it in the diff - does the cmake somewhere enable the existing tests on this new target?

A bit surprised to see ffi involved, are we thinking of spawning a separate process for the target?

clang/lib/Basic/Targets/X86.h
49	It's not clear to me why this is x86 specific. Being able to run our tests on power / arm etc. seems like an advantage. It would also mean we avoid adding openmp stuff to the x86-specific files. Maybe OpenMPVGPUAddrSpaceMap and put it in one of the openmp source files?
clang/lib/Frontend/CompilerInvocation.cpp
3988	Add a isOpenmpVGPU function?
openmp/libomptarget/DeviceRTL/CMakeLists.txt
135	Should only add this include to the vgpu, not all the plugins. May be able to use relative include paths to drop it entirely.
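
The request for a named predicate can be sketched as follows. MiniTriple is a hypothetical stand-in for llvm::Triple, reduced to the fields the example needs, and wantsOpenMPCUDAMode is an illustrative name; only the shape of the check (vendor comparison hidden behind isOpenMPVGPU) follows the review discussion.

```cpp
#include <cassert>

// Hypothetical, heavily reduced stand-ins for llvm::Triple's enums.
enum class Arch { X86_64, AArch64, NVPTX64, AMDGCN };
enum class Vendor { Unknown, NVIDIA, AMD, OpenMP_VGPU };

// Stand-in for llvm::Triple (assumption: only arch and vendor matter here).
struct MiniTriple {
  Arch A;
  Vendor V;
  bool isNVPTX() const { return A == Arch::NVPTX64; }
  bool isAMDGCN() const { return A == Arch::AMDGCN; }
  // One named predicate replaces repeated
  // `getVendor() == Triple::OpenMP_VGPU` comparisons at each call site.
  bool isOpenMPVGPU() const { return V == Vendor::OpenMP_VGPU; }
};

// Illustrative caller, modelled loosely on the CUDA-mode condition in the
// CompilerInvocation.cpp hunk; the function name is an assumption.
inline bool wantsOpenMPCUDAMode(const MiniTriple &T, bool IsDevice,
                                bool FlagGiven) {
  return IsDevice && (T.isNVPTX() || T.isAMDGCN() || T.isOpenMPVGPU()) &&
         FlagGiven;
}

// Small usage example: the predicate depends on the vendor only, so a
// non-x86 host architecture qualifies as well.
inline void exampleUsage() {
  MiniTriple ArmVGPU{Arch::AArch64, Vendor::OpenMP_VGPU};
  assert(ArmVGPU.isOpenMPVGPU());
  assert(wantsOpenMPCUDAMode(ArmVGPU, /*IsDevice=*/true, /*FlagGiven=*/true));
}
```

Because the predicate looks at the vendor alone, it also fits the reviewers' point that nothing about the virtual GPU needs to be x86 specific.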

Fixed lifetime issue around ffi_call
Addressed comments

The existing x86 plugin uses ffi, so this does as well, no explicit benefit in doing so. Is it worth keeping?

Harbormaster completed remote builds in B142248: Diff 398370. Jan 8 2022, 2:06 PM

jdoerfert added inline comments. Jan 10 2022, 7:00 AM

llvm/lib/Support/Triple.cpp
512
openmp/libomptarget/DeviceRTL/src/Debug.cpp
53
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
114
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
28
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
290
314	Pass the memory order, also rename the arguments to match the coding convention.
317	Pass the mask
openmp/libomptarget/DeviceRTL/src/Utils.cpp
56
68	Can't we merge this with AMDGPU?
138
openmp/libomptarget/plugins/vgpu/src/rtl.cpp
304	Can we split this up and create some helper functions maybe?
openmp/libomptarget/src/rtl.cpp
34	Introduce an environment variable, if it is set, X86 target should skip the image. Also, add a TODO such that we later look into the image and inspect it to decide automatically.
openmp/libomptarget/test/lit.cfg
189 ↗	(On Diff #398370)	Leftovers.

tianshilei1992 added inline comments. Jan 12 2022, 6:55 AM

openmp/libomptarget/DeviceRTL/CMakeLists.txt
232	It's not a good practice to specify include directories in CMake in this way. Use `include_directories` instead.
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
113	Is this code here unintentional? We don't need to specialize this function for vgpu IIRC.

ormris removed a subscriber: ormris. Jan 18 2022, 10:04 AM

jdoerfert added inline comments. Jan 18 2022, 12:50 PM

openmp/libomptarget/DeviceRTL/src/Kernel.cpp
113	we might be able to avoid it if we move the synchronize::threads "effect" into the VGPU instead.

atmnpatel edited the summary of this revision. Jan 18 2022, 11:32 PM

Addressed comments

atmnpatel added inline comments. Jan 18 2022, 11:36 PM

openmp/libomptarget/DeviceRTL/CMakeLists.txt
232	can't quite do that here I think, afaik both `include_directories` and `target_include_directories` require that CMake builds the target, but we specify custom targets/build commands so they don't get pulled in

Harbormaster completed remote builds in B144209: Diff 401112. Jan 19 2022, 12:44 AM

jdoerfert added inline comments. Feb 2 2022, 9:45 AM

clang/lib/Driver/ToolChains/Gnu.cpp
3082	not x86, right? triple contains the proper arch
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
29	Move up to the beginning.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
291	Move up.
342	We should simply use omp locks. Either here, or maybe better, in VGPUImpl. So redirect all calls to there and use a proper lock. no OMP_SPIN and stuff
openmp/libomptarget/DeviceRTL/src/Utils.cpp
139	Move up
148	Pass the mask, both times.
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp
50	see above.
openmp/libomptarget/src/rtl.cpp
92	Not only x86; also, let's not do strcmp. Extend RTLNames to be an array of structs with more elaborate information, e.g., an is-host flag. That said, I'm unsure whether not loading the plugin is the right way to not grab the image. Good enough for now.
openmp/libomptarget/test/CMakeLists.txt
23 ↗	(On Diff #401112)	This is to disable the tests? Not sure this is a good way though. For one, can we check against -vgpu not x86, also openmp-vgpu or something, right?
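
One way to read the suggestion about replacing the strcmp check with structured plugin metadata is sketched below. PluginInfo, pluginsToLoad, skipHostRequested, the variable EXAMPLE_SKIP_HOST_PLUGINS, and the library names in the table are all illustrative assumptions, not the names used in libomptarget or in this patch.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical replacement for a plain array of plugin names: each entry
// carries a flag saying whether the plugin targets the host.
struct PluginInfo {
  const char *LibraryName;
  bool IsHost;
};

// Illustrative table; the entries are placeholders for this example.
static const PluginInfo Plugins[] = {
    {"libomptarget.rtl.cuda.so", false},
    {"libomptarget.rtl.amdgpu.so", false},
    {"libomptarget.rtl.vgpu.so", false}, // assumed name for the new plugin
    {"libomptarget.rtl.x86_64.so", true},
};

// Returns the plugins to load. Host plugins are filtered by their flag
// rather than by comparing library names.
inline std::vector<std::string> pluginsToLoad(bool SkipHost) {
  std::vector<std::string> Result;
  for (const PluginInfo &P : Plugins) {
    assert(P.LibraryName && "every table entry needs a library name");
    if (SkipHost && P.IsHost)
      continue;
    Result.emplace_back(P.LibraryName);
  }
  return Result;
}

// Hypothetical opt-out switch: any value of the (made-up) variable
// EXAMPLE_SKIP_HOST_PLUGINS requests that host plugins be skipped.
inline bool skipHostRequested() {
  return std::getenv("EXAMPLE_SKIP_HOST_PLUGINS") != nullptr;
}
```

A caller would combine the two as pluginsToLoad(skipHostRequested()). Inspecting the image to decide automatically, as the earlier TODO request describes, is outside the scope of this sketch.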

updates

openmp/libomptarget/test/CMakeLists.txt
23 ↗	(On Diff #401112)	Yep

Harbormaster completed remote builds in B147239: Diff 405407. Feb 2 2022, 4:55 PM

LG, with some things to address before the merge though.

Didn't we have a pass to expand shared memory (and such)?

clang/lib/Basic/TargetInfo.cpp
155 ↗	(On Diff #405407)	use isOpenMPVGPU
clang/lib/Basic/Targets/X86.h
420	Do we need the changes in this file at all? I couldn't see why.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1125	isOpenMPVGPU
clang/lib/CodeGen/CodeGenModule.cpp
252–254	isOpenMPVGPU
clang/lib/Driver/ToolChains/Gnu.cpp
3081	isOpenMPVGPU
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
322	Remove these. Also the TODO below (copied from somewhere)
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.cpp
85 ↗	(On Diff #405407)	This is racy, I think. Can we use atomic_add for all these Idx updates or pass the Id from the outside?
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h
119	at least add more information what the problem and potential solutions are.
openmp/libomptarget/plugins/vgpu/src/rtl.cpp
272	Move the lambda into a helper function. indention of 12 is too much.
314	When do we have more threads than NumThreads?
555	if we need for submit/retrieve, I'd assume to wait here too.
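
The atomic_add idea raised for the racy Idx updates can be illustrated with a small host-side sketch. ThreadIdAllocator and its members are hypothetical names; the code in ThreadEnvironmentImpl.cpp is not visible on this page, so this shows only the general technique of handing out indices with an atomic fetch-and-add.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical allocator that hands out consecutive indices. fetch_add is a
// single atomic read-modify-write, so two threads calling acquire()
// concurrently can never observe the same value, unlike a plain `Idx++`.
class ThreadIdAllocator {
  std::atomic<uint32_t> Next{0};
  const uint32_t Limit;

public:
  explicit ThreadIdAllocator(uint32_t Limit) : Limit(Limit) {}

  // Returns a unique index in [0, Limit).
  uint32_t acquire() {
    uint32_t Id = Next.fetch_add(1, std::memory_order_relaxed);
    assert(Id < Limit && "more callers than available slots");
    return Id;
  }

  // Number of indices handed out so far.
  uint32_t issued() const { return Next.load(std::memory_order_relaxed); }
};
```

Relaxed ordering is enough here because only the uniqueness of the returned value matters. The alternative named in the comment, passing the Id in from the outside, avoids the shared counter entirely.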

This revision is now accepted and ready to land. Feb 3 2022, 9:26 AM

Not sure if it's good to merge such a large patch. We could potentially split the patch into three independent patches: tool chain, device runtime, and the OpenMPOpt pass that supports expansion of shared variables (which for some reason is not included in this patch; that is actually a very important component, otherwise the backend will complain about it).

We can merge runtime first, build it in isolation, then libomptarget host runtime, then clang.

Also make sure to adjust the commit messages

dexonsmith removed a subscriber: dexonsmith. Feb 14 2022, 11:03 AM

@jdoerfert @tianshilei1992 @atmnpatel @dhruvachak

Is the target to get this merged in for LLVM 16? Does the VGPU implementation provide a way to support OMPT callbacks for various constructs (parallel, worksharing, barriers, etc.)?

Herald added a project: Restricted Project. Nov 8 2022, 10:13 AM

Herald added a subscriber: MaskRay.

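Revision Contents

Path	Size
clang/lib/Basic/Targets/X86.h	30 lines
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp	9 lines
clang/lib/CodeGen/CodeGenModule.cpp	4 lines
clang/lib/Driver/ToolChains/Gnu.cpp	9 lines
clang/lib/Frontend/CompilerInvocation.cpp	4 lines
llvm/include/llvm/ADT/Triple.h	3 lines
llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h	10 lines
llvm/lib/Support/Triple.cpp	35 lines
openmp/CMakeLists.txt	2 lines
openmp/libomptarget/DeviceRTL/CMakeLists.txt	8 lines
openmp/libomptarget/DeviceRTL/src/Debug.cpp	10 lines
openmp/libomptarget/DeviceRTL/src/Kernel.cpp	16 lines
openmp/libomptarget/DeviceRTL/src/Mapping.cpp	77 lines
openmp/libomptarget/DeviceRTL/src/Misc.cpp	5 lines
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp	67 lines
openmp/libomptarget/DeviceRTL/src/Utils.cpp	38 lines
openmp/libomptarget/plugins/CMakeLists.txt	1 line
openmp/libomptarget/plugins/vgpu/CMakeLists.txt	58 lines
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.h	72 lines
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp	120 lines
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h	168 lines
openmp/libomptarget/plugins/vgpu/src/rtl.cpp	623 lines
openmp/libomptarget/src/rtl.cpp	9 lines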

Diff 386426

clang/lib/Basic/Targets/X86.h

Show All 11 Lines

#ifndef LLVM_CLANG_LIB_BASIC_TARGETS_X86_H		#ifndef LLVM_CLANG_LIB_BASIC_TARGETS_X86_H
#define LLVM_CLANG_LIB_BASIC_TARGETS_X86_H		#define LLVM_CLANG_LIB_BASIC_TARGETS_X86_H

#include "OSTargets.h"		#include "OSTargets.h"
#include "clang/Basic/TargetInfo.h"		#include "clang/Basic/TargetInfo.h"
#include "clang/Basic/TargetOptions.h"		#include "clang/Basic/TargetOptions.h"
#include "llvm/ADT/Triple.h"		#include "llvm/ADT/Triple.h"
		#include "llvm/Frontend/OpenMP/OMPGridValues.h"
#include "llvm/Support/Compiler.h"		#include "llvm/Support/Compiler.h"
#include "llvm/Support/X86TargetParser.h"		#include "llvm/Support/X86TargetParser.h"

namespace clang {		namespace clang {
namespace targets {		namespace targets {

static const unsigned X86AddrSpaceMap[] = {		static const unsigned X86AddrSpaceMap[] = {
0, // Default		0, // Default
Show All 12 Lines	static const unsigned X86AddrSpaceMap[] = {
0, // sycl_global_host		0, // sycl_global_host
0, // sycl_local		0, // sycl_local
0, // sycl_private		0, // sycl_private
270, // ptr32_sptr		270, // ptr32_sptr
271, // ptr32_uptr		271, // ptr32_uptr
272 // ptr64		272 // ptr64
};		};

		static const unsigned X86VGPUAddrSpaceMap[] = {
		JonChesterfield (Done): It's not clear to me why this is x86 specific. Being able to run our tests on power / arm etc. seems like an advantage. It would also mean we avoid adding openmp stuff to the x86-specific files. Maybe OpenMPVGPUAddrSpaceMap and put it in one of the openmp source files?
		0, // Default
		1, // opencl_global
		3, // opencl_local
		4, // opencl_constant
		0, // opencl_private
		0, // opencl_generic
		1, // opencl_global_device
		1, // opencl_global_host
		1, // cuda_device
		4, // cuda_constant
		3, // cuda_shared
		1, // sycl_global
		0, // sycl_global_device
		0, // sycl_global_host
		3, // sycl_local
		0, // sycl_private
		270, // ptr32_sptr
		271, // ptr32_uptr
		272 // ptr64
		};

// X86 target abstract base class; x86-32 and x86-64 are very close, so		// X86 target abstract base class; x86-32 and x86-64 are very close, so
// most of the implementation can be shared.		// most of the implementation can be shared.
class LLVM_LIBRARY_VISIBILITY X86TargetInfo : public TargetInfo {		class LLVM_LIBRARY_VISIBILITY X86TargetInfo : public TargetInfo {

enum X86SSEEnum {		enum X86SSEEnum {
NoSSE,		NoSSE,
SSE1,		SSE1,
SSE2,		SSE2,
▲ Show 20 Lines • Show All 101 Lines • ▼ Show 20 Lines	X86TargetInfo(const llvm::Triple &Triple, const TargetOptions &)
LongDoubleFormat = &llvm::APFloat::x87DoubleExtended();		LongDoubleFormat = &llvm::APFloat::x87DoubleExtended();
AddrSpaceMap = &X86AddrSpaceMap;		AddrSpaceMap = &X86AddrSpaceMap;
HasStrictFP = true;		HasStrictFP = true;

bool IsWinCOFF =		bool IsWinCOFF =
getTriple().isOSWindows() && getTriple().isOSBinFormatCOFF();		getTriple().isOSWindows() && getTriple().isOSBinFormatCOFF();
if (IsWinCOFF)		if (IsWinCOFF)
MaxVectorAlign = MaxTLSAlign = 8192u * getCharWidth();		MaxVectorAlign = MaxTLSAlign = 8192u * getCharWidth();

		if (Triple.getVendor() == llvm::Triple::OpenMP_VGPU)
		AddrSpaceMap = &X86VGPUAddrSpaceMap;
}		}

const char *getLongDoubleMangling() const override {		const char *getLongDoubleMangling() const override {
return LongDoubleFormat == &llvm::APFloat::IEEEquad() ? "g" : "e";		return LongDoubleFormat == &llvm::APFloat::IEEEquad() ? "g" : "e";
}		}

unsigned getFloatEvalMethod() const override {		unsigned getFloatEvalMethod() const override {
// X87 evaluates with 80 bits "long double" precision.		// X87 evaluates with 80 bits "long double" precision.
▲ Show 20 Lines • Show All 210 Lines • ▼ Show 20 Lines	uint64_t getPointerWidthV(unsigned AddrSpace) const override {
if (AddrSpace == ptr64)		if (AddrSpace == ptr64)
return 64;		return 64;
return PointerWidth;		return PointerWidth;
}		}

uint64_t getPointerAlignV(unsigned AddrSpace) const override {		uint64_t getPointerAlignV(unsigned AddrSpace) const override {
return getPointerWidthV(AddrSpace);		return getPointerWidthV(AddrSpace);
}		}

		const llvm::omp::GV &getGridValue() const override {
		return llvm::omp::VirtualGpuGridValues;
		}
		jdoerfert (Not Done): Do we need the changes in this file at all? I couldn't see why.
};		};

// X86-32 generic target		// X86-32 generic target
class LLVM_LIBRARY_VISIBILITY X86_32TargetInfo : public X86TargetInfo {		class LLVM_LIBRARY_VISIBILITY X86_32TargetInfo : public X86TargetInfo {
public:		public:
X86_32TargetInfo(const llvm::Triple &Triple, const TargetOptions &Opts)		X86_32TargetInfo(const llvm::Triple &Triple, const TargetOptions &Opts)
: X86TargetInfo(Triple, Opts) {		: X86TargetInfo(Triple, Opts) {
DoubleAlign = LongLongAlign = 32;		DoubleAlign = LongLongAlign = 32;
▲ Show 20 Lines • Show All 538 Lines • Show Last 20 Lines

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp

Show First 20 Lines • Show All 1,113 Lines • ▼ Show 20 Lines	auto *GVMode = new llvm::GlobalVariable(
CGM.getModule(), CGM.Int8Ty, /*isConstant=*/true,		CGM.getModule(), CGM.Int8Ty, /*isConstant=*/true,
llvm::GlobalValue::WeakAnyLinkage,		llvm::GlobalValue::WeakAnyLinkage,
llvm::ConstantInt::get(CGM.Int8Ty, Mode ? OMP_TGT_EXEC_MODE_SPMD		llvm::ConstantInt::get(CGM.Int8Ty, Mode ? OMP_TGT_EXEC_MODE_SPMD
: OMP_TGT_EXEC_MODE_GENERIC),		: OMP_TGT_EXEC_MODE_GENERIC),
Twine(Name, "_exec_mode"));		Twine(Name, "_exec_mode"));
CGM.addCompilerUsedGlobal(GVMode);		CGM.addCompilerUsedGlobal(GVMode);
}		}

void CGOpenMPRuntimeGPU::createOffloadEntry(llvm::Constant *ID,		void CGOpenMPRuntimeGPU::createOffloadEntry(
llvm::Constant *Addr,		llvm::Constant *ID, llvm::Constant *Addr, uint64_t Size, int32_t Flags,
uint64_t Size, int32_t,		llvm::GlobalValue::LinkageTypes Linkage) {
llvm::GlobalValue::LinkageTypes) {		if (CGM.getTarget().getTriple().getVendor() == llvm::Triple::OpenMP_VGPU)
		jdoerfert (Not Done): isOpenMPVGPU
		return CGOpenMPRuntime::createOffloadEntry(ID, Addr, Size, Flags, Linkage);
// TODO: Add support for global variables on the device after declare target		// TODO: Add support for global variables on the device after declare target
// support.		// support.
if (!isa<llvm::Function>(Addr))		if (!isa<llvm::Function>(Addr))
return;		return;
llvm::Module &M = CGM.getModule();		llvm::Module &M = CGM.getModule();
llvm::LLVMContext &Ctx = CGM.getLLVMContext();		llvm::LLVMContext &Ctx = CGM.getLLVMContext();

// Get "nvvm.annotations" metadata node		// Get "nvvm.annotations" metadata node
▲ Show 20 Lines • Show All 2,842 Lines • Show Last 20 Lines

clang/lib/CodeGen/CodeGenModule.cpp

Show First 20 Lines • Show All 243 Lines • ▼ Show 20 Lines	void CodeGenModule::createOpenMPRuntime() {
case llvm::Triple::nvptx:		case llvm::Triple::nvptx:
case llvm::Triple::nvptx64:		case llvm::Triple::nvptx64:
case llvm::Triple::amdgcn:		case llvm::Triple::amdgcn:
assert(getLangOpts().OpenMPIsDevice &&		assert(getLangOpts().OpenMPIsDevice &&
"OpenMP AMDGPU/NVPTX is only prepared to deal with device code.");		"OpenMP AMDGPU/NVPTX is only prepared to deal with device code.");
OpenMPRuntime.reset(new CGOpenMPRuntimeGPU(*this));		OpenMPRuntime.reset(new CGOpenMPRuntimeGPU(*this));
break;		break;
default:		default:
if (LangOpts.OpenMPSimd)		if (getTriple().getVendor() == llvm::Triple::OpenMP_VGPU) {
		OpenMPRuntime.reset(new CGOpenMPRuntimeGPU(*this));
		} else if (LangOpts.OpenMPSimd)
		jdoerfert (Not Done): isOpenMPVGPU
OpenMPRuntime.reset(new CGOpenMPSIMDRuntime(*this));		OpenMPRuntime.reset(new CGOpenMPSIMDRuntime(*this));
else		else
OpenMPRuntime.reset(new CGOpenMPRuntime(*this));		OpenMPRuntime.reset(new CGOpenMPRuntime(*this));
break;		break;
}		}
}		}

void CodeGenModule::createCUDARuntime() {		void CodeGenModule::createCUDARuntime() {
▲ Show 20 Lines • Show All 6,238 Lines • Show Last 20 Lines

clang/lib/Driver/ToolChains/Gnu.cpp

	Show First 20 Lines • Show All 3,068 Lines • ▼ Show 20 Lines
	void Generic_ELF::anchor() {}			void Generic_ELF::anchor() {}

	void Generic_ELF::addClangTargetOptions(const ArgList &DriverArgs,			void Generic_ELF::addClangTargetOptions(const ArgList &DriverArgs,
	ArgStringList &CC1Args,			ArgStringList &CC1Args,
	Action::OffloadKind) const {			Action::OffloadKind) const {
	if (!DriverArgs.hasFlag(options::OPT_fuse_init_array,			if (!DriverArgs.hasFlag(options::OPT_fuse_init_array,
	options::OPT_fno_use_init_array, true))			options::OPT_fno_use_init_array, true))
	CC1Args.push_back("-fno-use-init-array");			CC1Args.push_back("-fno-use-init-array");

				if (DriverArgs.hasArg(options::OPT_S))
				return;

				if (getTriple().getVendor() == llvm::Triple::OpenMP_VGPU) {
				jdoerfert (Not Done): isOpenMPVGPU
				std::string BitcodeSuffix = "x86_64-vgpu";
				tianshilei1992 (Not Done): Maybe `"x86_64-openmp_vpu"` now?
				jdoerfert (Done): not x86, right? triple contains the proper arch
				clang::driver::tools::addOpenMPDeviceRTL(getDriver(), DriverArgs, CC1Args,
				BitcodeSuffix, getTriple());
				}
	}			}

clang/lib/Frontend/CompilerInvocation.cpp

Show First 20 Lines • Show All 3,977 Lines • ▼ Show 20 Lines	#undef LANG_OPTION_WITH_MARSHALLING
if (Arg *A = Args.getLastArg(options::OPT_fopenmp_host_ir_file_path)) {		if (Arg *A = Args.getLastArg(options::OPT_fopenmp_host_ir_file_path)) {
Opts.OMPHostIRFile = A->getValue();		Opts.OMPHostIRFile = A->getValue();
if (!llvm::sys::fs::exists(Opts.OMPHostIRFile))		if (!llvm::sys::fs::exists(Opts.OMPHostIRFile))
Diags.Report(diag::err_drv_omp_host_ir_file_not_found)		Diags.Report(diag::err_drv_omp_host_ir_file_not_found)
<< Opts.OMPHostIRFile;		<< Opts.OMPHostIRFile;
}		}

// Set CUDA mode for OpenMP target NVPTX/AMDGCN if specified in options		// Set CUDA mode for OpenMP target NVPTX/AMDGCN if specified in options
Opts.OpenMPCUDAMode = Opts.OpenMPIsDevice && (T.isNVPTX() || T.isAMDGCN()) &&		Opts.OpenMPCUDAMode = Opts.OpenMPIsDevice &&
		(T.isNVPTX() || T.isAMDGCN() ||
		T.getVendor() == llvm::Triple::OpenMP_VGPU) &&
		JonChesterfield (Done): Add a isOpenmpVGPU function?
Args.hasArg(options::OPT_fopenmp_cuda_mode);		Args.hasArg(options::OPT_fopenmp_cuda_mode);

// Set CUDA mode for OpenMP target NVPTX/AMDGCN if specified in options		// Set CUDA mode for OpenMP target NVPTX/AMDGCN if specified in options
Opts.OpenMPCUDAForceFullRuntime =		Opts.OpenMPCUDAForceFullRuntime =
Opts.OpenMPIsDevice && (T.isNVPTX() || T.isAMDGCN()) &&		Opts.OpenMPIsDevice && (T.isNVPTX() || T.isAMDGCN()) &&
Args.hasArg(options::OPT_fopenmp_cuda_force_full_runtime);		Args.hasArg(options::OPT_fopenmp_cuda_force_full_runtime);

// FIXME: Eliminate this dependency.		// FIXME: Eliminate this dependency.
▲ Show 20 Lines • Show All 699 Lines • Show Last 20 Lines

llvm/include/llvm/ADT/Triple.h

Show First 20 Lines • Show All 158 Lines • ▼ Show 20 Lines	enum VendorType {
MipsTechnologies,		MipsTechnologies,
NVIDIA,		NVIDIA,
CSR,		CSR,
Myriad,		Myriad,
AMD,		AMD,
Mesa,		Mesa,
SUSE,		SUSE,
OpenEmbedded,		OpenEmbedded,
LastVendorType = OpenEmbedded		OpenMP_VGPU,
		LastVendorType = OpenMP_VGPU
		jdoerfert (Done): Let's call it OpenMP_VGPU or something like that to make it clear.
};		};
enum OSType {		enum OSType {
UnknownOS,		UnknownOS,

Ananas,		Ananas,
CloudABI,		CloudABI,
Darwin,		Darwin,
DragonFly,		DragonFly,
▲ Show 20 Lines • Show All 815 Lines • Show Last 20 Lines

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h

Show First 20 Lines • Show All 108 Lines • ▼ Show 20 Lines	static constexpr GV NVPTXGridValues = {
256, // GV_Slot_Size		256, // GV_Slot_Size
32, // GV_Warp_Size		32, // GV_Warp_Size
1024, // GV_Max_Teams		1024, // GV_Max_Teams
896, // GV_SimpleBufferSize		896, // GV_SimpleBufferSize
1024, // GV_Max_WG_Size		1024, // GV_Max_WG_Size
128, // GV_Default_WG_Size		128, // GV_Default_WG_Size
};		};

		/// For Virtual GPUs
		static constexpr GV VirtualGpuGridValues = {
		256, // GV_Slot_Size
		32, // GV_Warp_Size
		1024, // GV_Max_Teams
		896, // GV_SimpleBufferSize
		1024, // GV_Max_WG_Size
		128, // GV_Default_WG_Size
		};

} // namespace omp		} // namespace omp
} // namespace llvm		} // namespace llvm

#endif // LLVM_FRONTEND_OPENMP_OMPGRIDVALUES_H		#endif // LLVM_FRONTEND_OPENMP_OMPGRIDVALUES_H

llvm/lib/Support/Triple.cpp

Show First 20 Lines • Show All 179 Lines • ▼ Show 20 Lines StringRef Triple::getVendorTypeName(VendorType Kind) {

case Mesa: return "mesa"; case Mesa: return "mesa";

case MipsTechnologies: return "mti"; case MipsTechnologies: return "mti";

case Myriad: return "myriad"; case Myriad: return "myriad";

case NVIDIA: return "nvidia"; case NVIDIA: return "nvidia";

case OpenEmbedded: return "oe"; case OpenEmbedded: return "oe";

case PC: return "pc"; case PC: return "pc";

case SCEI: return "scei"; case SCEI: return "scei";

case SUSE: return "suse"; case SUSE: return "suse";

case OpenMP_VGPU:

return "vgpu";

tianshilei1992 (Done): `"openmp_vpu"`?

} }

llvm_unreachable("Invalid VendorType!"); llvm_unreachable("Invalid VendorType!");

} }

StringRef Triple::getOSTypeName(OSType Kind) { StringRef Triple::getOSTypeName(OSType Kind) {

switch (Kind) { switch (Kind) {

case UnknownOS: return "unknown"; case UnknownOS: return "unknown";

▲ Show 20 Lines • Show All 291 Lines • ▼ Show 20 Lines if (ArchName.startswith("bpf"))

return parseBPFArch(ArchName); return parseBPFArch(ArchName);

} }

return AT; return AT;

} }

static Triple::VendorType parseVendor(StringRef VendorName) { static Triple::VendorType parseVendor(StringRef VendorName) {

return StringSwitch<Triple::VendorType>(VendorName) return StringSwitch<Triple::VendorType>(VendorName)

.Case("apple", Triple::Apple) .Case("apple", Triple::Apple)

.Case("pc", Triple::PC) .Case("pc", Triple::PC)

.Case("scei", Triple::SCEI) .Case("scei", Triple::SCEI)

.Case("sie", Triple::SCEI) .Case("sie", Triple::SCEI)

.Case("fsl", Triple::Freescale) .Case("fsl", Triple::Freescale)

.Case("ibm", Triple::IBM) .Case("ibm", Triple::IBM)

.Case("img", Triple::ImaginationTechnologies) .Case("img", Triple::ImaginationTechnologies)

.Case("mti", Triple::MipsTechnologies) .Case("mti", Triple::MipsTechnologies)

.Case("nvidia", Triple::NVIDIA) .Case("nvidia", Triple::NVIDIA)

.Case("csr", Triple::CSR) .Case("csr", Triple::CSR)

.Case("myriad", Triple::Myriad) .Case("myriad", Triple::Myriad)

.Case("amd", Triple::AMD) .Case("amd", Triple::AMD)

.Case("mesa", Triple::Mesa) .Case("mesa", Triple::Mesa)

.Case("suse", Triple::SUSE) .Case("suse", Triple::SUSE)

.Case("oe", Triple::OpenEmbedded) .Case("oe", Triple::OpenEmbedded)

.Case("vgpu", Triple::OpenMP_VGPU)

jdoerfert (Done), suggested edit:

  .Case("oe", Triple::OpenEmbedded)
- .Case("vgpu", Triple::OpenMP_VGPU)
+ .Case("openmp_vgpu", Triple::OpenMP_VGPU)
  .Default(Triple::UnknownVendor);

.Default(Triple::UnknownVendor); .Default(Triple::UnknownVendor);

} }

static Triple::OSType parseOS(StringRef OSName) { static Triple::OSType parseOS(StringRef OSName) {

return StringSwitch<Triple::OSType>(OSName) return StringSwitch<Triple::OSType>(OSName)

.StartsWith("ananas", Triple::Ananas) .StartsWith("ananas", Triple::Ananas)

.StartsWith("cloudabi", Triple::CloudABI) .StartsWith("cloudabi", Triple::CloudABI)

.StartsWith("darwin", Triple::Darwin) .StartsWith("darwin", Triple::Darwin)

.StartsWith("dragonfly", Triple::DragonFly) .StartsWith("dragonfly", Triple::DragonFly)

▲ Show 20 Lines • Show All 1,335 Lines • Show Last 20 Lines

openmp/CMakeLists.txt

Show All 33 Lines	else()

if (NOT MSVC)		if (NOT MSVC)
set(OPENMP_TEST_C_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang)		set(OPENMP_TEST_C_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang)
set(OPENMP_TEST_CXX_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang++)		set(OPENMP_TEST_CXX_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang++)
else()		else()
set(OPENMP_TEST_C_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang.exe)		set(OPENMP_TEST_C_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang.exe)
set(OPENMP_TEST_CXX_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang++.exe)		set(OPENMP_TEST_CXX_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang++.exe)
endif()		endif()

		list(APPEND LIBOMPTARGET_LLVM_INCLUDE_DIRS ${LLVM_MAIN_INCLUDE_DIR} ${LLVM_BINARY_DIR}/include)
endif()		endif()

# Check and set up common compiler flags.		# Check and set up common compiler flags.
include(config-ix)		include(config-ix)
include(HandleOpenMPOptions)		include(HandleOpenMPOptions)

# Set up testing infrastructure.		# Set up testing infrastructure.
include(OpenMPTesting)		include(OpenMPTesting)
▲ Show 20 Lines • Show All 56 Lines • Show Last 20 Lines

openmp/libomptarget/DeviceRTL/CMakeLists.txt

Show First 20 Lines • Show All 126 Lines • ▼ Show 20 Lines
# Set flags for LLVM Bitcode compilation.		# Set flags for LLVM Bitcode compilation.
set(bc_flags -S -x c++ -std=c++17		set(bc_flags -S -x c++ -std=c++17
${clang_opt_flags}		${clang_opt_flags}
-Xclang -emit-llvm-bc		-Xclang -emit-llvm-bc
-Xclang -aux-triple -Xclang ${aux_triple}		-Xclang -aux-triple -Xclang ${aux_triple}
-fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device		-fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device
-I${include_directory}		-I${include_directory}
-I${devicertl_base_directory}/../include		-I${devicertl_base_directory}/../include
		-I${devicertl_base_directory}/../plugins/vgpu/src
		JonChesterfield (Not Done): Should only add this include to the vgpu, not all the plugins. May be able to use relative include paths to drop it entirely.
${LIBOMPTARGET_LLVM_INCLUDE_DIRS_DEVICERTL}		${LIBOMPTARGET_LLVM_INCLUDE_DIRS_DEVICERTL}
)		)

if(${LIBOMPTARGET_DEVICE_DEBUG})		if(${LIBOMPTARGET_DEVICE_DEBUG})
list(APPEND bc_flags -DOMPTARGET_DEBUG=-1)		list(APPEND bc_flags -DOMPTARGET_DEBUG=-1)
else()		else()
list(APPEND bc_flags -DOMPTARGET_DEBUG=0)		list(APPEND bc_flags -DOMPTARGET_DEBUG=0)
endif()		endif()

function(compileDeviceRTLLibrary target_cpu target_name)		function(compileDeviceRTLLibrary target_cpu target_name)
set(target_bc_flags ${ARGN})		set(target_bc_flags ${ARGN})

set(bc_files "")		set(bc_files "")
foreach(src ${src_files})		foreach(src ${src_files})
get_filename_component(infile ${src} ABSOLUTE)		get_filename_component(infile ${src} ABSOLUTE)
get_filename_component(outfile ${src} NAME)		get_filename_component(outfile ${src} NAME)
set(outfile "${outfile}-${target_cpu}.bc")		set(outfile "${outfile}-${target_cpu}.bc")

add_custom_command(OUTPUT ${outfile}		add_custom_command(OUTPUT ${outfile}
COMMAND ${CLANG_TOOL}		COMMAND ${CLANG_TOOL}
${bc_flags}		${bc_flags}
-Xclang -target-cpu -Xclang ${target_cpu}
${target_bc_flags}		${target_bc_flags}
${infile} -o ${outfile}		${infile} -o ${outfile}
DEPENDS ${infile}		DEPENDS ${infile}
IMPLICIT_DEPENDS CXX ${infile}		IMPLICIT_DEPENDS CXX ${infile}
COMMENT "Building LLVM bitcode ${outfile}"		COMMENT "Building LLVM bitcode ${outfile}"
VERBATIM		VERBATIM
)		)
if("${CLANG_TOOL}" STREQUAL "$<TARGET_FILE:clang>")		if("${CLANG_TOOL}" STREQUAL "$<TARGET_FILE:clang>")
▲ Show 20 Lines • Show All 52 Lines • ▼ Show 20 Lines	add_custom_command(TARGET ${bclib_target_name} POST_BUILD
${LIBOMPTARGET_LIBRARY_DIR})		${LIBOMPTARGET_LIBRARY_DIR})

# Install bitcode library under the lib destination folder.		# Install bitcode library under the lib destination folder.
install(FILES ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name} DESTINATION "${OPENMP_INSTALL_LIBDIR}")		install(FILES ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name} DESTINATION "${OPENMP_INSTALL_LIBDIR}")
endfunction()		endfunction()

# Generate a Bitcode library for all the compute capabilities the user requested		# Generate a Bitcode library for all the compute capabilities the user requested
foreach(sm ${nvptx_sm_list})		foreach(sm ${nvptx_sm_list})
compileDeviceRTLLibrary(sm_${sm} nvptx -target nvptx64 -Xclang -target-feature -Xclang +ptx61 "-D__CUDA_ARCH__=${sm}0")		compileDeviceRTLLibrary(sm_${sm} nvptx -Xclang -target-cpu -Xclang sm_${sm} -target nvptx64 -Xclang -target-feature -Xclang +ptx61 "-D__CUDA_ARCH__=${sm}0")
endforeach()		endforeach()

foreach(mcpu ${amdgpu_mcpus})		foreach(mcpu ${amdgpu_mcpus})
compileDeviceRTLLibrary(${mcpu} amdgpu -target amdgcn-amd-amdhsa -D__AMDGCN__ -fvisibility=default -nogpulib)		compileDeviceRTLLibrary(${mcpu} amdgpu -Xclang -target-cpu -Xclang ${mcpu} -target amdgcn-amd-amdhsa -D__AMDGCN__ -fvisibility=default -nogpulib)
endforeach()		endforeach()

		compileDeviceRTLLibrary(vgpu x86_64-vgpu -target x86_64-vgpu -std=c++20 -stdlib=libc++)
		tianshilei1992 (Not Done): It's not a good practice to specify include directories in CMake in this way. Use `include_directories` instead.
		atmnpatel, author (Done): can't quite do that here I think, afaik both `include_directories` and `target_include_directories` require that CMake builds the target, but we specify custom targets/build commands so they don't get pulled in

openmp/libomptarget/DeviceRTL/src/Debug.cpp

[43 lines not shown]

#pragma omp begin declare variant match(device = {arch(amdgcn)})

namespace impl {

static int32_t omp_vprintf(const char *Format, void *Arguments, uint32_t) {

return -1;

}

} // namespace impl

#pragma omp end declare variant

#pragma omp begin declare variant match( \

device = {kind(cpu)}, implementation = {extension(match_any)})

jdoerfert (Not Done), suggested edit:

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

int32_t vprintf(const char *, void *);

namespace impl {

static int32_t omp_vprintf(const char *Format, void *Arguments, uint32_t) {

return vprintf(Format, Arguments);

}

} // namespace impl

#pragma omp end declare variant

int32_t __llvm_omp_vprintf(const char *Format, void *Arguments, uint32_t Size) {

return impl::omp_vprintf(Format, Arguments, Size);

}

/// Current indentation level for the function trace. Only accessed by thread 0.

static uint32_t Level = 0;

#pragma omp allocate(Level) allocator(omp_pteam_mem_alloc)

[22 lines not shown]

openmp/libomptarget/DeviceRTL/src/Kernel.cpp

[101 lines not shown]

void __kmpc_target_deinit(IdentTy *Ident, int8_t Mode, bool) {

FunctionTracingRAII();

const bool IsSPMD = Mode & OMP_TGT_EXEC_MODE_SPMD;

state::assumeInitialState(IsSPMD);

if (IsSPMD)

return;

// Signal the workers to exit the state machine and exit the kernel.

state::ParallelRegionFn = nullptr;

jdoerfert (Not Done): I don't think we should do this. Instead, the plugin should signal as threads finish the kernel.

}

#pragma omp begin declare variant match( \

tianshilei1992 (Not Done): Is this code here unintentional? We don't need to specialize this function for vgpu, IIRC.

jdoerfert (Not Done): We might be able to avoid it if we move the synchronize::threads "effect" into the VGPU instead.

device = {kind(cpu)}, implementation = {extension(match_any)})

jdoerfert (Not Done), suggested edit:

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

void __kmpc_target_deinit(IdentTy *Ident, int8_t Mode, bool) {

FunctionTracingRAII();

const bool IsSPMD = Mode & OMP_TGT_EXEC_MODE_SPMD;

state::assumeInitialState(IsSPMD);

if (IsSPMD)

return;

// Signal the workers to exit the state machine and exit the kernel.

state::ParallelRegionFn = nullptr;

synchronize::threads();

}

#pragma omp end declare variant

int8_t __kmpc_is_spmd_exec_mode() {

FunctionTracingRAII();

return mapping::isSPMDMode();

}

#pragma omp end declare target

openmp/libomptarget/DeviceRTL/src/Mapping.cpp

[15 lines not shown]

#include "Utils.h"

#pragma omp declare target

#include "llvm/Frontend/OpenMP/OMPGridValues.h"

using namespace _OMP;

/// Virtual GPU Implementation

///

///{

#pragma omp begin declare variant match( \

device = {kind(cpu)}, implementation = {extension(match_any)})

jdoerfert (Not Done), suggested edit:

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

#include "ThreadEnvironment.h"

jdoerfert (Done): Move up to the beginning.

namespace _OMP {

namespace impl {

constexpr const llvm::omp::GV &getGridValue() {

return llvm::omp::VirtualGpuGridValues;

}

LaneMaskTy activemask() {

uint64_t B = 0;

uint32_t N = mapping::getWarpSize();

while (N)

B |= (1 << (--N));

return B;

}

LaneMaskTy lanemaskLT() {

const uint32_t Lane = mapping::getThreadIdInWarp();

LaneMaskTy Ballot = mapping::activemask();

LaneMaskTy Mask = ((LaneMaskTy)1 << Lane) - (LaneMaskTy)1;

return Mask & Ballot;

}

LaneMaskTy lanemaskGT() {

const uint32_t Lane = mapping::getThreadIdInWarp();

if (Lane == (mapping::getWarpSize() - 1))

return 0;

LaneMaskTy Ballot = mapping::activemask();

LaneMaskTy Mask = (~((LaneMaskTy)0)) << (Lane + 1);

return Mask & Ballot;

}

uint32_t getThreadIdInWarp() {

return mapping::getThreadIdInBlock() & (mapping::getWarpSize() - 1);

}

uint32_t getThreadIdInBlock() {

return getThreadEnvironment()->getThreadIdInBlock();

}

uint32_t getNumHardwareThreadsInBlock() {

return getThreadEnvironment()->getBlockSize();

}

uint32_t getKernelSize() { return getThreadEnvironment()->getKernelSize(); }

uint32_t getBlockId() { return getThreadEnvironment()->getBlockId(); }

uint32_t getNumberOfBlocks() {

return getThreadEnvironment()->getNumberOfBlocks();

}

uint32_t getNumberOfProcessorElements() { return mapping::getBlockSize(); }

uint32_t getWarpId() {

return mapping::getThreadIdInBlock() / mapping::getWarpSize();

}

uint32_t getWarpSize() { return getThreadEnvironment()->getWarpSize(); }

uint32_t getNumberOfWarpsInBlock() {

return (mapping::getBlockSize() + mapping::getWarpSize() - 1) /

mapping::getWarpSize();

}

} // namespace impl

} // namespace _OMP

#pragma omp end declare variant
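In the virtual-GPU `activemask` above, the literal `1` in `1 << (--N)` has type `int`, so the loop is only reliable for warp sizes up to 31 on platforms with a 32-bit `int`. The following is a minimal host-side sketch of the three lane-mask helpers using 64-bit arithmetic throughout. It is not the patch's code: the names `fullWarpMask`, `laneMaskLT`, and `laneMaskGT` are hypothetical, and the warp size is passed as a parameter instead of being read from the thread environment.

```cpp
#include <cassert>
#include <cstdint>

using LaneMaskTy = uint64_t;

// Mask with one bit set for each of the first WarpSize lanes.
// Assumption: 1 <= WarpSize <= 64.
LaneMaskTy fullWarpMask(uint32_t WarpSize) {
  assert(WarpSize >= 1 && WarpSize <= 64 && "warp size out of range");
  // Shifting a 64-bit value by 64 is undefined, so handle that case directly.
  return WarpSize == 64 ? ~LaneMaskTy(0) : (LaneMaskTy(1) << WarpSize) - 1;
}

// Lanes whose index is strictly below Lane.
LaneMaskTy laneMaskLT(uint32_t Lane, uint32_t WarpSize) {
  assert(Lane < WarpSize && "lane outside the warp");
  return ((LaneMaskTy(1) << Lane) - 1) & fullWarpMask(WarpSize);
}

// Lanes whose index is strictly above Lane.
LaneMaskTy laneMaskGT(uint32_t Lane, uint32_t WarpSize) {
  assert(Lane < WarpSize && "lane outside the warp");
  if (Lane == WarpSize - 1)
    return 0; // Also avoids a shift by 64 when WarpSize == 64.
  return (~LaneMaskTy(0) << (Lane + 1)) & fullWarpMask(WarpSize);
}
```

For any lane, the union of `laneMaskLT`, `laneMaskGT`, and the lane's own bit should equal `fullWarpMask`, which is a cheap invariant to check when changing the warp size.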

namespace _OMP {

namespace impl {

/// AMDGCN Implementation

///

///{

#pragma omp begin declare variant match(device = {arch(amdgcn)})

[123 lines not shown]

uint32_t getWarpSize() { return getGridValue().GV_Warp_Size; }

} // namespace impl

} // namespace _OMP

/// We have to be deliberate about the distinction of `mapping::` and `impl::`

/// below to avoid repeating assumptions or including irrelevant ones.

///{

jdoerfert (Done): We probably should use kind(CPU) or something instead. Nothing x86 specific about it, I think.

static bool isInLastWarp() {

uint32_t MainTId = (mapping::getNumberOfProcessorElements() - 1) &

~(mapping::getWarpSize() - 1);

return mapping::getThreadIdInBlock() == MainTId;

}

bool mapping::isMainThreadInGenericMode(bool IsSPMD) {

[115 lines not shown]
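The index arithmetic in `isInLastWarp`, `getWarpId`, `getThreadIdInWarp`, and `getNumberOfWarpsInBlock` can be tried in isolation. The sketch below restates it as free functions with hypothetical names (`mainThreadId`, `warpId`, `laneId`, `numWarps`). The bit-mask forms only work when the warp size is a power of two; the sketch asserts that, which is an assumption added here, since the virtual GPU reads its warp size from the thread environment and this diff excerpt does not show whether that value is validated.

```cpp
#include <cassert>
#include <cstdint>

// True if X is a non-zero power of two.
bool isPowerOfTwo(uint32_t X) { return X != 0 && (X & (X - 1)) == 0; }

// Id of the first thread of the last warp, which isInLastWarp() compares
// against to pick the main thread in generic mode.
uint32_t mainThreadId(uint32_t NumThreads, uint32_t WarpSize) {
  assert(NumThreads != 0 && isPowerOfTwo(WarpSize));
  return (NumThreads - 1) & ~(WarpSize - 1);
}

// Warp that a thread belongs to.
uint32_t warpId(uint32_t ThreadId, uint32_t WarpSize) {
  assert(WarpSize != 0);
  return ThreadId / WarpSize;
}

// Position of a thread inside its warp (mask form, power-of-two only).
uint32_t laneId(uint32_t ThreadId, uint32_t WarpSize) {
  assert(isPowerOfTwo(WarpSize));
  return ThreadId & (WarpSize - 1);
}

// Number of warps needed to cover NumThreads, rounding up.
uint32_t numWarps(uint32_t NumThreads, uint32_t WarpSize) {
  assert(WarpSize != 0);
  return (NumThreads + WarpSize - 1) / WarpSize;
}
```

With 100 threads and a warp size of 32, the last warp is only partially filled, yet `mainThreadId` still selects thread 96, the first lane of that warp.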

openmp/libomptarget/DeviceRTL/src/Misc.cpp

	[12 lines not shown]

	#include "Debug.h"			#include "Debug.h"

	#pragma omp declare target			#pragma omp declare target

	namespace _OMP {			namespace _OMP {
	namespace impl {			namespace impl {

	/// AMDGCN Implementation			/// Generic Implementation - AMDGCN, VGPU
	///			///
	///{			///{
	#pragma omp begin declare variant match(device = {arch(amdgcn)})

	double getWTick() { return ((double)1E-9); }			double getWTick() { return ((double)1E-9); }

	double getWTime() {			double getWTime() {
	// The intrinsics for measuring time have undocumented frequency			// The intrinsics for measuring time have undocumented frequency
	// This will probably need to be found by measurement on a number of			// This will probably need to be found by measurement on a number of
	// architectures. Until then, return 0, which is very inaccurate as a			// architectures. Until then, return 0, which is very inaccurate as a
	// timer but resolves the undefined symbol at link time.			// timer but resolves the undefined symbol at link time.
	return 0;			return 0;
	}			}

	#pragma omp end declare variant

	/// NVPTX Implementation			/// NVPTX Implementation
	///			///
	///{			///{
	#pragma omp begin declare variant match( \			#pragma omp begin declare variant match( \
	device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})			device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})

	double getWTick() {			double getWTick() {
	// Timer precision is 1ns			// Timer precision is 1ns
	[36 lines not shown]

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

[277 lines not shown]

void setLock(omp_lock_t *Lock) {

} // wait for 0 to be the read value

}

#pragma omp end declare variant

///}

} // namespace impl

/// Virtual GPU Implementation

///

///{

#pragma omp begin declare variant match( \

device = {kind(cpu)}, implementation = {extension(match_any)})

jdoerfert (Not Done), suggested edit:

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

#include "ThreadEnvironment.h"

jdoerfert (Done): Move up.

namespace impl {

uint32_t atomicInc(uint32_t *Address, uint32_t Val, int Ordering) {

return VGPUImpl::atomicInc(Address, Val, Ordering);

}

void namedBarrierInit() {}

void namedBarrier() {

uint32_t NumThreads = omp_get_num_threads();

ASSERT(NumThreads % mapping::getWarpSize() == 0);

getThreadEnvironment()->namedBarrier(true);

}

void fenceTeam(int) { getThreadEnvironment()->fenceTeam(); }

void fenceKernel(int memory_order) {

getThreadEnvironment()->fenceKernel(memory_order);

}

// Simply call fenceKernel because there is no need to sync with host

void fenceSystem(int) { fenceKernel(0); }

jdoerfert (Done): Pass the memory order, also rename the arguments to match the coding convention.

void syncWarp(__kmpc_impl_lanemask_t Mask) {

getThreadEnvironment()->syncWarp();

jdoerfert (Done): Pass the mask.

}

void syncThreads() { getThreadEnvironment()->namedBarrier(false); }

constexpr uint32_t OMP_SPIN = 1000;

jdoerfert (Not Done): Remove these. Also the TODO below (copied from somewhere).

constexpr uint32_t UNSET = 0;

constexpr uint32_t SET = 1;

// TODO: This seems to hide a bug in the declare variant handling. If it is

// called before it is defined

// here the overload won't happen. Investigate later!

void unsetLock(omp_lock_t *Lock) {

(void)atomicExchange((uint32_t *)Lock, UNSET, __ATOMIC_SEQ_CST);

}

int testLock(omp_lock_t *Lock) {

return atomicAdd((uint32_t *)Lock, 0u, __ATOMIC_SEQ_CST);

}

void initLock(omp_lock_t *Lock) { unsetLock(Lock); }

void destroyLock(omp_lock_t *Lock) { unsetLock(Lock); }

void setLock(omp_lock_t *Lock) {

VGPUImpl::setLock((uint32_t *)Lock, UNSET, SET, OMP_SPIN,

jdoerfert (Not Done): We should simply use omp locks, either here or, maybe better, in VGPUImpl. So redirect all calls there and use a proper lock; no OMP_SPIN and stuff.

mapping::getBlockId(), atomicCAS);

}

void syncThreadsAligned() {}

} // namespace impl

#pragma omp end declare variant

///}

void synchronize::init(bool IsSPMD) {

if (!IsSPMD)

impl::namedBarrierInit();

}

void synchronize::warp(LaneMaskTy Mask) { impl::syncWarp(Mask); }

void synchronize::threads() { impl::syncThreads(); }

[117 lines not shown]
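The review comment on `setLock` asks for a proper lock without `OMP_SPIN` or clock-based back-off. One way that could look on the host is sketched below. It is only an illustration under stated assumptions: it treats the lock word as a `std::atomic<uint32_t>` (the patch casts `omp_lock_t *` to `uint32_t *` instead), the names `tryLock`, `setLockSketch`, and `unsetLockSketch` are hypothetical, and it yields to the scheduler instead of using OpenMP host locks as the reviewer suggested.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

constexpr uint32_t Unset = 0;
constexpr uint32_t Set = 1;

// Try once to take the lock; returns true on success.
bool tryLock(std::atomic<uint32_t> &Lock) {
  uint32_t Expected = Unset;
  return Lock.compare_exchange_strong(Expected, Set,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed);
}

// Block until the lock is taken. No spin constants or clock arithmetic.
void setLockSketch(std::atomic<uint32_t> &Lock) {
  while (!tryLock(Lock)) {
    // Wait until the lock looks free before retrying the compare-exchange,
    // and let other host threads run in the meantime.
    while (Lock.load(std::memory_order_relaxed) != Unset)
      std::this_thread::yield();
  }
}

// Release the lock so that another thread's tryLock can succeed.
void unsetLockSketch(std::atomic<uint32_t> &Lock) {
  Lock.store(Unset, std::memory_order_release);
}
```

Because every emulated GPU thread is a host thread here, yielding lets the lock holder make progress even when there are more emulated threads than cores, which a fixed busy-wait does not guarantee.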

openmp/libomptarget/DeviceRTL/src/Utils.cpp

[43 lines not shown]

}

uint64_t Pack(uint32_t LowBits, uint32_t HighBits) {

return (((uint64_t)HighBits) << 32) | (uint64_t)LowBits;

}

#pragma omp end declare variant

/// Virtual GPU Implementation

///

///{

#pragma omp begin declare variant match( \

device = {kind(cpu)}, implementation = {extension(match_any)})

jdoerfert (Not Done), suggested edit:

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

void Unpack(uint64_t Val, uint32_t *LowBits, uint32_t *HighBits) {

*LowBits = (uint32_t)(Val & static_cast<uint64_t>(0x00000000FFFFFFFF));

*HighBits =

(uint32_t)((Val & static_cast<uint64_t>(0xFFFFFFFF00000000)) >> 32);

}

uint64_t Pack(uint32_t LowBits, uint32_t HighBits) {

return (((uint64_t)HighBits) << 32) | (uint64_t)LowBits;

}

#pragma omp end declare variant

jdoerfert (Not Done): Can't we merge this with AMDGPU?

/// NVPTX Implementation

///

///{

#pragma omp begin declare variant match( \

device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})

void Unpack(uint64_t Val, uint32_t *LowBits, uint32_t *HighBits) {

uint32_t LowBitsLocal, HighBitsLocal;

[48 lines not shown]

int32_t shuffleDown(uint64_t Mask, int32_t Var, uint32_t Delta, int32_t Width) {

int32_t T = ((mapping::getWarpSize() - Width) << 8) | 0x1f;

return __nvvm_shfl_sync_down_i32(Mask, Var, Delta, T);

}

#pragma omp end declare variant

} // namespace impl

/// Virtual GPU Implementation

///

///{

#pragma omp begin declare variant match( \

device = {kind(cpu)}, implementation = {extension(match_any)})

jdoerfert (Not Done), suggested edit:

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

#include "ThreadEnvironment.h"

jdoerfert (Done): Move up.

namespace impl {

int32_t shuffle(uint64_t Mask, int32_t Var, int32_t SrcLane) {

return getThreadEnvironment()->shuffle(Var, SrcLane);

}

int32_t shuffleDown(uint64_t Mask, int32_t Var, uint32_t Delta, int32_t Width) {

return getThreadEnvironment()->shuffleDown(Var, Delta);

jdoerfert (Done): Pass the mask, both times.

}

} // namespace impl

#pragma omp end declare variant

uint64_t utils::pack(uint32_t LowBits, uint32_t HighBits) {

return impl::Pack(LowBits, HighBits);

}

void utils::unpack(uint64_t Val, uint32_t &LowBits, uint32_t &HighBits) {

impl::Unpack(Val, &LowBits, &HighBits);

}

[26 lines not shown]
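The virtual-GPU `Pack` and `Unpack` shown in this file are plain C++ with no target intrinsics, which is presumably why the reviewer asks whether they can be merged with the AMDGPU variant. As a small illustration, the same bit manipulation is restated below as standalone functions; `packBits` and `unpackBits` are hypothetical names used only for this sketch.

```cpp
#include <cstdint>

// Combine two 32-bit halves into one 64-bit value (high half in bits 63..32).
uint64_t packBits(uint32_t LowBits, uint32_t HighBits) {
  return (uint64_t(HighBits) << 32) | uint64_t(LowBits);
}

// Split a 64-bit value back into its two 32-bit halves.
void unpackBits(uint64_t Val, uint32_t &LowBits, uint32_t &HighBits) {
  LowBits = uint32_t(Val & 0xFFFFFFFFull);
  HighBits = uint32_t(Val >> 32);
}
```

A round trip through `packBits` and `unpackBits` returns the original halves for any input, so a single generic definition could in principle serve every target that lacks a dedicated instruction.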

openmp/libomptarget/plugins/CMakeLists.txt

	[69 lines not shown]
	endmacro()			endmacro()

	add_subdirectory(aarch64)			add_subdirectory(aarch64)
	add_subdirectory(amdgpu)			add_subdirectory(amdgpu)
	add_subdirectory(cuda)			add_subdirectory(cuda)
	add_subdirectory(ppc64)			add_subdirectory(ppc64)
	add_subdirectory(ppc64le)			add_subdirectory(ppc64le)
	add_subdirectory(ve)			add_subdirectory(ve)
				add_subdirectory(vgpu)
	add_subdirectory(x86_64)			add_subdirectory(x86_64)
	add_subdirectory(remote)			add_subdirectory(remote)

	# Make sure the parent scope can see the plugins that will be created.			# Make sure the parent scope can see the plugins that will be created.
	set(LIBOMPTARGET_SYSTEM_TARGETS "${LIBOMPTARGET_SYSTEM_TARGETS}" PARENT_SCOPE)			set(LIBOMPTARGET_SYSTEM_TARGETS "${LIBOMPTARGET_SYSTEM_TARGETS}" PARENT_SCOPE)
	set(LIBOMPTARGET_TESTED_PLUGINS "${LIBOMPTARGET_TESTED_PLUGINS}" PARENT_SCOPE)			set(LIBOMPTARGET_TESTED_PLUGINS "${LIBOMPTARGET_TESTED_PLUGINS}" PARENT_SCOPE)

openmp/libomptarget/plugins/vgpu/CMakeLists.txt

This file was added.

				set(tmachine_name "vgpu")
				set(tmachine_libname "vgpu")
				set(tmachine_triple "x86_64-vgpu")
				set(elf_machine_id "62")

				if(LIBOMPTARGET_DEP_LIBELF_FOUND)
				if(LIBOMPTARGET_DEP_LIBFFI_FOUND)

				libomptarget_say("Building ${tmachine_name} offloading plugin.")

				include_directories(${LIBOMPTARGET_DEP_LIBFFI_INCLUDE_DIR})
				include_directories(${LIBOMPTARGET_DEP_LIBELF_INCLUDE_DIR})
				include_directories(${LIBOMPTARGET_INCLUDE_DIR})

				# Define macro to be used as prefix of the runtime messages for this target.
				add_definitions("-DTARGET_NAME=${tmachine_name}")

				# Define macro with the ELF ID for this target.
				add_definitions("-DTARGET_ELF_ID=${elf_machine_id}")

				add_library("omptarget.rtl.${tmachine_libname}" SHARED
				${CMAKE_CURRENT_SOURCE_DIR}/src/rtl.cpp
				${CMAKE_CURRENT_SOURCE_DIR}/src/ThreadEnvironment.cpp)

				# Install plugin under the lib destination folder.
				install(TARGETS "omptarget.rtl.${tmachine_libname}"
				LIBRARY DESTINATION "${OPENMP_INSTALL_LIBDIR}")

				set_target_properties("omptarget.rtl.${tmachine_libname}" PROPERTIES CXX_STANDARD 20)
				target_compile_options("omptarget.rtl.${tmachine_libname}" PRIVATE "-stdlib=libc++")

				target_link_libraries(
				"omptarget.rtl.${tmachine_libname}"
				elf_common
				${LIBOMPTARGET_DEP_LIBFFI_LIBRARIES}
				${LIBOMPTARGET_DEP_LIBELF_LIBRARIES}
				dl
				${OPENMP_PTHREAD_LIB}
				"-rdynamic"
				c++
				#"-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/../exports"
				)

				list(APPEND LIBOMPTARGET_TESTED_PLUGINS
				"omptarget.rtl.${tmachine_libname}")

				# Report to the parent scope that we are building a plugin.
				set(LIBOMPTARGET_SYSTEM_TARGETS
				"${LIBOMPTARGET_SYSTEM_TARGETS} ${tmachine_triple}" PARENT_SCOPE)
				set(LIBOMPTARGET_TESTED_PLUGINS
				"${LIBOMPTARGET_TESTED_PLUGINS}" PARENT_SCOPE)

				else(LIBOMPTARGET_DEP_LIBFFI_FOUND)
				libomptarget_say("Not building ${tmachine_name} offloading plugin: libffi dependency not found.")
				endif(LIBOMPTARGET_DEP_LIBFFI_FOUND)
				else(LIBOMPTARGET_DEP_LIBELF_FOUND)
				libomptarget_say("Not building ${tmachine_name} offloading plugin: libelf dependency not found.")
				endif(LIBOMPTARGET_DEP_LIBELF_FOUND)

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.h

This file was added.

				//===---- ThreadEnvironment.h - Virtual GPU thread environment ----- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENT_H
				#define OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENT_H

				using LaneMaskTy = uint64_t;

				// Forward declaration
				class WarpEnvironmentTy;
				class ThreadBlockEnvironmentTy;
				class CTAEnvironmentTy;
				namespace VGPUImpl {
				class ThreadEnvironmentTy;
				void setLock(uint32_t *Lock, uint32_t Unset, uint32_t Set, uint32_t OmpSpin,
				uint32_t BlockId,
				uint32_t(atomicCAS)(uint32_t *, uint32_t, uint32_t, int));
				uint32_t atomicInc(uint32_t *Address, uint32_t Val, int Ordering);
				} // namespace VGPUImpl

				class ThreadEnvironmentTy {
				VGPUImpl::ThreadEnvironmentTy *Impl;

				public:
				ThreadEnvironmentTy(unsigned Id, WarpEnvironmentTy *WE,
				CTAEnvironmentTy *CTAE);

				~ThreadEnvironmentTy();

				unsigned getThreadIdInWarp() const;

				unsigned getThreadIdInBlock() const;

				unsigned getGlobalThreadId() const;

				unsigned getBlockSize() const;

				unsigned getKernelSize() const;

				unsigned getBlockId() const;

				unsigned getNumberOfBlocks() const;

				LaneMaskTy getActiveMask() const;

				unsigned getWarpSize() const;

				int32_t shuffle(int32_t Var, uint64_t SrcLane);

				int32_t shuffleDown(int32_t Var, uint32_t Delta);

				void fenceKernel(int32_t MemoryOrder);

				void fenceTeam();

				void syncWarp();

				void namedBarrier(bool Generic);

				void setBlockEnv(ThreadBlockEnvironmentTy *TBE);

				void resetBlockEnv();
				};

				ThreadEnvironmentTy *getThreadEnvironment(void);

				#endif // OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENT_H

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp

This file was added.

				//===---- DeviceEnvironment.cpp - Virtual GPU Device Environment -- C++ ---===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Implementation of VGPU environment classes.
				//
				//===----------------------------------------------------------------------===//

				// clang-format off
				#include <cstdint>
				#include "ThreadEnvironment.h"
				#include "ThreadEnvironmentImpl.h"
				#include <barrier>
				#include <mutex>
				// clang-format on

				std::mutex AtomicIncLock;

				uint32_t VGPUImpl::atomicInc(uint32_t *Address, uint32_t Val, int Ordering) {
				std::lock_guard G(AtomicIncLock);
				uint32_t V = *Address;
				if (V >= Val)
				*Address = 0;
				else
				*Address += 1;
				return V;
				}

				void VGPUImpl::setLock(uint32_t *Lock, uint32_t Unset, uint32_t Set,
				uint32_t OmpSpin, uint32_t BlockId,
				uint32_t(atomicCAS)(uint32_t *, uint32_t, uint32_t,
				int)) {
				// TODO: not sure spinning is a good idea here..
				while (atomicCAS((uint32_t *)Lock, Unset, Set, __ATOMIC_SEQ_CST) != Unset) {
				std::clock_t start = std::clock();
				std::clock_t now;
				for (;;) {
				now = std::clock();
				std::clock_t cycles =
				now > start ? now - start : now + (0xffffffff - start);
				if (cycles >= 1000 * BlockId) {
				break;
				}
				}
				} // wait for 0 to be the read value
				}
				jdoerfert (Not Done): See above.

				extern thread_local ThreadEnvironmentTy *ThreadEnvironment;

				ThreadEnvironmentTy *getThreadEnvironment() { return ThreadEnvironment; }

				ThreadEnvironmentTy::ThreadEnvironmentTy(unsigned Id, WarpEnvironmentTy *WE,
				CTAEnvironmentTy *CTAE)
				: Impl(new VGPUImpl::ThreadEnvironmentTy(Id, WE, CTAE)) {}

				ThreadEnvironmentTy::~ThreadEnvironmentTy() { delete Impl; }

				void ThreadEnvironmentTy::fenceTeam() { Impl->fenceTeam(); }

				void ThreadEnvironmentTy::syncWarp() { Impl->syncWarp(); }

				unsigned ThreadEnvironmentTy::getThreadIdInWarp() const {
				return Impl->getThreadIdInWarp();
				}

				unsigned ThreadEnvironmentTy::getThreadIdInBlock() const {
				return Impl->getThreadIdInBlock();
				}

				unsigned ThreadEnvironmentTy::getGlobalThreadId() const {
				return Impl->getGlobalThreadId();
				}

				unsigned ThreadEnvironmentTy::getBlockSize() const {
				return Impl->getBlockSize();
				}

				unsigned ThreadEnvironmentTy::getKernelSize() const {
				return Impl->getKernelSize();
				}

				unsigned ThreadEnvironmentTy::getBlockId() const { return Impl->getBlockId(); }

				unsigned ThreadEnvironmentTy::getNumberOfBlocks() const {
				return Impl->getNumberOfBlocks();
				}

				LaneMaskTy ThreadEnvironmentTy::getActiveMask() const {
				return Impl->getActiveMask();
				}

				int32_t ThreadEnvironmentTy::shuffle(int32_t Var, uint64_t SrcLane) {
				return Impl->shuffle(Var, SrcLane);
				}

				int32_t ThreadEnvironmentTy::shuffleDown(int32_t Var, uint32_t Delta) {
				return Impl->shuffleDown(Var, Delta);
				}

				void ThreadEnvironmentTy::fenceKernel(int32_t MemoryOrder) {
				return Impl->fenceKernel(MemoryOrder);
				}

				void ThreadEnvironmentTy::namedBarrier(bool Generic) {
				Impl->namedBarrier(Generic);
				}

				void ThreadEnvironmentTy::setBlockEnv(ThreadBlockEnvironmentTy *TBE) {
				Impl->setBlockEnv(TBE);
				}

				void ThreadEnvironmentTy::resetBlockEnv() { Impl->resetBlockEnv(); }

				unsigned ThreadEnvironmentTy::getWarpSize() const {
				return Impl->getWarpSize();
				}
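`VGPUImpl::atomicInc` above implements a wrapping increment (store 0 once the old value reaches `Val`, otherwise add 1, and return the old value) under a single global `std::mutex`. That only excludes other callers of `atomicInc`; accesses to the same word through other atomic operations do not take the mutex. A possible lock-free alternative is sketched below. It assumes the counter can be handled as a `std::atomic<uint32_t>`, which the patch does not do, and the name `atomicIncWrap` is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Wrapping increment: stores 0 if the old value is >= Limit, otherwise
// old + 1, and returns the old value. Implemented as a compare-exchange loop.
uint32_t atomicIncWrap(std::atomic<uint32_t> &Address, uint32_t Limit) {
  uint32_t Old = Address.load(std::memory_order_relaxed);
  uint32_t New;
  do {
    New = Old >= Limit ? 0 : Old + 1;
    // On failure compare_exchange_weak reloads Old with the current value,
    // so New is recomputed from fresh data on the next iteration.
  } while (!Address.compare_exchange_weak(Old, New,
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed));
  return Old;
}
```

Calling it three times on a counter that starts at 0 with a limit of 2 returns 0, 1, 2 and leaves the counter back at 0.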

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h

This file was added.

				//===---- ThreadEnvironmentImpl.h - Virtual GPU thread environment - C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENTIMPL_H
				#define OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENTIMPL_H

				#include "ThreadEnvironment.h"
				#include <barrier>
				#include <cstdio>
				#include <functional>
				#include <map>
				#include <thread>
				#include <vector>

				class WarpEnvironmentTy {
				const unsigned ID;
				const unsigned NumThreads;

				std::vector<int32_t> ShuffleBuffer;

				std::barrier<std::function<void(void)>> Barrier;
				std::barrier<std::function<void(void)>> ShuffleBarrier;
				std::barrier<std::function<void(void)>> ShuffleDownBarrier;

				public:
				WarpEnvironmentTy(unsigned ID, unsigned NumThreads)
				: ID(ID), NumThreads(NumThreads), ShuffleBuffer(NumThreads),
				Barrier(NumThreads, []() {}), ShuffleBarrier(NumThreads, []() {}),
				ShuffleDownBarrier(NumThreads, []() {}) {}

				unsigned getWarpId() const { return ID; }
				int getNumThreads() const { return NumThreads; }

				void sync() { Barrier.arrive_and_wait(); }
				void writeShuffleBuffer(int32_t Var, unsigned LaneId) {
				ShuffleBuffer[LaneId] = Var;
				}

				int32_t getShuffleBuffer(unsigned LaneId) { return ShuffleBuffer[LaneId]; }

				void waitShuffleBarrier() { ShuffleBarrier.arrive_and_wait(); }

				void waitShuffleDownBarrier() { ShuffleDownBarrier.arrive_and_wait(); }
				};

				class CTAEnvironmentTy {
				public:
				unsigned ID;
				unsigned NumThreads;
				unsigned NumBlocks;

				std::barrier<std::function<void(void)>> Barrier;
				std::barrier<std::function<void(void)>> SyncThreads;
				std::barrier<std::function<void(void)>> NamedBarrier;

				CTAEnvironmentTy(unsigned ID, unsigned NumThreads, unsigned NumBlocks)
				: ID(ID), NumThreads(NumThreads), NumBlocks(NumBlocks),
				Barrier(NumThreads, []() {}), SyncThreads(NumThreads, []() {}),
				NamedBarrier(NumThreads, []() {}) {}

				unsigned getId() const { return ID; }
				unsigned getNumThreads() const { return NumThreads; }

				unsigned getNumBlocks() const { return NumBlocks; }

				void fence() { Barrier.arrive_and_wait(); }
				void syncThreads() { SyncThreads.arrive_and_wait(); }
				void namedBarrier() { NamedBarrier.arrive_and_wait(); }
				};

				class ThreadBlockEnvironmentTy {
				unsigned ID;
				unsigned NumBlocks;

				public:
				ThreadBlockEnvironmentTy(unsigned ID, unsigned NumBlocks)
				: ID(ID), NumBlocks(NumBlocks) {}

				unsigned getId() const { return ID; }
				unsigned getNumBlocks() const { return NumBlocks; }
				};

				namespace VGPUImpl {
				class ThreadEnvironmentTy {
				unsigned ThreadIdInWarp;
				unsigned ThreadIdInBlock;
				unsigned GlobalThreadIdx;

				WarpEnvironmentTy *WarpEnvironment;
				ThreadBlockEnvironmentTy *ThreadBlockEnvironment;
				CTAEnvironmentTy *CTAEnvironment;

				public:
				ThreadEnvironmentTy(unsigned ThreadId, WarpEnvironmentTy *WE,
				CTAEnvironmentTy *CTAE)
				: ThreadIdInWarp(ThreadId),
				ThreadIdInBlock(WE->getWarpId() * WE->getNumThreads() + ThreadId),
				GlobalThreadIdx(CTAE->getId() * CTAE->getNumThreads() +
				ThreadIdInBlock),
				WarpEnvironment(WE), CTAEnvironment(CTAE) {}

				void setBlockEnv(ThreadBlockEnvironmentTy *TBE) {
				ThreadBlockEnvironment = TBE;
				}

				void resetBlockEnv() {
				delete ThreadBlockEnvironment;
				ThreadBlockEnvironment = nullptr;
				}

				unsigned getThreadIdInWarp() const { return ThreadIdInWarp; }
				unsigned getThreadIdInBlock() const { return ThreadIdInBlock; }
				unsigned getGlobalThreadId() const { return GlobalThreadIdx; }

				jdoerfert (Not Done): At least add more information on what the problem and potential solutions are.
				unsigned getBlockSize() const { return CTAEnvironment->getNumThreads(); }

				unsigned getBlockId() const { return ThreadBlockEnvironment->getId(); }

				unsigned getNumberOfBlocks() const {
				return ThreadBlockEnvironment->getNumBlocks();
				}
				unsigned getKernelSize() const {}

				// FIXME: This is wrong
				LaneMaskTy getActiveMask() const { return ~0U; }

				void fenceTeam() { CTAEnvironment->fence(); }
				void syncWarp() { WarpEnvironment->sync(); }

				int32_t shuffle(int32_t Var, uint64_t SrcLane) {
				WarpEnvironment->waitShuffleBarrier();
				WarpEnvironment->writeShuffleBuffer(Var, ThreadIdInWarp);
				WarpEnvironment->waitShuffleBarrier();
				Var = WarpEnvironment->getShuffleBuffer(ThreadIdInWarp);
				return Var;
				}

				int32_t shuffleDown(int32_t Var, uint32_t Delta) {
				WarpEnvironment->waitShuffleDownBarrier();
				WarpEnvironment->writeShuffleBuffer(Var, ThreadIdInWarp);
				WarpEnvironment->waitShuffleDownBarrier();
				Var = WarpEnvironment->getShuffleBuffer((ThreadIdInWarp + Delta) %
				getWarpSize());
				return Var;
				}

				void namedBarrier(bool Generic) {
				if (Generic) {
				CTAEnvironment->namedBarrier();
				} else {
				CTAEnvironment->syncThreads();
				}
				}

				void fenceKernel(int32_t MemoryOrder) {
				std::atomic_thread_fence(static_cast<std::memory_order>(MemoryOrder));
				}

				unsigned getWarpSize() const { return WarpEnvironment->getNumThreads(); }
				};
				} // namespace VGPUImpl

				#endif // OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENTIMPL_H
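The `shuffle` and `shuffleDown` members above follow a publish, barrier, read pattern: every lane writes its value into `ShuffleBuffer`, waits, and then reads a slot. As written, `shuffle` reads back the caller's own slot and never uses `SrcLane`. The single-threaded sketch below models only the read step, with the barriers left out. The helper names are hypothetical. `shuffleDownWrap` mirrors the modulo used in the patch, while `shuffleDownKeep` shows the alternative in which a lane keeps its own value when the source falls outside the warp, which is how CUDA documents its shuffle-down. Which of the two the DeviceRTL callers need is not established here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Buffer[i] holds the value lane i published before the exchange barrier.

// Plain shuffle: the caller receives the value published by SrcLane.
int32_t shuffleRead(const std::vector<int32_t> &Buffer, uint32_t SrcLane) {
  assert(SrcLane < Buffer.size() && "source lane outside the warp");
  return Buffer[SrcLane];
}

// Shuffle-down with wrap-around, mirroring the modulo in the patch.
int32_t shuffleDownWrap(const std::vector<int32_t> &Buffer, uint32_t Lane,
                        uint32_t Delta) {
  assert(Lane < Buffer.size() && "lane outside the warp");
  uint64_t Src = (uint64_t(Lane) + Delta) % Buffer.size();
  return Buffer[Src];
}

// Shuffle-down without wrap-around: out-of-range lanes keep their own value.
int32_t shuffleDownKeep(const std::vector<int32_t> &Buffer, uint32_t Lane,
                        uint32_t Delta) {
  assert(Lane < Buffer.size() && "lane outside the warp");
  uint64_t Src = uint64_t(Lane) + Delta;
  return Src < Buffer.size() ? Buffer[Src] : Buffer[Lane];
}
```

The two shuffle-down variants agree whenever the source lane is in range and differ only for the top `Delta` lanes, so a tree reduction that ignores those lanes would not notice the difference.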

openmp/libomptarget/plugins/vgpu/src/rtl.cpp

This file was added.

				//===------RTLs/vgpu/src/rtl.cpp - Target RTLs Implementation ----- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// RTL for virtual (x86) GPU
				//
				//===----------------------------------------------------------------------===//

				#include <barrier>
				#include <cassert>
				#include <cmath>
				#include <condition_variable>
				#include <cstdio>
				#include <cstdlib>
				#include <cstring>
				#include <dlfcn.h>
				#include <ffi.h>
				#include <functional>
				#include <gelf.h>
				#include <link.h>
				#include <list>
				#include <memory>
				#include <mutex>
				#include <queue>
				#include <thread>
				#include <vector>

				#include "Debug.h"
				#include "ThreadEnvironment.h"
				#include "ThreadEnvironmentImpl.h"
				#include "omptarget.h"
				#include "omptargetplugin.h"

				#ifndef TARGET_NAME
				#define TARGET_NAME Generic ELF - 64bit
				#endif
				#define DEBUG_PREFIX "TARGET " GETNAME(TARGET_NAME) " RTL"

				#ifndef TARGET_ELF_ID
				#define TARGET_ELF_ID 0
				#endif

				#include "elf_common.h"

				#define NUMBER_OF_DEVICES 1
				#define OFFLOADSECTIONNAME "omp_offloading_entries"

				#define DEBUG false

				/// Array of Dynamic libraries loaded for this target.
				struct DynLibTy {
				char *FileName;
				void *Handle;
				};

				/// Keep entries table per device.
				struct FuncOrGblEntryTy {
				__tgt_target_table Table;
				};

				thread_local ThreadEnvironmentTy *ThreadEnvironment;

				/// Class containing all the device information.
				class RTLDeviceInfoTy {
				std::vector<std::list<FuncOrGblEntryTy>> FuncGblEntries;

				public:
				std::list<DynLibTy> DynLibs;

				// Record entry point associated with device.
				void createOffloadTable(int32_t device_id, __tgt_offload_entry *begin,
				__tgt_offload_entry *end) {
				assert(device_id < (int32_t)FuncGblEntries.size() &&
				"Unexpected device id!");
				FuncGblEntries[device_id].emplace_back();
				FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();

				E.Table.EntriesBegin = begin;
				E.Table.EntriesEnd = end;
				}

				// Return true if the entry is associated with device.
				bool findOffloadEntry(int32_t device_id, void *addr) {
				assert(device_id < (int32_t)FuncGblEntries.size() &&
				"Unexpected device id!");
				FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();

				for (__tgt_offload_entry *i = E.Table.EntriesBegin, *e = E.Table.EntriesEnd;
				i < e; ++i) {
				if (i->addr == addr)
				return true;
				}

				return false;
				}

				// Return the pointer to the target entries table.
				__tgt_target_table *getOffloadEntriesTable(int32_t device_id) {
				assert(device_id < (int32_t)FuncGblEntries.size() &&
				"Unexpected device id!");
				FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();

				return &E.Table;
				}

				RTLDeviceInfoTy(int32_t num_devices) { FuncGblEntries.resize(num_devices); }

				~RTLDeviceInfoTy() {
				// Close dynamic libraries
				for (auto &lib : DynLibs) {
				if (lib.Handle) {
				dlclose(lib.Handle);
				remove(lib.FileName);
				}
				}
				}
				};

				static RTLDeviceInfoTy DeviceInfo(NUMBER_OF_DEVICES);

				std::vector<CTAEnvironmentTy *> CTAEnvironments;
				std::vector<WarpEnvironmentTy *> WarpEnvironments;

				struct VGPUTy {
				struct KernelTy {
				ffi_cif *Cif;
				std::function<void(void)> Kernel;
				int NumTeams;

				KernelTy(ffi_cif *Cif, std::function<void(void)> Kernel, int NumTeams)
				: Cif(Cif), Kernel(Kernel), NumTeams(NumTeams) {}
				};

				struct VGPUStreamTy {
				std::queue<KernelTy> Kernels;
				std::mutex Mtx;

				void emplace(ffi_cif *Cif, std::function<void(void)> F, int NumTeams) {
				std::lock_guard Guard(Mtx);
				Kernels.emplace(Cif, F, NumTeams);
				}

				KernelTy front() {
				std::lock_guard Guard(Mtx);
				return Kernels.front();
				}

				void pop() {
				std::lock_guard Guard(Mtx);
				Kernels.pop();
				}

				bool empty() {
				std::lock_guard Guard(Mtx);
				return Kernels.empty();
				}
				};

				struct AsyncInfoQueueTy {
				std::deque<__tgt_async_info *> Streams;
				std::mutex Mtx;

				bool empty() {
				std::lock_guard Guard(Mtx);
				return Streams.empty();
				}

				__tgt_async_info *front() {
				std::lock_guard Guard(Mtx);
				return Streams.front();
				}

				void pop() {
				std::lock_guard Guard(Mtx);
				Streams.pop_front();
				}

				void emplace(__tgt_async_info *AsyncInfo) {
				std::lock_guard Guard(Mtx);
				Streams.emplace_back(AsyncInfo);
				}
				} ExecutionQueue;

				VGPUStreamTy *getStream(__tgt_async_info *AsyncInfo) {
				assert(AsyncInfo != nullptr && "async_info ptr was null");

				if (!AsyncInfo->Queue)
				AsyncInfo->Queue = new VGPUStreamTy();

				return reinterpret_cast<VGPUStreamTy *>(AsyncInfo->Queue);
				}

				std::atomic<bool> Running;
				std::vector<std::thread> Threads;
				int WarpsPerCTA;
				int NumCTAs;

				std::unique_ptr<std::barrier<std::function<void(void)>>> Barrier;
				std::condition_variable WorkAvailable;
				std::mutex WorkDoneMtx;
				std::condition_variable WorkDone;

				VGPUTy(int NumThreads = -1, int ThreadsPerWarp = -1, int WarpsPerCTA = -1)
				: Running(true) {
				if (const char *Env = std::getenv("VGPU_NUM_THREADS"))
				NumThreads = std::stoi(Env);
				if (const char *Env = std::getenv("VGPU_THREADS_PER_WARP"))
				ThreadsPerWarp = std::stoi(Env);
				if (const char *Env = std::getenv("VGPU_WARPS_PER_CTA"))
				WarpsPerCTA = std::stoi(Env);

				if (NumThreads == -1)
				NumThreads = std::thread::hardware_concurrency();
				if (ThreadsPerWarp == -1)
				ThreadsPerWarp = NumThreads;
				if (WarpsPerCTA == -1)
				WarpsPerCTA = 1;

				NumCTAs = NumThreads / (ThreadsPerWarp * WarpsPerCTA);

				// printf("NumThreads: %d, ThreadsPerWarp: %d, WarpsPerCTA: %d\n",
				// NumThreads,
				// ThreadsPerWarp, WarpsPerCTA);

				assert(NumThreads % ThreadsPerWarp == 0 && NumThreads % WarpsPerCTA == 0 &&
				"Invalid VGPU Config");

				Barrier = std::make_unique<std::barrier<std::function<void(void)>>>(
				NumThreads, []() {});

				Threads.reserve(NumThreads);

				auto GlobalThreadIdx = 0;
				for (auto CTAIdx = 0; CTAIdx < NumCTAs; CTAIdx++) {
				auto *CTAEnv =
				new CTAEnvironmentTy(CTAIdx, NumThreads / NumCTAs, NumCTAs);
				for (auto WarpIdx = 0; WarpIdx < WarpsPerCTA; WarpIdx++) {
				auto *WarpEnv = new WarpEnvironmentTy(WarpIdx, ThreadsPerWarp);
				for (auto ThreadIdx = 0; ThreadIdx < ThreadsPerWarp; ThreadIdx++) {
				Threads.emplace_back(
				[this, ThreadIdx, GlobalThreadIdx, CTAEnv, WarpEnv]() {
				ThreadEnvironment =
				new ThreadEnvironmentTy(ThreadIdx, WarpEnv, CTAEnv);
				std::function<void(void)> Kernel;
				while (Running) {
				{
				std::unique_lock<std::mutex> UniqueLock(ExecutionQueue.Mtx);

				WorkAvailable.wait(UniqueLock, [&]() {
				if (!Running) {
				return true;
				}
				bool IsEmpty = ExecutionQueue.Streams.empty();

				return !IsEmpty;
				});
				}

				if (ExecutionQueue.empty()) {
				continue;
				}

				while (!ExecutionQueue.empty()) {
				auto *Stream = getStream(ExecutionQueue.front());
				while (!Stream->empty()) {
				auto KernelInfo = Stream->front();
				Kernel = KernelInfo.Kernel;

				jdoerfert (unsubmitted, not done): Move the lambda into a helper function. Indentation of 12 is too much.
				const unsigned NumTeams = KernelInfo.NumTeams;
				unsigned TeamIdx = 0;
				while (TeamIdx < KernelInfo.NumTeams) {
				if (CTAEnv->getId() < KernelInfo.NumTeams) {
				ThreadEnvironment->setBlockEnv(
				new ThreadBlockEnvironmentTy(
				TeamIdx + CTAEnv->getId(), NumTeams));
				Kernel();
				ThreadEnvironment->resetBlockEnv();
				}
				Barrier->arrive_and_wait();
				TeamIdx += NumCTAs;
				}

				if (GlobalThreadIdx == 0) {
				delete KernelInfo.Cif;
				Stream->pop();
				}

				Barrier->arrive_and_wait();
				}
				if (GlobalThreadIdx == 0) {
				ExecutionQueue.pop();
				WorkDone.notify_all();
				}
				Barrier->arrive_and_wait();
				}
				}
				delete ThreadEnvironment;
				});
				GlobalThreadIdx = (GlobalThreadIdx + 1) % NumThreads;
				}
				jdoerfert (unsubmitted, not done): Can we split this up and create some helper functions maybe?
				WarpEnvironments.push_back(WarpEnv);
				}
				CTAEnvironments.push_back(CTAEnv);
				}
				}

				~VGPUTy() {
				awaitAll();

				Running = false;
				jdoerfert (unsubmitted, not done): When do we have more threads than NumThreads?
				WorkAvailable.notify_all();

				for (auto &Thread : Threads) {
				if (Thread.joinable()) {
				Thread.join();
				}
				}

				for (auto *CTAEnv : CTAEnvironments)
				delete CTAEnv;

				for (auto *WarpEnv : WarpEnvironments)
				delete WarpEnv;
				}

				void await(__tgt_async_info *AsyncInfo) {
				std::unique_lock UniqueLock(getStream(AsyncInfo)->Mtx);
				WorkDone.wait(UniqueLock,
				[&]() { return getStream(AsyncInfo)->Kernels.empty(); });
				}

				void awaitAll() {
				while (!ExecutionQueue.empty()) {
				await(ExecutionQueue.front());
				}
				}

				void scheduleAsync(__tgt_async_info *AsyncInfo, ffi_cif *Cif,
				std::function<void(void)> F, int NumTeams) {
				if (NumTeams == 0)
				NumTeams = NumCTAs;
				auto *Stream = getStream(AsyncInfo);
				Stream->emplace(Cif, F, NumTeams);
				ExecutionQueue.emplace(AsyncInfo);
				WorkAvailable.notify_all();
				}
				};

				VGPUTy VGPU;

				#ifdef __cplusplus
				extern "C" {
				#endif

				int32_t __tgt_rtl_is_valid_binary(__tgt_device_image *image) {
				// If we don't have a valid ELF ID we can just fail.
				#if TARGET_ELF_ID < 1
				return 0;
				#else
				return elf_check_machine(image, TARGET_ELF_ID);
				#endif
				}

				int32_t __tgt_rtl_number_of_devices() { return NUMBER_OF_DEVICES; }

				int32_t __tgt_rtl_init_device(int32_t device_id) { return OFFLOAD_SUCCESS; }

				__tgt_target_table *__tgt_rtl_load_binary(int32_t device_id,
				__tgt_device_image *image) {

				DP("Dev %d: load binary from " DPxMOD " image\n", device_id,
				DPxPTR(image->ImageStart));

				assert(device_id >= 0 && device_id < NUMBER_OF_DEVICES && "bad dev id");

				size_t ImageSize = (size_t)image->ImageEnd - (size_t)image->ImageStart;
				size_t NumEntries = (size_t)(image->EntriesEnd - image->EntriesBegin);
				DP("Expecting to have %zd entries defined.\n", NumEntries);

				// Is the library version incompatible with the header file?
				if (elf_version(EV_CURRENT) == EV_NONE) {
				DP("Incompatible ELF library!\n");
				return NULL;
				}

				// Obtain elf handler
				Elf *e = elf_memory((char *)image->ImageStart, ImageSize);
				if (!e) {
				DP("Unable to get ELF handle: %s!\n", elf_errmsg(-1));
				return NULL;
				}

				if (elf_kind(e) != ELF_K_ELF) {
				DP("Invalid Elf kind!\n");
				elf_end(e);
				return NULL;
				}

				// Find the entries section offset
				Elf_Scn *section = 0;
				Elf64_Off entries_offset = 0;

				size_t shstrndx;

				if (elf_getshdrstrndx(e, &shstrndx)) {
				DP("Unable to get ELF strings index!\n");
				elf_end(e);
				return NULL;
				}

				while ((section = elf_nextscn(e, section))) {
				GElf_Shdr hdr;
				gelf_getshdr(section, &hdr);

				if (!strcmp(elf_strptr(e, shstrndx, hdr.sh_name), OFFLOADSECTIONNAME)) {
				entries_offset = hdr.sh_addr;
				break;
				}
				}

				if (!entries_offset) {
				DP("Entries Section Offset Not Found\n");
				elf_end(e);
				return NULL;
				}

				DP("Offset of entries section is (" DPxMOD ").\n", DPxPTR(entries_offset));

				// load dynamic library and get the entry points. We use the dl library
				// to do the loading of the library, but we could do it directly to avoid
				// the dump to the temporary file.
				//
				// 1) Create tmp file with the library contents.
				// 2) Use dlopen to load the file and dlsym to retrieve the symbols.
				char tmp_name[] = "/tmp/tmpfile_XXXXXX";
				int tmp_fd = mkstemp(tmp_name);

				if (tmp_fd == -1) {
				elf_end(e);
				return NULL;
				}

				FILE *ftmp = fdopen(tmp_fd, "wb");

				if (!ftmp) {
				elf_end(e);
				return NULL;
				}

				fwrite(image->ImageStart, ImageSize, 1, ftmp);
				fclose(ftmp);

				DynLibTy Lib = {tmp_name, dlopen(tmp_name, RTLD_NOW | RTLD_GLOBAL)};

				if (!Lib.Handle) {
				DP("Target library loading error: %s\n", dlerror());
				elf_end(e);
				return NULL;
				}

				DeviceInfo.DynLibs.push_back(Lib);

				struct link_map *libInfo = (struct link_map *)Lib.Handle;

				// The place where the entries info is loaded is the library base address
				// plus the offset determined from the ELF file.
				Elf64_Addr entries_addr = libInfo->l_addr + entries_offset;

				DP("Pointer to first entry to be loaded is (" DPxMOD ").\n",
				DPxPTR(entries_addr));

				// Table of pointers to all the entries in the target.
				__tgt_offload_entry *entries_table = (__tgt_offload_entry *)entries_addr;

				__tgt_offload_entry *entries_begin = &entries_table[0];
				__tgt_offload_entry *entries_end = entries_begin + NumEntries;

				if (!entries_begin) {
				DP("Can't obtain entries begin\n");
				elf_end(e);
				return NULL;
				}

				DP("Entries table range is (" DPxMOD ")->(" DPxMOD ")\n",
				DPxPTR(entries_begin), DPxPTR(entries_end));
				DeviceInfo.createOffloadTable(device_id, entries_begin, entries_end);

				elf_end(e);

				return DeviceInfo.getOffloadEntriesTable(device_id);
				}

				// Sample implementation of explicit memory allocator. For this plugin all
				// kinds are equivalent to each other.
				void *__tgt_rtl_data_alloc(int32_t device_id, int64_t size, void *hst_ptr,
				int32_t kind) {
				void *ptr = NULL;

				switch (kind) {
				case TARGET_ALLOC_DEVICE:
				case TARGET_ALLOC_HOST:
				case TARGET_ALLOC_SHARED:
				case TARGET_ALLOC_DEFAULT:
				ptr = malloc(size);
				break;
				default:
				REPORT("Invalid target data allocation kind");
				}

				return ptr;
				}

				int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr, void *hst_ptr,
				int64_t size) {
				VGPU.awaitAll();
				memcpy(tgt_ptr, hst_ptr, size);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr, void *tgt_ptr,
				int64_t size) {
				VGPU.awaitAll();
				memcpy(hst_ptr, tgt_ptr, size);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_data_delete(int32_t device_id, void *tgt_ptr) {
				free(tgt_ptr);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_synchronize(int32_t device_id, __tgt_async_info *async_info) {
				VGPU.await(async_info);
				delete (VGPUTy::VGPUStreamTy *)async_info->Queue;
				async_info->Queue = nullptr;
				return 0;
				}

				int32_t __tgt_rtl_run_target_team_region(int32_t device_id, void *tgt_entry_ptr,
				void **tgt_args,
				ptrdiff_t *tgt_offsets,
				int32_t arg_num, int32_t team_num,
				int32_t thread_limit,
				uint64_t loop_tripcount) {
				__tgt_async_info AsyncInfo;
				int rc = __tgt_rtl_run_target_team_region_async(
				device_id, tgt_entry_ptr, tgt_args, tgt_offsets, arg_num, team_num,
				thread_limit, loop_tripcount, &AsyncInfo);

				if (rc != OFFLOAD_SUCCESS)
				return OFFLOAD_FAIL;
				jdoerfert (unsubmitted, not done): If we need to wait for submit/retrieve, I'd assume we need to wait here too.

				return __tgt_rtl_synchronize(device_id, &AsyncInfo);
				}

				int32_t __tgt_rtl_run_target_team_region_async(
				int32_t device_id, void *tgt_entry_ptr, void **tgt_args,
				ptrdiff_t *tgt_offsets, int32_t arg_num, int32_t team_num,
				int32_t thread_limit, uint64_t loop_tripcount /*not used*/,
				__tgt_async_info *async_info) {
				ffi_cif *cif = new ffi_cif();

				// All args are references.
				std::shared_ptr<std::vector<ffi_type *>> args_types =
				std::make_shared<std::vector<ffi_type *>>(arg_num, &ffi_type_pointer);
				std::shared_ptr<std::vector<void *>> args =
				std::make_shared<std::vector<void *>>(arg_num);
				std::shared_ptr<std::vector<void *>> ptrs =
				std::make_shared<std::vector<void *>>(arg_num);

				for (int32_t i = 0; i < arg_num; ++i) {
				(*ptrs)[i] = (void *)((intptr_t)tgt_args[i] + tgt_offsets[i]);
				(*args)[i] = &(*ptrs)[i];
				}

				ffi_status status = ffi_prep_cif(cif, FFI_DEFAULT_ABI, arg_num,
				&ffi_type_void, &(*args_types)[0]);

				assert(status == FFI_OK && "Unable to prepare target launch!");

				if (status != FFI_OK)
				return OFFLOAD_FAIL;

				DP("Running entry point at " DPxMOD "...\n", DPxPTR(tgt_entry_ptr));

				void (*entry)(void);
				*((void **)&entry) = tgt_entry_ptr;

				VGPU.scheduleAsync(
				async_info, cif,
				[&]() {
				ffi_call(cif, entry, NULL, &(*args)[0]);
				&(args_types);
				},
				team_num);
				VGPU.await(async_info);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_run_target_region(int32_t device_id, void *tgt_entry_ptr,
				void **tgt_args, ptrdiff_t *tgt_offsets,
				int32_t arg_num) {
				return __tgt_rtl_run_target_team_region(device_id, tgt_entry_ptr, tgt_args,
				tgt_offsets, arg_num, 1, 1, 0);
				}

				int32_t __tgt_rtl_run_target_region_async(int32_t device_id,
				void *tgt_entry_ptr, void **tgt_args,
				ptrdiff_t *tgt_offsets,
				int32_t arg_num,
				__tgt_async_info *async_info) {
				return __tgt_rtl_run_target_team_region_async(device_id, tgt_entry_ptr,
				tgt_args, tgt_offsets, arg_num,
				1, 1, 0, async_info);
				}

				#ifdef __cplusplus
				}
				#endif
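
The innermost loop of the worker lambda in the VGPUTy constructor distributes the NumTeams teams of a kernel over the NumCTAs CTAs in waves of NumCTAs teams. The following is a minimal sketch of that mapping as a standalone helper, in the spirit of the review request to split the lambda into helper functions. The name teamsForCTA is hypothetical and does not appear in the patch. The sketch also guards on the computed team index rather than on the CTA id as the patch does; that is an assumption about the intended behavior when NumTeams is not a multiple of NumCTAs, not something the patch or the review states.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the patch): returns the team indices that
// the CTA with id CtaId would execute when NumTeams teams are distributed
// over NumCTAs CTAs, one wave of NumCTAs teams at a time.
std::vector<int> teamsForCTA(int CtaId, int NumCTAs, int NumTeams) {
  assert(NumCTAs > 0 && CtaId >= 0 && CtaId < NumCTAs && "bad CTA id");
  std::vector<int> Teams;
  for (int TeamIdx = 0; TeamIdx < NumTeams; TeamIdx += NumCTAs) {
    int Team = TeamIdx + CtaId;
    // Assumption: guard on the computed team index so that the last,
    // partially filled wave does not run teams at or past NumTeams.
    if (Team < NumTeams)
      Teams.push_back(Team);
  }
  return Teams;
}
```

In the plugin, every thread of a CTA would still have to reach the barrier once per wave, whether or not its CTA has a team to run in that wave, so the barrier calls would stay outside such a helper.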

openmp/libomptarget/src/rtl.cpp

(24 lines not shown)
 static const char *RTLNames[] = {
     /* PowerPC target */ "libomptarget.rtl.ppc64.so",
     /* x86_64 target */ "libomptarget.rtl.x86_64.so",
     /* CUDA target */ "libomptarget.rtl.cuda.so",
     /* AArch64 target */ "libomptarget.rtl.aarch64.so",
     /* SX-Aurora VE target */ "libomptarget.rtl.ve.so",
     /* AMDGPU target */ "libomptarget.rtl.amdgpu.so",
     /* Remote target */ "libomptarget.rtl.rpc.so",
+    /* Virtual GPU target */ "libomptarget.rtl.vgpu.so",
 };
jdoerfert (unsubmitted, not done): Introduce an environment variable, if it is set, X86 target should skip the image. Also, add a TODO such that we later look into the image and inspect it to decide automatically.

 PluginManager *PM;

 #if OMPTARGET_PROFILE_ENABLED
 static char *ProfileTraceFile = nullptr;
 #endif

 __attribute__((constructor(101))) void init() {
(32 lines not shown; hunk context: void RTLsTy::LoadRTLs() {)
   }

   DP("Loading RTLs...\n");

   // Attempt to open all the plugins and, if they exist, check if the interface
   // is correct and if they are supporting any devices.
   for (auto *Name : RTLNames) {
     DP("Loading library '%s'...\n", Name);
-    void *dynlib_handle = dlopen(Name, RTLD_NOW);
+    int Flags = RTLD_NOW;
+
+    if (strcmp(Name, "libomptarget.rtl.vgpu.so") == 0)
+      Flags |= RTLD_GLOBAL;
+
+    void *dynlib_handle = dlopen(Name, Flags);

     if (!dynlib_handle) {
       // Library does not exist or cannot be found.
jdoerfert (unsubmitted, done): Not only x86, also let's not do strcmp. Extend RTLNames to be an array of structs with more elaborate information, e.g., is host flag. That said, unsure if not loading the plugin is the right way to not grab the image. Good enough for now.
       DP("Unable to load library '%s': %s!\n", Name, dlerror());
       continue;
     }

     DP("Successfully loaded library '%s'!\n", Name);

     AllRTLs.emplace_back();
Revision Contents

Diff 386426

clang/lib/Basic/Targets/X86.h

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp

clang/lib/CodeGen/CodeGenModule.cpp

clang/lib/Driver/ToolChains/Gnu.cpp

clang/lib/Frontend/CompilerInvocation.cpp

llvm/include/llvm/ADT/Triple.h

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h

llvm/lib/Support/Triple.cpp

openmp/CMakeLists.txt

openmp/libomptarget/DeviceRTL/CMakeLists.txt

openmp/libomptarget/DeviceRTL/src/Debug.cpp

openmp/libomptarget/DeviceRTL/src/Kernel.cpp

openmp/libomptarget/DeviceRTL/src/Mapping.cpp

openmp/libomptarget/DeviceRTL/src/Misc.cpp

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

openmp/libomptarget/DeviceRTL/src/Utils.cpp

openmp/libomptarget/plugins/CMakeLists.txt

openmp/libomptarget/plugins/vgpu/CMakeLists.txt

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.h

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h

openmp/libomptarget/plugins/vgpu/src/rtl.cpp

openmp/libomptarget/src/rtl.cpp
