This is an archive of the discontinued LLVM Phabricator instance.

Differential D113359

[Libomptarget][WIP] Introduce VGPU Plugin
Accepted · Public

Authored by atmnpatel on Nov 6 2021, 7:41 PM.

Details

Reviewers

jdoerfert
tianshilei1992
JonChesterfield

Summary

This patch introduces a virtual GPU (x86) plugin that emulates the GPU
environment on the host. It reuses the same execution model,
compilation paths, and runtimes as a physical GPU. The number of
threads, warps, and CTAs are set through the environment variables
VGPU_{NUM_THREADS,NUM_WARPS,WARPS_PER_CTA}, respectively.
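
As a rough illustration of how a plugin might pick up these settings, the sketch below reads the three variables with std::getenv. The VGPUConfig struct, the readEnvOr helper, and the fallback value of 1 are assumptions made for this example and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical container for the values named in the summary.
struct VGPUConfig {
  uint32_t NumThreads;  // from VGPU_NUM_THREADS
  uint32_t NumWarps;    // from VGPU_NUM_WARPS
  uint32_t WarpsPerCTA; // from VGPU_WARPS_PER_CTA
};

// Read a positive integer from the environment. Falls back to Default when
// the variable is unset, empty, not a plain decimal number, zero, or too large.
static uint32_t readEnvOr(const char *Name, uint32_t Default) {
  assert(Name && "environment variable name must not be null");
  const char *Value = std::getenv(Name);
  if (!Value || *Value == '\0')
    return Default;
  char *End = nullptr;
  unsigned long Parsed = std::strtoul(Value, &End, 10);
  if (*End != '\0' || Parsed == 0 || Parsed > UINT32_MAX)
    return Default;
  return static_cast<uint32_t>(Parsed);
}

// Assumed default of 1 for every setting; the patch may choose differently.
VGPUConfig loadVGPUConfig() {
  return {readEnvOr("VGPU_NUM_THREADS", 1), readEnvOr("VGPU_NUM_WARPS", 1),
          readEnvOr("VGPU_WARPS_PER_CTA", 1)};
}
```

Rejecting zero here is a deliberate choice in the sketch: a zero count would make any later division or stride computation based on it ill-defined.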

Known Bugs:

There is UB somewhere in the DeviceRTL that occasionally sets stride to zero, causing an FPE crash.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

atmnpatel created this revision. Nov 6 2021, 7:41 PM

Herald added subscribers: ormris, dexonsmith, pengfei and 2 others. Nov 6 2021, 7:41 PM

atmnpatel requested review of this revision. Nov 6 2021, 7:41 PM

Herald added projects: Restricted Project, Restricted Project, Restricted Project. Nov 6 2021, 7:41 PM

Herald added subscribers: llvm-commits, openmp-commits, cfe-commits, sstefan1.

Harbormaster completed remote builds in B132881: Diff 385318. Nov 6 2021, 10:12 PM

jdoerfert added inline comments. Nov 8 2021, 7:59 AM

clang/lib/CodeGen/CGOpenMPRuntimeVirtualGPU.cpp
54 ↗	(On Diff #385318)	We should be able to get rid of this file (and the cuda/hip) version. Might be the right time now as a precommit.
llvm/include/llvm/ADT/Triple.h
169	Let's call it OpenMP_VGPU or something like that to make it clear.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
2177 ↗	(On Diff #385318)	@tianshilei1992 This needs a test.
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
107 ↗	(On Diff #385318)	I don't think we should do this. Instead, the plugin should signal as threads finish the kernel.
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
238	We probably should use kind(CPU) or something instead. Nothing x86 specific about it I think.
openmp/libomptarget/include/DeviceEnvironment.h
83 ↗	(On Diff #385318)	This should go into a new file (ThreadEnvironment)

I removed the shared var opt - might be best to keep this in a separate patch @tianshilei1992. Also addressed comments.

small nit fix

Harbormaster completed remote builds in B133662: Diff 386426. Nov 10 2021, 11:49 PM

tianshilei1992 added inline comments. Nov 11 2021, 9:04 AM

clang/lib/Driver/ToolChains/Gnu.cpp
3077	Maybe `"x86_64-openmp_vpu"` now?
llvm/lib/Support/Triple.cpp
189	`"openmp_vpu"`?

I can't see it in the diff - does the cmake somewhere enable the existing tests on this new target?

A bit surprised to see ffi involved, are we thinking of spawning a separate process for the target?

clang/lib/Basic/Targets/X86.h
49	It's not clear to me why this is x86 specific. Being able to run our tests on power / arm etc. seems like an advantage. It would also mean we avoid adding openmp stuff to the x86-specific files. Maybe OpenMPVGPUAddrSpaceMap, and put it in one of the openmp source files?
clang/lib/Frontend/CompilerInvocation.cpp
3990	Add a isOpenmpVGPU function?
openmp/libomptarget/DeviceRTL/CMakeLists.txt
140	Should only add this include to the vgpu, not all the plugins. May be able to use relative include paths to drop it entirely.

Fixed lifetime issue around ffi_call
Addressed comments

The existing x86 plugin uses ffi, so this does as well, no explicit benefit in doing so. Is it worth keeping?

Harbormaster completed remote builds in B142248: Diff 398370. Jan 8 2022, 2:06 PM

jdoerfert added inline comments. Jan 10 2022, 7:00 AM

llvm/lib/Support/Triple.cpp
512
openmp/libomptarget/DeviceRTL/src/Debug.cpp
53
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
128 ↗	(On Diff #398370)
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
29
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
291
315	Pass the memory order, also rename the arguments to match the coding convention.
318	Pass the mask
openmp/libomptarget/DeviceRTL/src/Utils.cpp
54
66	Can't we merge this with AMDGPU?
118
openmp/libomptarget/plugins/vgpu/src/rtl.cpp
304	Can we split this up and create some helper functions maybe?
openmp/libomptarget/src/rtl.cpp
27	Introduce an environment variable, if it is set, X86 target should skip the image. Also, add a TODO such that we later look into the image and inspect it to decide automatically.
openmp/libomptarget/test/lit.cfg
189 ↗	(On Diff #398370)	Leftovers.

tianshilei1992 added inline comments. Jan 12 2022, 6:55 AM

openmp/libomptarget/DeviceRTL/CMakeLists.txt
236	It's not a good practice to specify include directories in CMake in this way. Use `include_directories` instead.
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
127 ↗	(On Diff #398370)	Is this code here unintentional? We don't need to specialize this function for vgpu IIRC.

ormris removed a subscriber: ormris. Jan 18 2022, 10:04 AM

jdoerfert added inline comments. Jan 18 2022, 12:50 PM

openmp/libomptarget/DeviceRTL/src/Kernel.cpp
127 ↗	(On Diff #398370)	we might be able to avoid it if we move the synchronize::threads "effect" into the VGPU instead.

atmnpatel edited the summary of this revision. Jan 18 2022, 11:32 PM

Addressed comments

atmnpatel added inline comments. Jan 18 2022, 11:36 PM

openmp/libomptarget/DeviceRTL/CMakeLists.txt
236	can't quite do that here I think, afaik both `include_directories` and `target_include_directories` require that CMake builds the target, but we specify custom targets/build commands so they don't get pulled in

Harbormaster completed remote builds in B144209: Diff 401112. Jan 19 2022, 12:44 AM

jdoerfert added inline comments. Feb 2 2022, 9:45 AM

clang/lib/Driver/ToolChains/Gnu.cpp
3077	not x86, right? triple contains the proper arch
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
30	Move up to the beginning.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
292	Move up.
343	We should simply use omp locks. Either here, or maybe better, in VGPUImpl. So redirect all calls to there and use a proper lock. no OMP_SPIN and stuff
openmp/libomptarget/DeviceRTL/src/Utils.cpp
119	Move up
128	Pass the mask, both times.
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp
50	see above.
openmp/libomptarget/src/rtl.cpp
112	Not only x86; also, let's not do strcmp. Extend RTLNames to be an array of structs with more elaborate information, e.g., an is-host flag. That said, I'm unsure whether not loading the plugin is the right way to avoid grabbing the image. Good enough for now.
openmp/libomptarget/test/CMakeLists.txt
23	This is to disable the tests? Not sure this is a good way though. For one, can we check against -vgpu not x86, also openmp-vgpu or something, right?
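
The RTLNames suggestion above could look roughly like the following sketch. RTLDescriptor, RTLTable, selectPlugins, and the library names listed are hypothetical and only illustrate replacing string comparisons with a per-plugin host flag.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical descriptor replacing a bare list of plugin library names.
struct RTLDescriptor {
  const char *LibraryName; // shared object the runtime would try to load
  bool IsHost;             // true if the plugin runs images directly on the host
};

// Illustrative table; these entries are assumptions for the example only.
static const RTLDescriptor RTLTable[] = {
    {"libomptarget.rtl.x86_64.so", /*IsHost=*/true},
    {"libomptarget.rtl.vgpu.so", /*IsHost=*/false},
    {"libomptarget.rtl.cuda.so", /*IsHost=*/false},
};

// Return the plugins to load, optionally skipping host plugins so that they
// cannot claim an image intended for the virtual GPU.
std::vector<std::string> selectPlugins(bool SkipHostPlugins) {
  std::vector<std::string> Selected;
  for (const RTLDescriptor &D : RTLTable) {
    assert(D.LibraryName && "every descriptor must name a library");
    if (SkipHostPlugins && D.IsHost)
      continue; // decided by the flag, no strcmp on the name
    Selected.emplace_back(D.LibraryName);
  }
  return Selected;
}
```

Keeping the decision in a flag means additional host-like plugins only need a table entry rather than another name comparison.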

updates

openmp/libomptarget/test/CMakeLists.txt
23	Yep

Harbormaster completed remote builds in B147239: Diff 405407. Feb 2 2022, 4:55 PM

LG, with some things to address before the merge though.

Didn't we have a pass to expand shared memory (and such)?

clang/lib/Basic/TargetInfo.cpp
155	use isOpenMPVGPU
clang/lib/Basic/Targets/X86.h
395	Do we need the changes in this file at all? I couldn't see why.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1125	isOpenMPVGPU
clang/lib/CodeGen/CodeGenModule.cpp
252	isOpenMPVGPU
clang/lib/Driver/ToolChains/Gnu.cpp
3076	isOpenMPVGPU
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
323	Remove these. Also the TODO below (copied from somewhere)
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.cpp
85	This is racy, I think. Can we use atomic_add for all these Idx updates or pass the Id from the outside?
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h
118	At least add more information about what the problem and potential solutions are.
openmp/libomptarget/plugins/vgpu/src/rtl.cpp
271	Move the lambda into a helper function. Indentation of 12 is too much.
313	When do we have more threads than NumThreads?
554	if we need for submit/retrieve, I'd assume to wait here too.
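
The race noted for ThreadEnvironmentImpl.cpp can be avoided by handing out indices atomically. A minimal sketch, assuming a hypothetical IdAllocator class rather than the patch's actual data structures:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared index counter.
class IdAllocator {
  std::atomic<uint32_t> NextIdx{0};

public:
  // fetch_add returns the previous value, so concurrent callers each get a
  // distinct index without any additional locking.
  uint32_t acquire() { return NextIdx.fetch_add(1, std::memory_order_relaxed); }
};

// Spawn NumThreads workers that each claim one index, then return the
// claimed indices in ascending order.
std::vector<uint32_t> assignIds(unsigned NumThreads) {
  IdAllocator Alloc;
  std::vector<uint32_t> Ids(NumThreads);
  std::vector<std::thread> Workers;
  Workers.reserve(NumThreads);
  for (unsigned I = 0; I < NumThreads; ++I)
    Workers.emplace_back([&Alloc, &Ids, I] { Ids[I] = Alloc.acquire(); });
  for (std::thread &W : Workers)
    W.join();
  std::sort(Ids.begin(), Ids.end());
  // No index may have been handed out twice.
  assert(std::adjacent_find(Ids.begin(), Ids.end()) == Ids.end());
  return Ids;
}
```

The other option raised in the comment, passing the Id in from the outside, avoids shared state entirely and corresponds to the captured loop variable I in this sketch.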

This revision is now accepted and ready to land. Feb 3 2022, 9:26 AM

Not sure if it's good to merge such a large patch. We could potentially split the patch into three independent patches: tool chain, device runtime, and the OpenMPOpt pass to support expansion of shared variables (which for some reason is not included in this patch; it is actually a very important component, otherwise the backend will complain about it).

We can merge runtime first, build it in isolation, then libomptarget host runtime, then clang.

Also make sure to adjust the commit messages

dexonsmith removed a subscriber: dexonsmith. Feb 14 2022, 11:03 AM

@jdoerfert @tianshilei1992 @atmnpatel @dhruvachak

Is the target to get this merged in for LLVM 16? Does the VGPU implementation provide a way to support OMPT callbacks for various constructs (parallel, worksharing, barriers, etc.)?

Herald added a project: Restricted Project. Nov 8 2022, 10:13 AM

Herald added a subscriber: MaskRay.

Revision Contents

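Path                                                                    Size
clang/lib/Basic/TargetInfo.cpp                                          3 lines
clang/lib/Basic/Targets/X86.h                                           5 lines
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp                                9 lines
clang/lib/CodeGen/CodeGenModule.cpp                                     4 lines
clang/lib/Driver/ToolChains/Gnu.cpp                                     9 lines
clang/lib/Frontend/CompilerInvocation.cpp                               3 lines
llvm/include/llvm/ADT/Triple.h                                          6 lines
llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h                       32 lines
llvm/lib/Support/Triple.cpp                                             35 lines
openmp/CMakeLists.txt                                                   2 lines
openmp/libomptarget/DeviceRTL/CMakeLists.txt                            9 lines
openmp/libomptarget/DeviceRTL/include/ThreadEnvironment.h               11 lines
openmp/libomptarget/DeviceRTL/src/Debug.cpp                             9 lines
openmp/libomptarget/DeviceRTL/src/Mapping.cpp                           75 lines
openmp/libomptarget/DeviceRTL/src/Misc.cpp                              5 lines
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp                   59 lines
openmp/libomptarget/DeviceRTL/src/Utils.cpp                             24 lines
openmp/libomptarget/plugins/CMakeLists.txt                              1 line
openmp/libomptarget/plugins/vgpu/CMakeLists.txt                         74 lines
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.h                73 lines
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp              117 lines
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h            137 lines
openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.cpp          171 lines
openmp/libomptarget/plugins/vgpu/src/rtl.cpp                            615 lines
openmp/libomptarget/src/rtl.cpp                                         49 lines
openmp/libomptarget/test/CMakeLists.txt                                 3 lines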

Diff 405407

clang/lib/Basic/TargetInfo.cpp

@@ TargetInfo::TargetInfo(const llvm::Triple &T) : Triple(T) {
   UseAddrSpaceMapMangling = false;

   // Default to an unknown platform name.
   PlatformName = "unknown";
   PlatformMinVersion = VersionTuple();

   MaxOpenCLWorkGroupSize = 1024;
   ProgramAddrSpace = 0;
+
+  if (Triple.getVendor() == llvm::Triple::OpenMP_VGPU)
  [inline comment] jdoerfert (Not Done): use isOpenMPVGPU
+    AddrSpaceMap = &llvm::omp::OpenMPVGPUAddrSpaceMap;
 }

 // Out of line virtual dtor for TargetInfo.
 TargetInfo::~TargetInfo() {}

 void TargetInfo::resetDataLayout(StringRef DL, const char *ULP) {
   DataLayoutString = DL.str();
   UserLabelPrefix = ULP;

clang/lib/Basic/Targets/X86.h

@@
 #ifndef LLVM_CLANG_LIB_BASIC_TARGETS_X86_H
 #define LLVM_CLANG_LIB_BASIC_TARGETS_X86_H

 #include "OSTargets.h"
 #include "clang/Basic/TargetInfo.h"
 #include "clang/Basic/TargetOptions.h"
 #include "llvm/ADT/Triple.h"
+#include "llvm/Frontend/OpenMP/OMPGridValues.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/X86TargetParser.h"

 namespace clang {
 namespace targets {

 static const unsigned X86AddrSpaceMap[] = {
     0,   // Default
@@ static const unsigned X86AddrSpaceMap[] = {
     0,   // sycl_global_host
     0,   // sycl_local
     0,   // sycl_private
     270, // ptr32_sptr
     271, // ptr32_uptr
     272  // ptr64
 };

 // X86 target abstract base class; x86-32 and x86-64 are very close, so
  [inline comment] JonChesterfield (Done): It's not clear to me why this is x86 specific. Being able to run our tests on power / arm etc. seems like an advantage. It would also mean we avoid adding openmp stuff to the x86-specific files. Maybe OpenMPVGPUAddrSpaceMap, and put it in one of the openmp source files?
 // most of the implementation can be shared.
 class LLVM_LIBRARY_VISIBILITY X86TargetInfo : public TargetInfo {

   enum X86SSEEnum {
     NoSSE,
     SSE1,
     SSE2,
     SSE3,
@@ uint64_t getPointerWidthV(unsigned AddrSpace) const override {
     if (AddrSpace == ptr64)
       return 64;
     return PointerWidth;
   }

   uint64_t getPointerAlignV(unsigned AddrSpace) const override {
     return getPointerWidthV(AddrSpace);
   }
+
+  const llvm::omp::GV &getGridValue() const override {
+    return llvm::omp::VirtualGpuGridValues;
+  }
  [inline comment] jdoerfert (Not Done): Do we need the changes in this file at all? I couldn't see why.
 };

 // X86-32 generic target
 class LLVM_LIBRARY_VISIBILITY X86_32TargetInfo : public X86TargetInfo {
 public:
   X86_32TargetInfo(const llvm::Triple &Triple, const TargetOptions &Opts)
       : X86TargetInfo(Triple, Opts) {
     DoubleAlign = LongLongAlign = 32;

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp

@@ auto *GVMode = new llvm::GlobalVariable(
       CGM.getModule(), CGM.Int8Ty, /*isConstant=*/true,
       llvm::GlobalValue::WeakAnyLinkage,
       llvm::ConstantInt::get(CGM.Int8Ty, Mode ? OMP_TGT_EXEC_MODE_SPMD
                                               : OMP_TGT_EXEC_MODE_GENERIC),
       Twine(Name, "_exec_mode"));
   CGM.addCompilerUsedGlobal(GVMode);
 }

-void CGOpenMPRuntimeGPU::createOffloadEntry(llvm::Constant *ID,
-                                            llvm::Constant *Addr,
-                                            uint64_t Size, int32_t,
-                                            llvm::GlobalValue::LinkageTypes) {
+void CGOpenMPRuntimeGPU::createOffloadEntry(
+    llvm::Constant *ID, llvm::Constant *Addr, uint64_t Size, int32_t Flags,
+    llvm::GlobalValue::LinkageTypes Linkage) {
+  if (CGM.getTarget().getTriple().getVendor() == llvm::Triple::OpenMP_VGPU)
  [inline comment] jdoerfert (Not Done): isOpenMPVGPU
+    return CGOpenMPRuntime::createOffloadEntry(ID, Addr, Size, Flags, Linkage);
   // TODO: Add support for global variables on the device after declare target
   // support.
   llvm::Function *Fn = dyn_cast<llvm::Function>(Addr);
   if (!Fn)
     return;

   llvm::Module &M = CGM.getModule();
   llvm::LLVMContext &Ctx = CGM.getLLVMContext();

clang/lib/CodeGen/CodeGenModule.cpp

@@ void CodeGenModule::createOpenMPRuntime() {
   case llvm::Triple::nvptx:
   case llvm::Triple::nvptx64:
   case llvm::Triple::amdgcn:
     assert(getLangOpts().OpenMPIsDevice &&
            "OpenMP AMDGPU/NVPTX is only prepared to deal with device code.");
     OpenMPRuntime.reset(new CGOpenMPRuntimeGPU(*this));
     break;
   default:
-    if (LangOpts.OpenMPSimd)
+    if (getTriple().getVendor() == llvm::Triple::OpenMP_VGPU)
  [inline comment] jdoerfert (Not Done): isOpenMPVGPU
+      OpenMPRuntime.reset(new CGOpenMPRuntimeGPU(*this));
+    else if (LangOpts.OpenMPSimd)
       OpenMPRuntime.reset(new CGOpenMPSIMDRuntime(*this));
     else
       OpenMPRuntime.reset(new CGOpenMPRuntime(*this));
     break;
   }
 }

 void CodeGenModule::createCUDARuntime() {

clang/lib/Driver/ToolChains/Gnu.cpp

@@
 void Generic_ELF::anchor() {}

 void Generic_ELF::addClangTargetOptions(const ArgList &DriverArgs,
                                         ArgStringList &CC1Args,
                                         Action::OffloadKind) const {
   if (!DriverArgs.hasFlag(options::OPT_fuse_init_array,
                           options::OPT_fno_use_init_array, true))
     CC1Args.push_back("-fno-use-init-array");
+
+  if (DriverArgs.hasArg(options::OPT_S))
+    return;
+
+  if (getTriple().getVendor() == llvm::Triple::OpenMP_VGPU) {
  [inline comment] jdoerfert (Not Done): isOpenMPVGPU
+    std::string BitcodeSuffix = getTripleString() + "-openmp_vgpu";
  [inline comment] tianshilei1992 (Not Done): Maybe `"x86_64-openmp_vpu"` now?
  [inline comment] jdoerfert (Done): not x86, right? triple contains the proper arch
+    clang::driver::tools::addOpenMPDeviceRTL(getDriver(), DriverArgs, CC1Args,
+                                             BitcodeSuffix, getTriple());
+  }
 }

clang/lib/Frontend/CompilerInvocation.cpp

@@ #undef LANG_OPTION_WITH_MARSHALLING
   if (Arg *A = Args.getLastArg(options::OPT_fopenmp_host_ir_file_path)) {
     Opts.OMPHostIRFile = A->getValue();
     if (!llvm::sys::fs::exists(Opts.OMPHostIRFile))
       Diags.Report(diag::err_drv_omp_host_ir_file_not_found)
           << Opts.OMPHostIRFile;
   }

   // Set CUDA mode for OpenMP target NVPTX/AMDGCN if specified in options
-  Opts.OpenMPCUDAMode = Opts.OpenMPIsDevice && (T.isNVPTX() || T.isAMDGCN()) &&
+  Opts.OpenMPCUDAMode = Opts.OpenMPIsDevice &&
+                        (T.isNVPTX() || T.isAMDGCN() || T.isOpenMPVGPU()) &&
                         Args.hasArg(options::OPT_fopenmp_cuda_mode);
  [inline comment] JonChesterfield (Done): Add a isOpenmpVGPU function?

   // Set CUDA mode for OpenMP target NVPTX/AMDGCN if specified in options
   Opts.OpenMPCUDAForceFullRuntime =
       Opts.OpenMPIsDevice && (T.isNVPTX() || T.isAMDGCN()) &&
       Args.hasArg(options::OPT_fopenmp_cuda_force_full_runtime);

   // FIXME: Eliminate this dependency.
   unsigned Opt = getOptimizationLevel(Args, IK, Diags),

llvm/include/llvm/ADT/Triple.h

@@ enum VendorType {
     MipsTechnologies,
     NVIDIA,
     CSR,
     Myriad,
     AMD,
     Mesa,
     SUSE,
     OpenEmbedded,
-    LastVendorType = OpenEmbedded
+    OpenMP_VGPU,
+    LastVendorType = OpenMP_VGPU
  [inline comment] jdoerfert (Done): Let's call it OpenMP_VGPU or something like that to make it clear.
   };
   enum OSType {
     UnknownOS,

     Ananas,
     CloudABI,
     Darwin,
     DragonFly,
@@ bool isSPIRV() const {
     return getArch() == Triple::spirv32 || getArch() == Triple::spirv64;
   }

   /// Tests whether the target is NVPTX (32- or 64-bit).
   bool isNVPTX() const {
     return getArch() == Triple::nvptx || getArch() == Triple::nvptx64;
   }

+  /// Tests whether the target is OpenMP VGPU.
+  bool isOpenMPVGPU() const { return getVendor() == llvm::Triple::OpenMP_VGPU; }
+
   /// Tests whether the target is AMDGCN
   bool isAMDGCN() const { return getArch() == Triple::amdgcn; }

   bool isAMDGPU() const {
     return getArch() == Triple::r600 || getArch() == Triple::amdgcn;
   }

   /// Tests whether the target is Thumb (little and big endian).

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h

@@ static constexpr GV NVPTXGridValues = {
     256,  // GV_Slot_Size
     32,   // GV_Warp_Size
     1024, // GV_Max_Teams
     896,  // GV_SimpleBufferSize
     1024, // GV_Max_WG_Size
     128,  // GV_Default_WG_Size
 };

+/// For Virtual GPUs
+static constexpr GV VirtualGpuGridValues = {
+    256,  // GV_Slot_Size
+    32,   // GV_Warp_Size
+    1024, // GV_Max_Teams
+    896,  // GV_SimpleBufferSize
+    1024, // GV_Max_WG_Size
+    128,  // GV_Defaut_WG_Size
+};
+
+static const unsigned OpenMPVGPUAddrSpaceMap[] = {
+    0,   // Default
+    1,   // opencl_global
+    3,   // opencl_local
+    4,   // opencl_constant
+    0,   // opencl_private
+    0,   // opencl_generic
+    1,   // opencl_global_device
+    1,   // opencl_global_host
+    1,   // cuda_device
+    4,   // cuda_constant
+    3,   // cuda_shared
+    1,   // sycl_global
+    0,   // sycl_global_device
+    0,   // sycl_global_host
+    3,   // sycl_local
+    0,   // sycl_private
+    270, // ptr32_sptr
+    271, // ptr32_uptr
+    272  // ptr64
+};
+
 } // namespace omp
 } // namespace llvm

 #endif // LLVM_FRONTEND_OPENMP_OMPGRIDVALUES_H

llvm/lib/Support/Triple.cpp

@@ StringRef Triple::getVendorTypeName(VendorType Kind) {
   case Mesa: return "mesa";
   case MipsTechnologies: return "mti";
   case Myriad: return "myriad";
   case NVIDIA: return "nvidia";
   case OpenEmbedded: return "oe";
   case PC: return "pc";
   case SCEI: return "scei";
   case SUSE: return "suse";
+  case OpenMP_VGPU:
+    return "openmp_vgpu";
  [inline comment] tianshilei1992 (Done): `"openmp_vpu"`?
   }

   llvm_unreachable("Invalid VendorType!");
 }

 StringRef Triple::getOSTypeName(OSType Kind) {
   switch (Kind) {
   case UnknownOS: return "unknown";

@@ if (ArchName.startswith("bpf"))
       return parseBPFArch(ArchName);
   }

   return AT;
 }

 static Triple::VendorType parseVendor(StringRef VendorName) {
   return StringSwitch<Triple::VendorType>(VendorName)
     .Case("apple", Triple::Apple)
     .Case("pc", Triple::PC)
     .Case("scei", Triple::SCEI)
     .Case("sie", Triple::SCEI)
     .Case("fsl", Triple::Freescale)
     .Case("ibm", Triple::IBM)
     .Case("img", Triple::ImaginationTechnologies)
     .Case("mti", Triple::MipsTechnologies)
     .Case("nvidia", Triple::NVIDIA)
     .Case("csr", Triple::CSR)
     .Case("myriad", Triple::Myriad)
     .Case("amd", Triple::AMD)
     .Case("mesa", Triple::Mesa)
     .Case("suse", Triple::SUSE)
     .Case("oe", Triple::OpenEmbedded)
+    .Case("openmp_vgpu", Triple::OpenMP_VGPU)
  [inline comment] jdoerfert (Done): suggested edit replacing `.Case("vgpu", Triple::OpenMP_VGPU)` with `.Case("openmp_vgpu", Triple::OpenMP_VGPU)`
     .Default(Triple::UnknownVendor);
 }

 static Triple::OSType parseOS(StringRef OSName) {
   return StringSwitch<Triple::OSType>(OSName)
     .StartsWith("ananas", Triple::Ananas)
     .StartsWith("cloudabi", Triple::CloudABI)
     .StartsWith("darwin", Triple::Darwin)
     .StartsWith("dragonfly", Triple::DragonFly)

openmp/CMakeLists.txt

@@ else()
   if (NOT MSVC)
     set(OPENMP_TEST_C_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang)
     set(OPENMP_TEST_CXX_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang++)
   else()
     set(OPENMP_TEST_C_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang.exe)
     set(OPENMP_TEST_CXX_COMPILER ${LLVM_RUNTIME_OUTPUT_INTDIR}/clang++.exe)
   endif()
+
+  list(APPEND LIBOMPTARGET_LLVM_INCLUDE_DIRS ${LLVM_MAIN_INCLUDE_DIR} ${LLVM_BINARY_DIR}/include)
 endif()

 # Check and set up common compiler flags.
 include(config-ix)
 include(HandleOpenMPOptions)

 # Set up testing infrastructure.
 include(OpenMPTesting)

openmp/libomptarget/DeviceRTL/CMakeLists.txt

Show First 20 Lines • Show All 131 Lines • ▼ Show 20 Lines
# Set flags for LLVM Bitcode compilation.		# Set flags for LLVM Bitcode compilation.
set(bc_flags -S -x c++ -std=c++17 -fvisibility=hidden		set(bc_flags -S -x c++ -std=c++17 -fvisibility=hidden
${clang_opt_flags}		${clang_opt_flags}
-Xclang -emit-llvm-bc		-Xclang -emit-llvm-bc
-Xclang -aux-triple -Xclang ${aux_triple}		-Xclang -aux-triple -Xclang ${aux_triple}
-fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device		-fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device
-I${include_directory}		-I${include_directory}
-I${devicertl_base_directory}/../include		-I${devicertl_base_directory}/../include
${LIBOMPTARGET_LLVM_INCLUDE_DIRS_DEVICERTL}		${LIBOMPTARGET_LLVM_INCLUDE_DIRS_DEVICERTL}
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions Should only add this include to the vgu, not all the plugins. May be able to use relative include paths to drop it entirely JonChesterfield: Should only add this include to the vgu, not all the plugins. May be able to use relative…
)		)

if(${LIBOMPTARGET_DEVICE_DEBUG})		if(${LIBOMPTARGET_DEVICE_DEBUG})
list(APPEND bc_flags -DOMPTARGET_DEBUG=-1)		list(APPEND bc_flags -DOMPTARGET_DEBUG=-1)
else()		else()
list(APPEND bc_flags -DOMPTARGET_DEBUG=0)		list(APPEND bc_flags -DOMPTARGET_DEBUG=0)
endif()		endif()

function(compileDeviceRTLLibrary target_cpu target_name)		function(compileDeviceRTLLibrary target_cpu target_name)
set(target_bc_flags ${ARGN})		set(target_bc_flags ${ARGN})

set(bc_files "")		set(bc_files "")
foreach(src ${src_files})		foreach(src ${src_files})
get_filename_component(infile ${src} ABSOLUTE)		get_filename_component(infile ${src} ABSOLUTE)
get_filename_component(outfile ${src} NAME)		get_filename_component(outfile ${src} NAME)
set(outfile "${outfile}-${target_cpu}.bc")		set(outfile "${outfile}-${target_cpu}.bc")

add_custom_command(OUTPUT ${outfile}		add_custom_command(OUTPUT ${outfile}
COMMAND ${CLANG_TOOL}		COMMAND ${CLANG_TOOL}
${bc_flags}
-Xclang -target-cpu -Xclang ${target_cpu}
${target_bc_flags}		${target_bc_flags}
		${bc_flags}
${infile} -o ${outfile}		${infile} -o ${outfile}
DEPENDS ${infile}		DEPENDS ${infile}
IMPLICIT_DEPENDS CXX ${infile}		IMPLICIT_DEPENDS CXX ${infile}
COMMENT "Building LLVM bitcode ${outfile}"		COMMENT "Building LLVM bitcode ${outfile}"
VERBATIM		VERBATIM
)		)
if("${CLANG_TOOL}" STREQUAL "$<TARGET_FILE:clang>")		if("${CLANG_TOOL}" STREQUAL "$<TARGET_FILE:clang>")
# Add a file-level dependency to ensure that clang is up-to-date.		# Add a file-level dependency to ensure that clang is up-to-date.
▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	add_custom_command(TARGET ${bclib_target_name} POST_BUILD
${LIBOMPTARGET_LIBRARY_DIR})		${LIBOMPTARGET_LIBRARY_DIR})

# Install bitcode library under the lib destination folder.		# Install bitcode library under the lib destination folder.
install(FILES ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name} DESTINATION "${OPENMP_INSTALL_LIBDIR}")		install(FILES ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name} DESTINATION "${OPENMP_INSTALL_LIBDIR}")
endfunction()		endfunction()

# Generate a Bitcode library for all the compute capabilities the user requested		# Generate a Bitcode library for all the compute capabilities the user requested
foreach(sm ${nvptx_sm_list})		foreach(sm ${nvptx_sm_list})
compileDeviceRTLLibrary(sm_${sm} nvptx -target nvptx64-nvidia-cuda -Xclang -target-feature -Xclang +ptx61 "-D__CUDA_ARCH__=${sm}0")		compileDeviceRTLLibrary(sm_${sm} nvptx -Xclang -target-cpu -Xclang sm_${sm} -target nvptx64-nvidia-cuda -Xclang -target-feature -Xclang +ptx61 "-D__CUDA_ARCH__=${sm}0")
endforeach()		endforeach()

foreach(mcpu ${amdgpu_mcpus})		foreach(mcpu ${amdgpu_mcpus})
compileDeviceRTLLibrary(${mcpu} amdgpu -target amdgcn-amd-amdhsa -D__AMDGCN__ -nogpulib)		compileDeviceRTLLibrary(${mcpu} amdgpu -Xclang -target-cpu -Xclang ${mcpu} -target amdgcn-amd-amdhsa -D__AMDGCN__ -nogpulib)
endforeach()		endforeach()

		compileDeviceRTLLibrary(x86_64 vgpu -target x86_64-vgpu -std=c++20 -I${devicertl_base_directory}/../plugins/vgpu/src)
		tianshilei1992 (Unsubmitted, Not Done): It's not a good practice to specify include directories in CMake in this way. Use `include_directories` instead.
		atmnpatel (Author, Unsubmitted, Done): Can't quite do that here, I think. AFAIK both `include_directories` and `target_include_directories` require that CMake builds the target, but we specify custom targets/build commands, so they don't get pulled in.

openmp/libomptarget/DeviceRTL/include/ThreadEnvironment.h

This file was added.

				//===--- ThreadEnvironment.h - OpenMP VGPU Dummy Header File ------ C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Dummy header file to avoid preprocessor errors in device runtime.
				//
				//===----------------------------------------------------------------------===//

openmp/libomptarget/DeviceRTL/src/Debug.cpp

(43 lines not shown)

#pragma omp begin declare variant match(device = {arch(amdgcn)})

namespace impl {

static int32_t omp_vprintf(const char *Format, void *Arguments, uint32_t) {

return -1;

}

} // namespace impl

#pragma omp end declare variant

#pragma omp begin declare variant match(device = {kind(cpu)})

int32_t vprintf(const char *, void *);

jdoerfert (Unsubmitted, Not Done):

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

int32_t vprintf(const char *, void *);

namespace impl {

static int32_t omp_vprintf(const char *Format, void *Arguments, uint32_t) {

return vprintf(Format, Arguments);

}

} // namespace impl

#pragma omp end declare variant

int32_t __llvm_omp_vprintf(const char *Format, void *Arguments, uint32_t Size) {

return impl::omp_vprintf(Format, Arguments, Size);

}

/// Current indentation level for the function trace. Only accessed by thread 0.

__attribute__((loader_uninitialized))

static uint32_t Level;

(25 lines not shown)

openmp/libomptarget/DeviceRTL/src/Mapping.cpp

(11 lines not shown)

#include "Mapping.h"

#include "Interface.h"

#include "State.h"

#include "Types.h"

#include "Utils.h"

#pragma omp declare target

#include "ThreadEnvironment.h"

#include "llvm/Frontend/OpenMP/OMPGridValues.h"

using namespace _OMP;

/// Virtual GPU Implementation

///

///{

#pragma omp begin declare variant match(device = {kind(cpu)})

jdoerfert (Unsubmitted, Not Done):

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

#include "ThreadEnvironment.h"

namespace _OMP {

jdoerfert (Unsubmitted, Done): Move up to the beginning.

namespace impl {

constexpr const llvm::omp::GV &getGridValue() {

return llvm::omp::VirtualGpuGridValues;

}

LaneMaskTy activemask() {

uint64_t B = 0;

uint32_t N = mapping::getWarpSize();

while (N)

B |= ((uint64_t)1 << (--N));

return B;

}

LaneMaskTy lanemaskLT() {

const uint32_t Lane = mapping::getThreadIdInWarp();

LaneMaskTy Ballot = mapping::activemask();

LaneMaskTy Mask = ((LaneMaskTy)1 << Lane) - (LaneMaskTy)1;

return Mask & Ballot;

}

LaneMaskTy lanemaskGT() {

const uint32_t Lane = mapping::getThreadIdInWarp();

if (Lane == (mapping::getWarpSize() - 1))

return 0;

LaneMaskTy Ballot = mapping::activemask();

LaneMaskTy Mask = (~((LaneMaskTy)0)) << (Lane + 1);

return Mask & Ballot;

}

uint32_t getThreadIdInWarp() {

return mapping::getThreadIdInBlock() & (mapping::getWarpSize() - 1);

}

uint32_t getThreadIdInBlock() {

return getThreadEnvironment()->getThreadIdInBlock();

}

uint32_t getNumHardwareThreadsInBlock() {

return getThreadEnvironment()->getBlockSize();

}

uint32_t getKernelSize() { return getThreadEnvironment()->getKernelSize(); }

uint32_t getBlockId() { return getThreadEnvironment()->getBlockId(); }

uint32_t getNumberOfBlocks() {

return getThreadEnvironment()->getNumberOfBlocks();

}

uint32_t getNumberOfProcessorElements() { return mapping::getBlockSize(); }

uint32_t getWarpId() {

return mapping::getThreadIdInBlock() / mapping::getWarpSize();

}

uint32_t getWarpSize() { return getThreadEnvironment()->getWarpSize(); }

uint32_t getNumberOfWarpsInBlock() {

return (mapping::getBlockSize() + mapping::getWarpSize() - 1) /

mapping::getWarpSize();

}

} // namespace impl

} // namespace _OMP

#pragma omp end declare variant
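
Editor's sketch: the lane-mask helpers in the virtual GPU variant above can be written as standalone host code that stays well defined for warp sizes up to 64. The names `fullWarpMask`, `laneMaskLT`, and `laneMaskGT` are hypothetical and not part of the patch; the sketch only assumes that `LaneMaskTy` is a 64-bit unsigned integer, as declared in the plugin's `ThreadEnvironment.h`.

```cpp
#include <cassert>
#include <cstdint>

using LaneMaskTy = uint64_t;

// Mask with the low WarpSize bits set. Assumes 1 <= WarpSize <= 64.
inline LaneMaskTy fullWarpMask(uint32_t WarpSize) {
  assert(WarpSize >= 1 && WarpSize <= 64 && "warp size out of range");
  // Shifting a 64-bit value by 64 is undefined, so handle that case explicitly.
  return WarpSize == 64 ? ~LaneMaskTy(0) : (LaneMaskTy(1) << WarpSize) - 1;
}

// Lanes whose index is strictly below Lane.
inline LaneMaskTy laneMaskLT(uint32_t Lane, uint32_t WarpSize) {
  assert(Lane < WarpSize && "lane out of range");
  return ((LaneMaskTy(1) << Lane) - 1) & fullWarpMask(WarpSize);
}

// Lanes whose index is strictly above Lane.
inline LaneMaskTy laneMaskGT(uint32_t Lane, uint32_t WarpSize) {
  assert(Lane < WarpSize && "lane out of range");
  if (Lane == WarpSize - 1)
    return 0; // the last lane has no higher neighbours
  return (~LaneMaskTy(0) << (Lane + 1)) & fullWarpMask(WarpSize);
}
```

Every shift operand is widened to 64 bits before shifting, which is the property a plain `1 << N` does not have once `N` reaches the width of `int`.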

namespace _OMP {

namespace impl {

/// AMDGCN Implementation

///

///{

#pragma omp begin declare variant match(device = {arch(amdgcn)})

(123 lines not shown)

uint32_t getWarpSize() { return getGridValue().GV_Warp_Size; }

} // namespace impl

} // namespace _OMP

/// We have to be deliberate about the distinction of `mapping::` and `impl::`

/// below to avoid repeating assumptions or including irrelevant ones.

///{

jdoerfert (Unsubmitted, Done): We probably should use kind(CPU) or something instead. Nothing x86 specific about it I think.

static bool isInLastWarp() {

uint32_t MainTId = (mapping::getNumberOfProcessorElements() - 1) &

~(mapping::getWarpSize() - 1);

return mapping::getThreadIdInBlock() == MainTId;

}

bool mapping::isMainThreadInGenericMode(bool IsSPMD) {

(115 lines not shown)

openmp/libomptarget/DeviceRTL/src/Misc.cpp

	(12 lines not shown)

	#include "Debug.h"			#include "Debug.h"

	#pragma omp declare target			#pragma omp declare target

	namespace _OMP {			namespace _OMP {
	namespace impl {			namespace impl {

	/// AMDGCN Implementation			/// Generic Implementation - AMDGCN, VGPU
	///			///
	///{			///{
	#pragma omp begin declare variant match(device = {arch(amdgcn)})

	double getWTick() { return ((double)1E-9); }			double getWTick() { return ((double)1E-9); }

	double getWTime() {			double getWTime() {
	// The intrinsics for measuring time have undocumented frequency			// The intrinsics for measuring time have undocumented frequency
	// This will probably need to be found by measurement on a number of			// This will probably need to be found by measurement on a number of
	// architectures. Until then, return 0, which is very inaccurate as a			// architectures. Until then, return 0, which is very inaccurate as a
	// timer but resolves the undefined symbol at link time.			// timer but resolves the undefined symbol at link time.
	return 0;			return 0;
	}			}

	#pragma omp end declare variant

	/// NVPTX Implementation			/// NVPTX Implementation
	///			///
	///{			///{
	#pragma omp begin declare variant match( \			#pragma omp begin declare variant match( \
	device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})			device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})

	double getWTick() {			double getWTick() {
	// Timer precision is 1ns			// Timer precision is 1ns
	(36 lines not shown)

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

(10 lines not shown)

//===----------------------------------------------------------------------===//

#include "Synchronization.h"

#include "Debug.h"

#include "Interface.h"

#include "Mapping.h"

#include "State.h"

#include "ThreadEnvironment.h"

#include "Types.h"

#include "Utils.h"

#pragma omp declare target

using namespace _OMP;

namespace impl {

(251 lines not shown)

void setLock(omp_lock_t *Lock) {

} // wait for 0 to be the read value

}

#pragma omp end declare variant

///}

} // namespace impl

/// Virtual GPU Implementation

///

///{

#pragma omp begin declare variant match(device = {kind(cpu)})

jdoerfert (Unsubmitted, Not Done):

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

#include "ThreadEnvironment.h"

namespace impl {

jdoerfert (Unsubmitted, Done): Move up.

uint32_t atomicInc(uint32_t *Address, uint32_t Val, int Ordering) {

return VGPUImpl::atomicInc(Address, Val, Ordering);

}

void namedBarrierInit() {}

void namedBarrier() {

uint32_t NumThreads = omp_get_num_threads();

ASSERT(NumThreads % mapping::getWarpSize() == 0);

getThreadEnvironment()->namedBarrier(true);

}

void fenceTeam(int Ordering) { getThreadEnvironment()->fenceTeam(Ordering); }

void fenceKernel(int Ordering) {

getThreadEnvironment()->fenceKernel(Ordering);

}

// Simply call fenceKernel because there is no need to sync with host

void fenceSystem(int Ordering) { fenceKernel(Ordering); }

void syncWarp(__kmpc_impl_lanemask_t Mask) {

jdoerfert (Unsubmitted, Done): Pass the memory order, also rename the arguments to match the coding convention.

getThreadEnvironment()->syncWarp(Mask);

}

jdoerfert (Unsubmitted, Done): Pass the mask.

void syncThreads() { getThreadEnvironment()->namedBarrier(false); }

constexpr uint32_t OMP_SPIN = 1000;

constexpr uint32_t UNSET = 0;

constexpr uint32_t SET = 1;

jdoerfert (Unsubmitted, Not Done): Remove these. Also the TODO below (copied from somewhere).

// TODO: This seems to hide a bug in the declare variant handling. If it is

// called before it is defined

// here the overload won't happen. Investigate later!

void unsetLock(omp_lock_t *Lock) { VGPUImpl::unsetLock((uint32_t *)Lock); }

int testLock(omp_lock_t *Lock) { return VGPUImpl::testLock((uint32_t *)Lock); }

void initLock(omp_lock_t *Lock) { VGPUImpl::initLock((uint32_t *)Lock); }

void destroyLock(omp_lock_t *Lock) { VGPUImpl::destroyLock((uint32_t *)Lock); }

void setLock(omp_lock_t *Lock) { VGPUImpl::setLock((uint32_t *)Lock); }

void syncThreadsAligned() {}

} // namespace impl

#pragma omp end declare variant

///}

jdoerfert (Unsubmitted, Not Done): We should simply use omp locks, either here or, maybe better, in VGPUImpl. So redirect all calls to there and use a proper lock. No OMP_SPIN and stuff.
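
Editor's sketch: one possible shape for the "proper lock" the comment above asks for, written as plain host C++. It is not code from the patch. `HostLockRegistry` and its methods are hypothetical names, and the design assumes the device-side lock stays an opaque 32-bit word whose address serves as a key, so that no host pointer has to fit into the word itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical registry: maps the address of a lock word to a host mutex.
class HostLockRegistry {
  std::mutex Guard; // protects the map itself
  std::unordered_map<const uint32_t *, std::unique_ptr<std::mutex>> Locks;

  std::mutex &lookup(const uint32_t *Handle) {
    std::lock_guard<std::mutex> G(Guard);
    auto It = Locks.find(Handle);
    assert(It != Locks.end() && "lock used before init or after destroy");
    return *It->second;
  }

public:
  void init(const uint32_t *Handle) {
    std::lock_guard<std::mutex> G(Guard);
    Locks[Handle] = std::make_unique<std::mutex>();
  }
  void destroy(const uint32_t *Handle) {
    std::lock_guard<std::mutex> G(Guard);
    Locks.erase(Handle);
  }
  // The registry guard is released before blocking on the user lock.
  void set(const uint32_t *Handle) { lookup(Handle).lock(); }
  void unset(const uint32_t *Handle) { lookup(Handle).unlock(); }
  bool test(const uint32_t *Handle) { return lookup(Handle).try_lock(); }
  std::size_t size() {
    std::lock_guard<std::mutex> G(Guard);
    return Locks.size();
  }
};
```

The sketch assumes a lock is never destroyed while another thread still uses it, which matches the usual contract for OpenMP locks. Whether a registry, or a wider lock type that can hold a pointer directly, is the better fit for the plugin is left open here.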

void synchronize::init(bool IsSPMD) {

if (!IsSPMD)

impl::namedBarrierInit();

}

void synchronize::warp(LaneMaskTy Mask) { impl::syncWarp(Mask); }

void synchronize::threads() { impl::syncThreads(); }

(117 lines not shown)

openmp/libomptarget/DeviceRTL/src/Utils.cpp

//===------- Utils.cpp - OpenMP device runtime utility functions -- C++ -*-===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

#include "Utils.h"

#include "Debug.h"

#include "Interface.h"

#include "Mapping.h"

#include "ThreadEnvironment.h"

#pragma omp declare target

using namespace _OMP;

namespace _OMP {

/// Helper to keep code alive without introducing a performance penalty.

__attribute__((used, retain, weak, optnone, cold)) void keepAlive() {

__kmpc_get_hardware_thread_id_in_block();

__kmpc_get_hardware_num_threads_in_block();

__kmpc_get_warp_size();

__kmpc_barrier_simple_spmd(nullptr, 0);

__kmpc_barrier_simple_generic(nullptr, 0);

}

} // namespace _OMP

namespace impl {

/// AMDGCN Implementation

/// AMDGCN/Generic Implementation

///

///{

#pragma omp begin declare variant match(device = {arch(amdgcn)})

void Unpack(uint64_t Val, uint32_t *LowBits, uint32_t *HighBits) {

static_assert(sizeof(unsigned long) == 8, "");

*LowBits = (uint32_t)(Val & 0x00000000FFFFFFFFUL);

*HighBits = (uint32_t)((Val & 0xFFFFFFFF00000000UL) >> 32);

}

uint64_t Pack(uint32_t LowBits, uint32_t HighBits) {

return (((uint64_t)HighBits) << 32) | (uint64_t)LowBits;

}

#pragma omp end declare variant

/// NVPTX Implementation

///

///{

#pragma omp begin declare variant match( \

device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})

jdoerfert (Unsubmitted, Not Done):

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

void Unpack(uint64_t Val, uint32_t *LowBits, uint32_t *HighBits) {

void Unpack(uint64_t Val, uint32_t *LowBits, uint32_t *HighBits) {

uint32_t LowBitsLocal, HighBitsLocal;

asm("mov.b64 {%0,%1}, %2;"

: "=r"(LowBitsLocal), "=r"(HighBitsLocal)

: "l"(Val));

*LowBits = LowBitsLocal;

*HighBits = HighBitsLocal;

}

uint64_t Pack(uint32_t LowBits, uint32_t HighBits) {

uint64_t Val;

jdoerfert (Unsubmitted, Not Done): Can't we merge this with AMDGPU?

asm("mov.b64 %0, {%1,%2};" : "=l"(Val) : "r"(LowBits), "r"(HighBits));

return Val;

}

#pragma omp end declare variant

/// AMDGCN Implementation

///

(31 lines not shown)

int32_t shuffleDown(uint64_t Mask, int32_t Var, uint32_t Delta, int32_t Width) {

int32_t T = ((mapping::getWarpSize() - Width) << 8) | 0x1f;

return __nvvm_shfl_sync_down_i32(Mask, Var, Delta, T);

}

#pragma omp end declare variant

} // namespace impl

/// Virtual GPU Implementation

///

///{

#pragma omp begin declare variant match(device = {kind(cpu)})

jdoerfert (Unsubmitted, Not Done):

#pragma omp begin declare variant match( \

- device = {kind(cpu)}, implementation = {extension(match_any)})

+ device = {kind(cpu)})

#include "ThreadEnvironment.h"

namespace impl {

jdoerfert (Unsubmitted, Done): Move up.

int32_t shuffle(uint64_t Mask, int32_t Var, int32_t SrcLane) {

return getThreadEnvironment()->shuffle(Mask, Var, SrcLane);

}

int32_t shuffleDown(uint64_t Mask, int32_t Var, uint32_t Delta, int32_t Width) {

return getThreadEnvironment()->shuffleDown(Mask, Var, Delta);

}

jdoerfert (Unsubmitted, Done): Pass the mask, both times.

} // namespace impl

#pragma omp end declare variant

uint64_t utils::pack(uint32_t LowBits, uint32_t HighBits) {

return impl::Pack(LowBits, HighBits);

}

void utils::unpack(uint64_t Val, uint32_t &LowBits, uint32_t &HighBits) {

impl::Unpack(Val, &LowBits, &HighBits);

}
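
Editor's sketch: regarding the question of whether the packing helpers could be shared across targets, the shift-and-mask form already used in the AMDGCN variant is plain C++ and does not depend on the architecture. The standalone version below is illustrative only; `pack`, `unpack`, and `checkRoundTrip` are local names, not the patch's `utils::` entry points.

```cpp
#include <cassert>
#include <cstdint>

// Combine two 32-bit halves into one 64-bit value (HighBits in the upper half).
inline uint64_t pack(uint32_t LowBits, uint32_t HighBits) {
  return (uint64_t(HighBits) << 32) | uint64_t(LowBits);
}

// Split a 64-bit value back into its two 32-bit halves.
inline void unpack(uint64_t Val, uint32_t &LowBits, uint32_t &HighBits) {
  LowBits = uint32_t(Val & 0xFFFFFFFFull);
  HighBits = uint32_t(Val >> 32);
}

// Helper for illustration: unpacking and re-packing must return the input.
inline void checkRoundTrip(uint64_t Val) {
  uint32_t Low = 0, High = 0;
  unpack(Val, Low, High);
  assert(pack(Low, High) == Val);
}
```

Using `uint64_t` literals with the `ull` suffix avoids the `sizeof(unsigned long) == 8` assumption that the AMDGCN variant guards with a `static_assert`.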

(26 lines not shown)

openmp/libomptarget/plugins/CMakeLists.txt

	(69 lines not shown)
	endmacro()			endmacro()

	add_subdirectory(aarch64)			add_subdirectory(aarch64)
	add_subdirectory(amdgpu)			add_subdirectory(amdgpu)
	add_subdirectory(cuda)			add_subdirectory(cuda)
	add_subdirectory(ppc64)			add_subdirectory(ppc64)
	add_subdirectory(ppc64le)			add_subdirectory(ppc64le)
	add_subdirectory(ve)			add_subdirectory(ve)
				add_subdirectory(vgpu)
	add_subdirectory(x86_64)			add_subdirectory(x86_64)
	add_subdirectory(remote)			add_subdirectory(remote)

	# Make sure the parent scope can see the plugins that will be created.			# Make sure the parent scope can see the plugins that will be created.
	set(LIBOMPTARGET_SYSTEM_TARGETS "${LIBOMPTARGET_SYSTEM_TARGETS}" PARENT_SCOPE)			set(LIBOMPTARGET_SYSTEM_TARGETS "${LIBOMPTARGET_SYSTEM_TARGETS}" PARENT_SCOPE)
	set(LIBOMPTARGET_TESTED_PLUGINS "${LIBOMPTARGET_TESTED_PLUGINS}" PARENT_SCOPE)			set(LIBOMPTARGET_TESTED_PLUGINS "${LIBOMPTARGET_TESTED_PLUGINS}" PARENT_SCOPE)

openmp/libomptarget/plugins/vgpu/CMakeLists.txt

This file was added.

				###===----------------------------------------------------------------------===##
				# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				# See https://llvm.org/LICENSE.txt for license information.
				# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				#
				##===----------------------------------------------------------------------===##
				#
				# Build the VGPU plugin for virtual GPU offloading.
				#
				##===----------------------------------------------------------------------===#

				if (NOT(LIBOMPTARGET_ENABLE_EXPERIMENTAL_VGPU_PLUGIN))
				return()
				endif()

				macro(build_generic_elf64_vgpu tmachine tmachine_name tmachine_libname tmachine_triple elf_machine_id)
				if(CMAKE_SYSTEM_PROCESSOR MATCHES "${tmachine}$")
				if(LIBOMPTARGET_DEP_LIBELF_FOUND)
				if(LIBOMPTARGET_DEP_LIBFFI_FOUND)
				libomptarget_say("Building ${tmachine_triple}-${tmachine_name} offloading plugin.")

				include_directories(${LIBOMPTARGET_DEP_LIBFFI_INCLUDE_DIR})
				include_directories(${LIBOMPTARGET_DEP_LIBELF_INCLUDE_DIR})
				include_directories(${LIBOMPTARGET_INCLUDE_DIR})

				# Define macro to be used as prefix of the runtime messages for this target.
				add_definitions("-DTARGET_NAME=${tmachine_name}")

				# Define macro with the ELF ID for this target.
				add_definitions("-DTARGET_ELF_ID=${elf_machine_id}")

				add_library("omptarget.rtl.${tmachine_libname}" SHARED
				${CMAKE_CURRENT_SOURCE_DIR}/src/rtl.cpp
				${CMAKE_CURRENT_SOURCE_DIR}/src/ThreadEnvironment.cpp
				${CMAKE_CURRENT_SOURCE_DIR}/src/ThreadEnvironmentImpl.cpp)

				# Install plugin under the lib destination folder.
				install(TARGETS "omptarget.rtl.${tmachine_libname}"
				LIBRARY DESTINATION "${OPENMP_INSTALL_LIBDIR}")

				set_target_properties("omptarget.rtl.${tmachine_libname}" PROPERTIES CXX_STANDARD 20)

				target_link_libraries(
				"omptarget.rtl.${tmachine_libname}"
				elf_common
				${LIBOMPTARGET_DEP_LIBFFI_LIBRARIES}
				${LIBOMPTARGET_DEP_LIBELF_LIBRARIES}
				dl
				# ${OPENMP_PTHREAD_LIB}
				"-rdynamic"
				"-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/../exports"
				)

				list(APPEND LIBOMPTARGET_TESTED_PLUGINS
				"omptarget.rtl.${tmachine_libname}")

				# Report to the parent scope that we are building a plugin.
				set(LIBOMPTARGET_SYSTEM_TARGETS
				"${LIBOMPTARGET_SYSTEM_TARGETS} ${tmachine_triple}" PARENT_SCOPE)
				set(LIBOMPTARGET_TESTED_PLUGINS
				"${LIBOMPTARGET_TESTED_PLUGINS}" PARENT_SCOPE)
				else(LIBOMPTARGET_DEP_LIBFFI_FOUND)
				libomptarget_say("Not building ${tmachine_name} offloading plugin: libffi dependency not found.")
				endif(LIBOMPTARGET_DEP_LIBFFI_FOUND)
				else(LIBOMPTARGET_DEP_LIBELF_FOUND)
				libomptarget_say("Not building ${tmachine_name} offloading plugin: libelf dependency not found.")
				endif(LIBOMPTARGET_DEP_LIBELF_FOUND)
				else()
				libomptarget_say("Not building ${tmachine_name}-vgpu offloading plugin: machine not found in the system.")
				endif()
				endmacro()

				build_generic_elf64_vgpu("x86_64" "vgpu" "vgpu" "x86_64-vgpu" "62")

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.h

This file was added.

				//===---- ThreadEnvironment.h - Virtual GPU thread environment ----- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENT_H
				#define OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENT_H

				using LaneMaskTy = uint64_t;

				// Forward declaration
				class WarpEnvironmentTy;
				class ThreadBlockEnvironmentTy;
				class CTAEnvironmentTy;
				namespace VGPUImpl {
				class ThreadEnvironmentTy;
				void initLock(uint32_t *Lock);
				void destroyLock(uint32_t *Lock);
				void setLock(uint32_t *Lock);
				void unsetLock(uint32_t *Lock);
				bool testLock(uint32_t *Lock);
				uint32_t atomicInc(uint32_t *Address, uint32_t Val, int Ordering);
				} // namespace VGPUImpl

				class ThreadEnvironmentTy {
				VGPUImpl::ThreadEnvironmentTy *Impl;

				public:
				ThreadEnvironmentTy(WarpEnvironmentTy *WE, CTAEnvironmentTy *CTAE);

				~ThreadEnvironmentTy();

				unsigned getThreadIdInWarp() const;

				unsigned getThreadIdInBlock() const;

				unsigned getGlobalThreadId() const;

				unsigned getBlockSize() const;

				unsigned getKernelSize() const;

				unsigned getBlockId() const;

				unsigned getNumberOfBlocks() const;

				LaneMaskTy getActiveMask() const;

				unsigned getWarpSize() const;

				int32_t shuffle(uint64_t Mask, int32_t Var, uint64_t SrcLane);

				int32_t shuffleDown(uint64_t Mask, int32_t Var, uint32_t Delta);

				void fenceKernel(int32_t MemoryOrder);

				void fenceTeam(int MemoryOrder);

				void syncWarp(int Mask);

				void namedBarrier(bool Generic);

				void setBlockEnv(ThreadBlockEnvironmentTy *TBE);

				void resetBlockEnv();
				};

				ThreadEnvironmentTy *getThreadEnvironment(void);

				#endif // OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENT_H

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp

This file was added.

				//===---- ThreadEnvironment.cpp - Virtual GPU thread environment -- C++ ---===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Implementation of VGPU environment classes.
				//
				//===----------------------------------------------------------------------===//
				//
				#include <cstdint>

				#include "ThreadEnvironment.h"
				#include "ThreadEnvironmentImpl.h"
				#include <barrier>
				#include <mutex>

				std::mutex AtomicIncLock;

				uint32_t VGPUImpl::atomicInc(uint32_t *Address, uint32_t Val, int Ordering) {
				std::lock_guard G(AtomicIncLock);
				uint32_t V = *Address;
				if (V >= Val)
				*Address = 0;
				else
				*Address += 1;
				return V;
				}
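
Editor's sketch: `VGPUImpl::atomicInc` above implements a wrapping increment (store 0 once the old value reaches `Val`, otherwise add 1, and return the old value) under a global mutex. The same semantics can be expressed with a compare-exchange loop. This is an illustration under assumptions, not the patch's code: `atomicIncWrap` is a hypothetical name, and it takes a `std::atomic<uint32_t>` rather than the raw `uint32_t *` the plugin receives (C++20 `std::atomic_ref` would be one way to bridge that gap).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Returns the old value; stores 0 if old >= Limit, otherwise old + 1.
inline uint32_t atomicIncWrap(std::atomic<uint32_t> &Counter, uint32_t Limit) {
  uint32_t Old = Counter.load(std::memory_order_relaxed);
  uint32_t New;
  do {
    New = (Old >= Limit) ? 0 : Old + 1;
    // On failure, compare_exchange_weak reloads Old and the loop recomputes New.
  } while (!Counter.compare_exchange_weak(Old, New, std::memory_order_seq_cst,
                                          std::memory_order_relaxed));
  return Old;
}

// Usage example: with Limit == 2 the counter cycles 0, 1, 2, 0, ...
inline void atomicIncWrapDemo() {
  std::atomic<uint32_t> Counter{0};
  assert(atomicIncWrap(Counter, 2) == 0 && Counter.load() == 1);
  assert(atomicIncWrap(Counter, 2) == 1 && Counter.load() == 2);
  assert(atomicIncWrap(Counter, 2) == 2 && Counter.load() == 0);
}
```

The sketch always uses sequentially consistent ordering on success; mapping the `Ordering` argument of the patch onto C++ memory orders is not attempted here.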

				void VGPUImpl::initLock(uint32_t *Lock) { Lock = (uint32_t *)new std::mutex; }

				void VGPUImpl::destroyLock(uint32_t *Lock) {
				std::mutex *Mtx = (std::mutex *)Lock;
				delete Mtx;
				}

				void VGPUImpl::setLock(uint32_t *Lock) { ((std::mutex *)Lock)->lock(); }

				void VGPUImpl::unsetLock(uint32_t *Lock) { ((std::mutex *)Lock)->unlock(); }

				bool VGPUImpl::testLock(uint32_t *Lock) {
				return ((std::mutex *)Lock)->try_lock();
				}

				extern thread_local ThreadEnvironmentTy *ThreadEnvironment;

				ThreadEnvironmentTy *getThreadEnvironment() { return ThreadEnvironment; }

				jdoerfert (Unsubmitted, Not Done): See above.
				ThreadEnvironmentTy::ThreadEnvironmentTy(WarpEnvironmentTy *WE,
				CTAEnvironmentTy *CTAE)
				: Impl(new VGPUImpl::ThreadEnvironmentTy(WE, CTAE)) {}

				ThreadEnvironmentTy::~ThreadEnvironmentTy() { delete Impl; }

				void ThreadEnvironmentTy::fenceTeam(int Ordering) { Impl->fenceTeam(Ordering); }

				void ThreadEnvironmentTy::syncWarp(int Ordering) { Impl->syncWarp(Ordering); }

				unsigned ThreadEnvironmentTy::getThreadIdInWarp() const {
				return Impl->getThreadIdInWarp();
				}

				unsigned ThreadEnvironmentTy::getThreadIdInBlock() const {
				return Impl->getThreadIdInBlock();
				}

				unsigned ThreadEnvironmentTy::getGlobalThreadId() const {
				return Impl->getGlobalThreadId();
				}

				unsigned ThreadEnvironmentTy::getBlockSize() const {
				return Impl->getBlockSize();
				}

				unsigned ThreadEnvironmentTy::getKernelSize() const {
				return Impl->getKernelSize();
				}

				unsigned ThreadEnvironmentTy::getBlockId() const { return Impl->getBlockId(); }

				unsigned ThreadEnvironmentTy::getNumberOfBlocks() const {
				return Impl->getNumberOfBlocks();
				}

				LaneMaskTy ThreadEnvironmentTy::getActiveMask() const {
				return Impl->getActiveMask();
				}

				int32_t ThreadEnvironmentTy::shuffle(uint64_t Mask, int32_t Var,
				uint64_t SrcLane) {
				return Impl->shuffle(Mask, Var, SrcLane);
				}

				int32_t ThreadEnvironmentTy::shuffleDown(uint64_t Mask, int32_t Var,
				uint32_t Delta) {
				return Impl->shuffleDown(Mask, Var, Delta);
				}

				void ThreadEnvironmentTy::fenceKernel(int32_t MemoryOrder) {
				return Impl->fenceKernel(MemoryOrder);
				}

				void ThreadEnvironmentTy::namedBarrier(bool Generic) {
				Impl->namedBarrier(Generic);
				}

				void ThreadEnvironmentTy::setBlockEnv(ThreadBlockEnvironmentTy *TBE) {
				Impl->setBlockEnv(TBE);
				}

				void ThreadEnvironmentTy::resetBlockEnv() { Impl->resetBlockEnv(); }

				unsigned ThreadEnvironmentTy::getWarpSize() const {
				return Impl->getWarpSize();
				}

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h

This file was added.

				//===---- ThreadEnvironmentImpl.h - Virtual GPU thread environment - C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENTIMPL_H
				#define OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENTIMPL_H

				#include "ThreadEnvironment.h"
				#include <barrier>
				#include <cstdio>
				#include <functional>
				#include <map>
				#include <thread>
				#include <vector>

				using BarrierTy = std::barrier<std::function<void(void)>>;

				class WarpEnvironmentTy {
				static unsigned Idx;

				const unsigned ID;

				std::vector<int32_t> ShuffleBuffer;

				BarrierTy Barrier;
				BarrierTy ShuffleBarrier;
				BarrierTy ShuffleDownBarrier;

				public:
				static void configure(unsigned NumThreadsInWarp);

				static unsigned ThreadsPerWarp;

				WarpEnvironmentTy();

				unsigned getWarpId() const;
				int getNumThreads() const;

				void sync(int Ordering);
				void writeShuffleBuffer(int32_t Var, unsigned LaneId);

				int32_t getShuffleBuffer(unsigned LaneId);

				void waitShuffleBarrier();
				void waitShuffleDownBarrier();
				};

				class CTAEnvironmentTy {
				static unsigned Idx;

				public:
				unsigned ID;
				static unsigned NumThreads;
				static unsigned NumCTAs;

				BarrierTy Barrier;
				BarrierTy SyncThreads;
				BarrierTy NamedBarrier;

				static void configure(unsigned TotalNumThreads, unsigned NumBlocksInCTA);

				CTAEnvironmentTy();

				unsigned getId() const;
				unsigned getNumThreads() const;

				unsigned getNumBlocks() const;

				void fence(int Ordering);
				void syncThreads();
				void namedBarrier();
				};

				class ThreadBlockEnvironmentTy {
				unsigned ID;
				unsigned NumBlocks;

				public:
				ThreadBlockEnvironmentTy(unsigned ID, unsigned NumBlocks);

				unsigned getId() const;
				unsigned getNumBlocks() const;
				};

				namespace VGPUImpl {
				class ThreadEnvironmentTy {
				static unsigned Idx;
				unsigned ThreadIdInWarp;
				unsigned ThreadIdInBlock;
				unsigned GlobalThreadIdx;

				WarpEnvironmentTy *WarpEnvironment;
				ThreadBlockEnvironmentTy *ThreadBlockEnvironment;
				CTAEnvironmentTy *CTAEnvironment;

				public:
				ThreadEnvironmentTy(WarpEnvironmentTy *WE, CTAEnvironmentTy *CTAE);

				void setBlockEnv(ThreadBlockEnvironmentTy *TBE);

				void resetBlockEnv();

				unsigned getThreadIdInWarp() const;
				unsigned getThreadIdInBlock() const;
				unsigned getGlobalThreadId() const;

				unsigned getBlockSize() const;

				unsigned getBlockId() const;

				unsigned getNumberOfBlocks() const;
				unsigned getKernelSize() const;

				// FIXME: This is wrong
				jdoerfert (Unsubmitted, Not Done): At least add more information about what the problem and potential solutions are.
				LaneMaskTy getActiveMask() const;

				void fenceTeam(int Ordering);
				void syncWarp(int Ordering);

				int32_t shuffle(uint64_t Mask, int32_t Var, uint64_t SrcLane);

				int32_t shuffleDown(uint64_t Mask, int32_t Var, uint32_t Delta);

				void namedBarrier(bool Generic);

				void fenceKernel(int32_t MemoryOrder);

				unsigned getWarpSize() const;
				};

				} // namespace VGPUImpl

				#endif // OPENMP_LIBOMPTARGET_PLUGINS_VGPU_SRC_THREADENVIRONMENTIMPL_H

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.cpp

This file was added.

				//===-- ThreadEnvironmentImpl.cpp - Virtual GPU thread environment - C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include <cstdint>

				#include "ThreadEnvironmentImpl.h"
				#include <barrier>
				#include <cstdio>
				#include <functional>
				#include <map>
				#include <thread>
				#include <vector>

				void WarpEnvironmentTy::configure(unsigned NumThreads) {
				ThreadsPerWarp = NumThreads;
				}

				WarpEnvironmentTy::WarpEnvironmentTy()
				: ID(Idx++), ShuffleBuffer(ThreadsPerWarp),
				Barrier(ThreadsPerWarp, []() {}), ShuffleBarrier(ThreadsPerWarp, []() {}),
				ShuffleDownBarrier(ThreadsPerWarp, []() {}) {}

				unsigned WarpEnvironmentTy::getWarpId() const { return ID; }

				int WarpEnvironmentTy::getNumThreads() const { return ThreadsPerWarp; }

				void WarpEnvironmentTy::sync(int Ordering) { Barrier.arrive_and_wait(); }

				void WarpEnvironmentTy::writeShuffleBuffer(int32_t Var, unsigned LaneId) {
				ShuffleBuffer[LaneId] = Var;
				}

				int32_t WarpEnvironmentTy::getShuffleBuffer(unsigned LaneId) {
				return ShuffleBuffer[LaneId];
				}

				void WarpEnvironmentTy::waitShuffleBarrier() {
				ShuffleBarrier.arrive_and_wait();
				}

				void WarpEnvironmentTy::waitShuffleDownBarrier() {
				ShuffleDownBarrier.arrive_and_wait();
				}

				unsigned WarpEnvironmentTy::Idx = 0;
				unsigned WarpEnvironmentTy::ThreadsPerWarp = 0;

				void CTAEnvironmentTy::configure(unsigned TotalNumThreads, unsigned NumBlocks) {
				NumThreads = TotalNumThreads / NumBlocks;
				NumCTAs = NumBlocks;
				}

				CTAEnvironmentTy::CTAEnvironmentTy()
				: ID(Idx++), Barrier(NumThreads, []() {}), SyncThreads(NumThreads, []() {}),
				NamedBarrier(NumThreads, []() {}) {}

				unsigned CTAEnvironmentTy::getId() const { return ID; }
				unsigned CTAEnvironmentTy::getNumThreads() const { return NumThreads; }

				unsigned CTAEnvironmentTy::getNumBlocks() const { return NumCTAs; }

				void CTAEnvironmentTy::fence(int Ordering) { Barrier.arrive_and_wait(); }
				void CTAEnvironmentTy::syncThreads() { SyncThreads.arrive_and_wait(); }
				void CTAEnvironmentTy::namedBarrier() { NamedBarrier.arrive_and_wait(); }

				unsigned CTAEnvironmentTy::Idx = 0;
				unsigned CTAEnvironmentTy::NumThreads = 0;
				unsigned CTAEnvironmentTy::NumCTAs = 0;

				ThreadBlockEnvironmentTy::ThreadBlockEnvironmentTy(unsigned ID,
				unsigned NumBlocks)
				: ID(ID), NumBlocks(NumBlocks) {}

				unsigned ThreadBlockEnvironmentTy::getId() const { return ID; }
				unsigned ThreadBlockEnvironmentTy::getNumBlocks() const { return NumBlocks; }

				namespace VGPUImpl {
				ThreadEnvironmentTy::ThreadEnvironmentTy(WarpEnvironmentTy *WE,
				CTAEnvironmentTy *CTAE)
				: ThreadIdInWarp(Idx++ % WE->getNumThreads()),
				jdoerfert (Unsubmitted, Not Done): This is racy, I think. Can we use atomic_add for all these Idx updates or pass the Id from the outside?
				ThreadIdInBlock(WE->getWarpId() * WE->getNumThreads() + ThreadIdInWarp),
				GlobalThreadIdx(CTAE->getId() * CTAE->getNumThreads() + ThreadIdInBlock),
				WarpEnvironment(WE), CTAEnvironment(CTAE) {}
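
Reviewer note (illustrative): one way to act on the race comment is to pass the indices in from the spawning loop instead of incrementing a shared static counter inside the constructor. The sketch below mirrors the arithmetic of the initializer list. `ThreadIds` and `computeThreadIds` are hypothetical names that do not appear in the patch, and the sketch assumes the warp id counts warps within a single CTA.

```cpp
#include <cassert>

// Hypothetical plain-data result; the patch keeps these as class members.
struct ThreadIds {
  unsigned InWarp;
  unsigned InBlock;
  unsigned Global;
};

// Derives all three ids from loop indices known to the spawning thread,
// so no shared counter is mutated concurrently.
// Assumption: WarpIdInCTA counts warps within one CTA, starting at 0.
ThreadIds computeThreadIds(unsigned CTAId, unsigned WarpIdInCTA,
                           unsigned LaneId, unsigned ThreadsPerWarp,
                           unsigned ThreadsPerCTA) {
  assert(LaneId < ThreadsPerWarp && "lane index outside the warp");
  ThreadIds Ids;
  Ids.InWarp = LaneId;
  Ids.InBlock = WarpIdInCTA * ThreadsPerWarp + LaneId;
  Ids.Global = CTAId * ThreadsPerCTA + Ids.InBlock;
  return Ids;
}
```

An atomic `fetch_add` would make each index unique, but uniqueness alone may not guarantee that every warp sees lanes 0 to N-1, which is why this sketch derives the ids from the loop position.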

				void ThreadEnvironmentTy::setBlockEnv(ThreadBlockEnvironmentTy *TBE) {
				ThreadBlockEnvironment = TBE;
				}

				void ThreadEnvironmentTy::resetBlockEnv() {
				delete ThreadBlockEnvironment;
				ThreadBlockEnvironment = nullptr;
				}

				unsigned ThreadEnvironmentTy::getThreadIdInWarp() const {
				return ThreadIdInWarp;
				}
				unsigned ThreadEnvironmentTy::getThreadIdInBlock() const {
				return ThreadIdInBlock;
				}
				unsigned ThreadEnvironmentTy::getGlobalThreadId() const {
				return GlobalThreadIdx;
				}

				unsigned ThreadEnvironmentTy::getBlockSize() const {
				return CTAEnvironment->getNumThreads();
				}

				unsigned ThreadEnvironmentTy::getBlockId() const {
				return ThreadBlockEnvironment->getId();
				}

				unsigned ThreadEnvironmentTy::getNumberOfBlocks() const {
				return ThreadBlockEnvironment->getNumBlocks();
				}
				unsigned ThreadEnvironmentTy::getKernelSize() const {
				return getBlockSize() * getNumberOfBlocks();
				}

				// FIXME: This is wrong
				LaneMaskTy ThreadEnvironmentTy::getActiveMask() const { return ~0U; }

				void ThreadEnvironmentTy::fenceTeam(int Ordering) {
				CTAEnvironment->fence(Ordering);
				}
				void ThreadEnvironmentTy::syncWarp(int Ordering) {
				WarpEnvironment->sync(Ordering);
				}

				int32_t ThreadEnvironmentTy::shuffle(uint64_t Mask, int32_t Var,
				uint64_t SrcLane) {
				WarpEnvironment->waitShuffleBarrier();
				WarpEnvironment->writeShuffleBuffer(Var, ThreadIdInWarp);
				WarpEnvironment->waitShuffleBarrier();
				Var = WarpEnvironment->getShuffleBuffer(SrcLane);
				return Var;
				}

				int32_t ThreadEnvironmentTy::shuffleDown(uint64_t Mask, int32_t Var,
				uint32_t Delta) {
				WarpEnvironment->waitShuffleDownBarrier();
				WarpEnvironment->writeShuffleBuffer(Var, ThreadIdInWarp);
				WarpEnvironment->waitShuffleDownBarrier();
				Var = WarpEnvironment->getShuffleBuffer((ThreadIdInWarp + Delta) %
				getWarpSize());
				return Var;
				}
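
Reviewer note (illustrative): the write, barrier, read sequence in `shuffleDown` amounts to every lane reading the value published by the lane `Delta` positions after it, wrapping around at the warp size. The single-threaded model below shows that data movement only; `shuffleDownAll` is a hypothetical helper, not part of the patch, and it leaves out the barriers that the threaded version needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models one shuffle-down step for a whole warp at once.
// LaneValues[i] is the value lane i wrote to the shuffle buffer.
// The wrap-around follows the modulo used in the patch.
std::vector<int32_t> shuffleDownAll(const std::vector<int32_t> &LaneValues,
                                    uint32_t Delta) {
  assert(!LaneValues.empty() && "a warp needs at least one lane");
  const std::size_t WarpSize = LaneValues.size();
  std::vector<int32_t> Result(WarpSize);
  for (std::size_t Lane = 0; Lane < WarpSize; ++Lane)
    Result[Lane] = LaneValues[(Lane + Delta) % WarpSize];
  return Result;
}
```

For a four-lane warp holding 10, 20, 30, 40 and a delta of 1, lane 0 ends up with 20 and lane 3 wraps around to 10.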

				void ThreadEnvironmentTy::namedBarrier(bool Generic) {
				if (Generic) {
				CTAEnvironment->namedBarrier();
				} else {
				CTAEnvironment->syncThreads();
				}
				}

				void ThreadEnvironmentTy::fenceKernel(int32_t MemoryOrder) {
				std::atomic_thread_fence(static_cast<std::memory_order>(MemoryOrder));
				}

				unsigned ThreadEnvironmentTy::getWarpSize() const {
				return WarpEnvironment->getNumThreads();
				}

				unsigned ThreadEnvironmentTy::Idx = 0;

				} // namespace VGPUImpl

openmp/libomptarget/plugins/vgpu/src/rtl.cpp

This file was added.

				//===------RTLs/vgpu/src/rtl.cpp - Target RTLs Implementation ----- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// RTL for virtual (x86) GPU
				//
				//===----------------------------------------------------------------------===//

				#include <barrier>
				#include <cassert>
				#include <cmath>
				#include <condition_variable>
				#include <cstdio>
				#include <cstdlib>
				#include <cstring>
				#include <dlfcn.h>
				#include <ffi.h>
				#include <functional>
				#include <gelf.h>
				#include <link.h>
				#include <list>
				#include <memory>
				#include <mutex>
				#include <queue>
				#include <thread>
				#include <vector>

				#include "Debug.h"
				#include "ThreadEnvironment.h"
				#include "ThreadEnvironmentImpl.h"
				#include "omptarget.h"
				#include "omptargetplugin.h"

				#ifndef TARGET_NAME
				#define TARGET_NAME Generic ELF - 64bit
				#endif
				#define DEBUG_PREFIX "TARGET " GETNAME(TARGET_NAME) " RTL"

				#ifndef TARGET_ELF_ID
				#define TARGET_ELF_ID 0
				#endif

				#include "elf_common.h"

				#define OFFLOADSECTIONNAME "omp_offloading_entries"

				#define DEBUG false

				struct FFICallTy {
				ffi_cif CIF;
				std::vector<ffi_type *> ArgsTypes;
				std::vector<void *> Args;
				std::vector<void *> Ptrs;
				void (*Entry)(void);

				FFICallTy(int32_t ArgNum, void **TgtArgs, ptrdiff_t *TgtOffsets,
				void *TgtEntryPtr)
				: ArgsTypes(ArgNum, &ffi_type_pointer), Args(ArgNum), Ptrs(ArgNum) {
				for (int32_t i = 0; i < ArgNum; ++i) {
				Ptrs[i] = (void *)((intptr_t)TgtArgs[i] + TgtOffsets[i]);
				Args[i] = &Ptrs[i];
				}

				ffi_status status = ffi_prep_cif(&CIF, FFI_DEFAULT_ABI, ArgNum,
				&ffi_type_void, &ArgsTypes[0]);

				assert(status == FFI_OK && "Unable to prepare target launch!");

				*((void **)&Entry) = TgtEntryPtr;
				}
				};
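
Reviewer note (illustrative): the constructor above adds each per-argument offset to its base pointer before handing the arguments to libffi. The sketch below isolates that step without depending on libffi. `applyOffsets` is a hypothetical helper name, and the pointer arithmetic goes through `char *` rather than `intptr_t` purely as a stylistic choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns TgtArgs[i] + TgtOffsets[i] for every argument.
// The patch additionally stores the address of each result, because
// libffi expects an array of pointers to the argument values.
std::vector<void *> applyOffsets(void **TgtArgs,
                                 const std::ptrdiff_t *TgtOffsets,
                                 int32_t ArgNum) {
  assert(ArgNum >= 0 && "negative argument count");
  std::vector<void *> Ptrs(static_cast<std::size_t>(ArgNum));
  for (int32_t I = 0; I < ArgNum; ++I)
    Ptrs[I] = static_cast<char *>(TgtArgs[I]) + TgtOffsets[I];
  return Ptrs;
}
```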

				/// Array of Dynamic libraries loaded for this target.
				struct DynLibTy {
				char *FileName;
				void *Handle;
				};

				/// Keep entries table per device.
				struct FuncOrGblEntryTy {
				__tgt_target_table Table;
				};

				thread_local ThreadEnvironmentTy *ThreadEnvironment;

				/// Class containing all the device information.
				class RTLDeviceInfoTy {
				std::vector<std::list<FuncOrGblEntryTy>> FuncGblEntries;

				public:
				std::list<DynLibTy> DynLibs;

				// Record entry point associated with device.
				void createOffloadTable(int32_t device_id, __tgt_offload_entry *begin,
				__tgt_offload_entry *end) {
				assert(device_id < (int32_t)FuncGblEntries.size() &&
				"Unexpected device id!");
				FuncGblEntries[device_id].emplace_back();
				FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();

				E.Table.EntriesBegin = begin;
				E.Table.EntriesEnd = end;
				}

				// Return true if the entry is associated with device.
				bool findOffloadEntry(int32_t device_id, void *addr) {
				assert(device_id < (int32_t)FuncGblEntries.size() &&
				"Unexpected device id!");
				FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();

				for (__tgt_offload_entry *i = E.Table.EntriesBegin, *e = E.Table.EntriesEnd;
				i < e; ++i) {
				if (i->addr == addr)
				return true;
				}

				return false;
				}

				// Return the pointer to the target entries table.
				__tgt_target_table *getOffloadEntriesTable(int32_t device_id) {
				assert(device_id < (int32_t)FuncGblEntries.size() &&
				"Unexpected device id!");
				FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();

				return &E.Table;
				}

				RTLDeviceInfoTy() : FuncGblEntries(1) {}

				~RTLDeviceInfoTy() {
				// Close dynamic libraries
				for (auto &lib : DynLibs) {
				if (lib.Handle) {
				dlclose(lib.Handle);
				remove(lib.FileName);
				}
				}
				}
				};

				static RTLDeviceInfoTy DeviceInfo;

				std::vector<CTAEnvironmentTy *> CTAEnvironments;
				std::vector<WarpEnvironmentTy *> WarpEnvironments;

				struct VGPUTy {
				struct KernelTy {
				FFICallTy *Call;
				int NumTeams;

				KernelTy(FFICallTy *Call, int NumTeams) : Call(Call), NumTeams(NumTeams) {}
				};

				struct VGPUStreamTy {
				std::queue<KernelTy> Kernels;
				std::mutex Mtx;

				void emplace(FFICallTy *Call, int NumTeams) {
				std::lock_guard Guard(Mtx);
				Kernels.emplace(Call, NumTeams);
				}

				KernelTy front() {
				std::lock_guard Guard(Mtx);
				return Kernels.front();
				}

				void pop() {
				std::lock_guard Guard(Mtx);
				Kernels.pop();
				}

				bool empty() {
				std::lock_guard Guard(Mtx);
				return Kernels.empty();
				}
				};

				struct AsyncInfoQueueTy {
				std::deque<__tgt_async_info *> Streams;
				std::mutex Mtx;

				bool empty() {
				std::lock_guard Guard(Mtx);
				return Streams.empty();
				}

				__tgt_async_info *front() {
				std::lock_guard Guard(Mtx);
				return Streams.front();
				}

				void pop() {
				std::lock_guard Guard(Mtx);
				Streams.pop_front();
				}

				void emplace(__tgt_async_info *AsyncInfo) {
				std::lock_guard Guard(Mtx);
				Streams.emplace_back(AsyncInfo);
				}
				} ExecutionQueue;

				VGPUStreamTy *getStream(__tgt_async_info *AsyncInfo) {
				assert(AsyncInfo != nullptr && "async_info ptr was null");

				if (!AsyncInfo->Queue)
				AsyncInfo->Queue = new VGPUStreamTy();

				return reinterpret_cast<VGPUStreamTy *>(AsyncInfo->Queue);
				}

				std::atomic<bool> Running;
				std::vector<std::thread> Threads;
				int WarpsPerCTA = -1;
				int NumCTAs = -1;
				int NumThreads = -1;

				std::unique_ptr<std::barrier<std::function<void(void)>>> Barrier;
				std::condition_variable WorkAvailable;
				std::mutex WorkDoneMtx;
				std::condition_variable WorkDone;

				void configureArchitecture() {
				int ThreadsPerWarp = -1;

				if (const char *Env = std::getenv("VGPU_NUM_THREADS"))
				NumThreads = std::stoi(Env);
				if (const char *Env = std::getenv("VGPU_THREADS_PER_WARP"))
				ThreadsPerWarp = std::stoi(Env);
				if (const char *Env = std::getenv("VGPU_WARPS_PER_CTA"))
				WarpsPerCTA = std::stoi(Env);

				if (NumThreads == -1)
				NumThreads = std::thread::hardware_concurrency();
				if (ThreadsPerWarp == -1)
				ThreadsPerWarp = NumThreads;
				if (WarpsPerCTA == -1)
				WarpsPerCTA = 1;

				NumCTAs = NumThreads / (ThreadsPerWarp * WarpsPerCTA);

				assert(NumThreads % ThreadsPerWarp == 0 && NumThreads % WarpsPerCTA == 0 &&
				"Invalid VGPU Config");

				DP("NumThreads: %d, ThreadsPerWarp: %d, WarpsPerCTA: %d\n", NumThreads,
				ThreadsPerWarp, WarpsPerCTA);

				CTAEnvironmentTy::configure(NumThreads, NumCTAs);
				WarpEnvironmentTy::configure(ThreadsPerWarp);
				}
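
Reviewer note (illustrative): the defaults and the CTA count computed in `configureArchitecture` can be checked in isolation. The sketch below is a hypothetical pure function (`deriveConfig` and `VGPUConfig` are not in the patch) that takes the three settings as integers, with -1 meaning "not set", instead of reading environment variables. It assumes a stricter validity rule than the assertion in the patch: the thread count must be a multiple of threads per warp times warps per CTA.

```cpp
#include <cassert>

// Hypothetical container for the derived launch geometry.
struct VGPUConfig {
  int NumThreads;
  int ThreadsPerWarp;
  int WarpsPerCTA;
  int NumCTAs;
};

// Fills Config and returns true when the geometry divides evenly.
// A value of -1 means "not set" and triggers the same defaults as the patch.
bool deriveConfig(int NumThreads, int ThreadsPerWarp, int WarpsPerCTA,
                  int HardwareThreads, VGPUConfig &Config) {
  assert(HardwareThreads > 0 && "need a positive fallback thread count");
  if (NumThreads == -1)
    NumThreads = HardwareThreads;
  if (ThreadsPerWarp == -1)
    ThreadsPerWarp = NumThreads;
  if (WarpsPerCTA == -1)
    WarpsPerCTA = 1;
  // Reject non-positive values up front so the division below is safe.
  if (NumThreads <= 0 || ThreadsPerWarp <= 0 || WarpsPerCTA <= 0)
    return false;
  const int ThreadsPerCTA = ThreadsPerWarp * WarpsPerCTA;
  if (NumThreads % ThreadsPerCTA != 0)
    return false;
  Config.NumThreads = NumThreads;
  Config.ThreadsPerWarp = ThreadsPerWarp;
  Config.WarpsPerCTA = WarpsPerCTA;
  Config.NumCTAs = NumThreads / ThreadsPerCTA;
  return true;
}
```

With 12 threads, 4 threads per warp, and 2 warps per CTA, both remainders tested by the patch's assertion are zero, yet 12 is not a multiple of 8; the sketch rejects that case.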

				VGPUTy() : Running(true) {
				configureArchitecture();

				Barrier = std::make_unique<BarrierTy>(NumThreads, []() {});
				Threads.reserve(NumThreads);

				auto GlobalThreadIdx = 0;
				for (auto CTAIdx = 0; CTAIdx < CTAEnvironmentTy::NumCTAs; CTAIdx++) {
				auto *CTAEnv = new CTAEnvironmentTy();
				for (auto WarpIdx = 0; WarpIdx < WarpsPerCTA; WarpIdx++) {
				auto *WarpEnv = new WarpEnvironmentTy();
				for (auto ThreadIdx = 0; ThreadIdx < WarpEnvironmentTy::ThreadsPerWarp;
				ThreadIdx++) {
				Threads.emplace_back([this, GlobalThreadIdx, CTAEnv, WarpEnv]() {
				jdoerfert (Unsubmitted, Not Done): Move the lambda into a helper function. Indentation of 12 is too much.
				ThreadEnvironment = new ThreadEnvironmentTy(WarpEnv, CTAEnv);
				while (Running) {
				{
				std::unique_lock<std::mutex> UniqueLock(ExecutionQueue.Mtx);

				WorkAvailable.wait(UniqueLock, [&]() {
				if (!Running)
				return true;

				bool IsEmpty = ExecutionQueue.Streams.empty();

				return !IsEmpty;
				});
				}

				if (ExecutionQueue.empty())
				continue;

				while (!ExecutionQueue.empty()) {
				auto *Stream = getStream(ExecutionQueue.front());
				while (!Stream->empty()) {
				auto [Call, NumTeams] = Stream->front();

				runKernel(CTAEnv, Call, NumTeams);

				if (GlobalThreadIdx == 0) {
				Stream->pop();
				delete Call;
				}

				Barrier->arrive_and_wait();
				}
				if (GlobalThreadIdx == 0) {
				jdoerfert (Unsubmitted, Not Done): Can we split this up and create some helper functions maybe?
				ExecutionQueue.pop();
				WorkDone.notify_all();
				}
				Barrier->arrive_and_wait();
				}
				}
				delete ThreadEnvironment;
				});
				GlobalThreadIdx = (GlobalThreadIdx + 1) % NumThreads;
				jdoerfert (Unsubmitted, Not Done): When do we have more threads than NumThreads?
				}
				WarpEnvironments.push_back(WarpEnv);
				}
				CTAEnvironments.push_back(CTAEnv);
				}
				}

				void runKernel(CTAEnvironmentTy *CTAEnv, FFICallTy *Call, int NumTeams) {
				unsigned TeamIdx = 0;
				while (TeamIdx < NumTeams) {
				if (CTAEnv->getId() < NumTeams) {
				ThreadEnvironment->setBlockEnv(
				new ThreadBlockEnvironmentTy(TeamIdx + CTAEnv->getId(), NumTeams));
				ffi_call(&Call->CIF, Call->Entry, NULL, &(Call->Args)[0]);
				ThreadEnvironment->resetBlockEnv();
				}
				Barrier->arrive_and_wait();
				TeamIdx += NumCTAs;
				}
				}
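
Reviewer note (illustrative): `runKernel` hands out teams to CTAs in rounds of `NumCTAs`, so a CTA with id `c` runs teams `c`, `c + NumCTAs`, `c + 2 * NumCTAs`, and so on. The sketch below lists the teams a single CTA would run. `teamsForCTA` is a hypothetical helper, and its guard compares `TeamIdx + CTAId` against `NumTeams`, whereas the patch as shown compares only the CTA id; the stricter bound is an assumption about the intent.

```cpp
#include <cassert>
#include <vector>

// Lists the team ids one CTA executes under round-robin distribution.
std::vector<unsigned> teamsForCTA(unsigned CTAId, unsigned NumCTAs,
                                  unsigned NumTeams) {
  assert(NumCTAs > 0 && CTAId < NumCTAs && "invalid CTA index");
  std::vector<unsigned> Teams;
  for (unsigned TeamIdx = 0; TeamIdx < NumTeams; TeamIdx += NumCTAs) {
    // Skip the slot when this round has fewer teams left than CTAs.
    if (TeamIdx + CTAId < NumTeams)
      Teams.push_back(TeamIdx + CTAId);
  }
  return Teams;
}
```

With 4 CTAs and 10 teams, CTA 1 runs teams 1, 5, and 9, while CTA 3 runs only teams 3 and 7 because team 11 does not exist.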

				~VGPUTy() {
				awaitAll();

				Running = false;
				WorkAvailable.notify_all();

				for (auto &Thread : Threads) {
				if (Thread.joinable())
				Thread.join();
				}

				for (auto *CTAEnv : CTAEnvironments)
				delete CTAEnv;

				for (auto *WarpEnv : WarpEnvironments)
				delete WarpEnv;
				}

				void await(__tgt_async_info *AsyncInfo) {
				std::unique_lock UniqueLock(getStream(AsyncInfo)->Mtx);
				WorkDone.wait(UniqueLock,
				[&]() { return getStream(AsyncInfo)->Kernels.empty(); });
				}

				void awaitAll() {
				while (!ExecutionQueue.empty()) {
				await(ExecutionQueue.front());
				}
				}

				void scheduleAsync(__tgt_async_info *AsyncInfo, FFICallTy *Call,
				int NumTeams) {
				if (NumTeams == 0)
				NumTeams = NumCTAs;
				auto *Stream = getStream(AsyncInfo);
				Stream->emplace(Call, NumTeams);
				ExecutionQueue.emplace(AsyncInfo);
				WorkAvailable.notify_all();
				}
				};

				VGPUTy VGPU;

				#ifdef __cplusplus
				extern "C" {
				#endif

				int32_t __tgt_rtl_is_valid_binary(__tgt_device_image *image) {
				// If we don't have a valid ELF ID we can just fail.
				#if TARGET_ELF_ID < 1
				return 0;
				#else
				return elf_check_machine(image, TARGET_ELF_ID);
				#endif
				}

				int32_t __tgt_rtl_number_of_devices() { return 1; }

				int32_t __tgt_rtl_init_device(int32_t device_id) { return OFFLOAD_SUCCESS; }

				__tgt_target_table *__tgt_rtl_load_binary(int32_t device_id,
				__tgt_device_image *image) {

				DP("Dev %d: load binary from " DPxMOD " image\n", device_id,
				DPxPTR(image->ImageStart));

				assert(device_id >= 0 && device_id < 1 && "bad dev id");

				size_t ImageSize = (size_t)image->ImageEnd - (size_t)image->ImageStart;
				size_t NumEntries = (size_t)(image->EntriesEnd - image->EntriesBegin);
				DP("Expecting to have %zd entries defined.\n", NumEntries);

				// Is the library version incompatible with the header file?
				if (elf_version(EV_CURRENT) == EV_NONE) {
				DP("Incompatible ELF library!\n");
				return NULL;
				}

				// Obtain elf handler
				Elf *e = elf_memory((char *)image->ImageStart, ImageSize);
				if (!e) {
				DP("Unable to get ELF handle: %s!\n", elf_errmsg(-1));
				return NULL;
				}

				if (elf_kind(e) != ELF_K_ELF) {
				DP("Invalid Elf kind!\n");
				elf_end(e);
				return NULL;
				}

				// Find the entries section offset
				Elf_Scn *section = 0;
				Elf64_Off entries_offset = 0;

				size_t shstrndx;

				if (elf_getshdrstrndx(e, &shstrndx)) {
				DP("Unable to get ELF strings index!\n");
				elf_end(e);
				return NULL;
				}

				while ((section = elf_nextscn(e, section))) {
				GElf_Shdr hdr;
				gelf_getshdr(section, &hdr);

				if (!strcmp(elf_strptr(e, shstrndx, hdr.sh_name), OFFLOADSECTIONNAME)) {
				entries_offset = hdr.sh_addr;
				break;
				}
				}

				if (!entries_offset) {
				DP("Entries Section Offset Not Found\n");
				elf_end(e);
				return NULL;
				}

				DP("Offset of entries section is (" DPxMOD ").\n", DPxPTR(entries_offset));

				// load dynamic library and get the entry points. We use the dl library
				// to do the loading of the library, but we could do it directly to avoid
				// the dump to the temporary file.
				//
				// 1) Create tmp file with the library contents.
				// 2) Use dlopen to load the file and dlsym to retrieve the symbols.
				char tmp_name[] = "/tmp/tmpfile_XXXXXX";
				int tmp_fd = mkstemp(tmp_name);

				if (tmp_fd == -1) {
				elf_end(e);
				return NULL;
				}

				FILE *ftmp = fdopen(tmp_fd, "wb");

				if (!ftmp) {
				elf_end(e);
				return NULL;
				}

				fwrite(image->ImageStart, ImageSize, 1, ftmp);
				fclose(ftmp);

				DynLibTy Lib = {tmp_name, dlopen(tmp_name, RTLD_NOW | RTLD_GLOBAL)};

				if (!Lib.Handle) {
				DP("Target library loading error: %s\n", dlerror());
				elf_end(e);
				return NULL;
				}

				DeviceInfo.DynLibs.push_back(Lib);

				struct link_map *libInfo = (struct link_map *)Lib.Handle;

				// The place where the entries info is loaded is the library base address
				// plus the offset determined from the ELF file.
				Elf64_Addr entries_addr = libInfo->l_addr + entries_offset;

				DP("Pointer to first entry to be loaded is (" DPxMOD ").\n",
				DPxPTR(entries_addr));

				// Table of pointers to all the entries in the target.
				__tgt_offload_entry *entries_table = (__tgt_offload_entry *)entries_addr;

				__tgt_offload_entry *entries_begin = &entries_table[0];
				__tgt_offload_entry *entries_end = entries_begin + NumEntries;

				if (!entries_begin) {
				DP("Can't obtain entries begin\n");
				elf_end(e);
				return NULL;
				}

				DP("Entries table range is (" DPxMOD ")->(" DPxMOD ")\n",
				DPxPTR(entries_begin), DPxPTR(entries_end));
				DeviceInfo.createOffloadTable(device_id, entries_begin, entries_end);

				elf_end(e);

				return DeviceInfo.getOffloadEntriesTable(device_id);
				}

				// Sample implementation of explicit memory allocator. For this plugin all
				// kinds are equivalent to each other.
				void *__tgt_rtl_data_alloc(int32_t device_id, int64_t size, void *hst_ptr,
				int32_t kind) {
				void *ptr = NULL;

				switch (kind) {
				case TARGET_ALLOC_DEVICE:
				case TARGET_ALLOC_HOST:
				case TARGET_ALLOC_SHARED:
				case TARGET_ALLOC_DEFAULT:
				ptr = malloc(size);
				break;
				default:
				REPORT("Invalid target data allocation kind");
				}

				return ptr;
				}

				int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr, void *hst_ptr,
				int64_t size) {
				VGPU.awaitAll();
				memcpy(tgt_ptr, hst_ptr, size);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr, void *tgt_ptr,
				int64_t size) {
				VGPU.awaitAll();
				memcpy(hst_ptr, tgt_ptr, size);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_data_delete(int32_t device_id, void *tgt_ptr) {
				jdoerfert (Unsubmitted, Not Done): If we need to wait for submit/retrieve, I'd assume we need to wait here too.
				free(tgt_ptr);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_synchronize(int32_t device_id, __tgt_async_info *async_info) {
				VGPU.await(async_info);
				delete (VGPUTy::VGPUStreamTy *)async_info->Queue;
				async_info->Queue = nullptr;
				return 0;
				}

				int32_t __tgt_rtl_run_target_team_region(int32_t device_id, void *tgt_entry_ptr,
				void **tgt_args,
				ptrdiff_t *tgt_offsets,
				int32_t arg_num, int32_t team_num,
				int32_t thread_limit,
				uint64_t loop_tripcount) {
				__tgt_async_info AsyncInfo;
				int rc = __tgt_rtl_run_target_team_region_async(
				device_id, tgt_entry_ptr, tgt_args, tgt_offsets, arg_num, team_num,
				thread_limit, loop_tripcount, &AsyncInfo);

				if (rc != OFFLOAD_SUCCESS)
				return OFFLOAD_FAIL;

				return __tgt_rtl_synchronize(device_id, &AsyncInfo);
				}

				int32_t __tgt_rtl_run_target_team_region_async(
				int32_t device_id, void *tgt_entry_ptr, void **tgt_args,
				ptrdiff_t *tgt_offsets, int32_t arg_num, int32_t team_num,
				int32_t thread_limit, uint64_t loop_tripcount /*not used*/,
				__tgt_async_info *async_info) {
				DP("Running entry point at " DPxMOD "...\n", DPxPTR(tgt_entry_ptr));

				auto Call = new FFICallTy(arg_num, tgt_args, tgt_offsets, tgt_entry_ptr);

				VGPU.scheduleAsync(async_info, std::move(Call), team_num);
				return OFFLOAD_SUCCESS;
				}

				int32_t __tgt_rtl_run_target_region(int32_t device_id, void *tgt_entry_ptr,
				void **tgt_args, ptrdiff_t *tgt_offsets,
				int32_t arg_num) {
				return __tgt_rtl_run_target_team_region(device_id, tgt_entry_ptr, tgt_args,
				tgt_offsets, arg_num, 1, 1, 0);
				}

				int32_t __tgt_rtl_run_target_region_async(int32_t device_id,
				void *tgt_entry_ptr, void **tgt_args,
				ptrdiff_t *tgt_offsets,
				int32_t arg_num,
				__tgt_async_info *async_info) {
				return __tgt_rtl_run_target_team_region_async(device_id, tgt_entry_ptr,
				tgt_args, tgt_offsets, arg_num,
				1, 1, 0, async_info);
				}

				#ifdef __cplusplus
				}
				#endif

openmp/libomptarget/src/rtl.cpp

	Show All 15 Lines

	#include <cassert>			#include <cassert>
	#include <cstdlib>			#include <cstdlib>
	#include <cstring>			#include <cstring>
	#include <dlfcn.h>			#include <dlfcn.h>
	#include <mutex>			#include <mutex>
	#include <string>			#include <string>

	// List of all plugins that can support offloading.			struct PluginInfoTy {
	static const char *RTLNames[] = {			std::string Name;
	/* PowerPC target */ "libomptarget.rtl.ppc64.so",			bool IsHost;
	/* x86_64 target */ "libomptarget.rtl.x86_64.so",
	/* CUDA target */ "libomptarget.rtl.cuda.so",
	/* AArch64 target */ "libomptarget.rtl.aarch64.so",
	/* SX-Aurora VE target */ "libomptarget.rtl.ve.so",
	/* AMDGPU target */ "libomptarget.rtl.amdgpu.so",
	/* Remote target */ "libomptarget.rtl.rpc.so",
	};			};
				jdoerfert (Unsubmitted, Not Done): Introduce an environment variable, if it is set, X86 target should skip the image. Also, add a TODO such that we later look into the image and inspect it to decide automatically.

				// List of all plugins that can support offloading.
				static const PluginInfoTy Plugins[] = {
				/* PowerPC target */ {"libomptarget.rtl.ppc64.so", true},
				/* x86_64 target */ {"libomptarget.rtl.x86_64.so", true},
				/* CUDA target */ {"libomptarget.rtl.cuda.so", false},
				/* AArch64 target */ {"libomptarget.rtl.aarch64.so", true},
				/* SX-Aurora VE target */ {"libomptarget.rtl.ve.so", false},
				/* AMDGPU target */ {"libomptarget.rtl.amdgpu.so", false},
				/* Remote target */ {"libomptarget.rtl.rpc.so", false},
				/* Virtual GPU target */ {"libomptarget.rtl.vgpu.so", false}};

	PluginManager *PM;			PluginManager *PM;

	#if OMPTARGET_PROFILE_ENABLED			#if OMPTARGET_PROFILE_ENABLED
	static char *ProfileTraceFile = nullptr;			static char *ProfileTraceFile = nullptr;
	#endif			#endif

	__attribute__((constructor(101))) void init() {			__attribute__((constructor(101))) void init() {
	DP("Init target library!\n");			DP("Init target library!\n");
	Show All 38 Lines
	void RTLsTy::LoadRTLs() {			void RTLsTy::LoadRTLs() {
	// Parse environment variable OMP_TARGET_OFFLOAD (if set)			// Parse environment variable OMP_TARGET_OFFLOAD (if set)
	PM->TargetOffloadPolicy =			PM->TargetOffloadPolicy =
	(kmp_target_offload_kind_t)__kmpc_get_target_offload();			(kmp_target_offload_kind_t)__kmpc_get_target_offload();
	if (PM->TargetOffloadPolicy == tgt_disabled) {			if (PM->TargetOffloadPolicy == tgt_disabled) {
	return;			return;
	}			}

				// TODO: add ability to inspect image and decide automatically
				bool UseVGPU = false;
				if (auto *EnvFlag = std::getenv("LIBOMPTARGET_USE_VGPU"))
				UseVGPU = true;

	DP("Loading RTLs...\n");			DP("Loading RTLs...\n");

	// Attempt to open all the plugins and, if they exist, check if the interface			// Attempt to open all the plugins and, if they exist, check if the interface
	// is correct and if they are supporting any devices.			// is correct and if they are supporting any devices.
	for (auto *Name : RTLNames) {			for (auto &[Name, IsHost] : Plugins) {
	DP("Loading library '%s'...\n", Name);			DP("Loading library '%s'...\n", Name.c_str());
	void *dynlib_handle = dlopen(Name, RTLD_NOW);
				int Flags = RTLD_NOW;

				if (Name.compare("libomptarget.rtl.vgpu.so") == 0)
				Flags |= RTLD_GLOBAL;

				if (UseVGPU && IsHost) {
				DP("Skipping library '%s': VGPU was requested.\n", Name.c_str());
				jdoerfert (Unsubmitted, Done): Not only x86, also let's not do strcmp. Extend RTLNames to be an array of structs with more elaborate information, e.g., an is-host flag. That said, unsure if not loading the plugin is the right way to not grab the image. Good enough for now.
				continue;
				}

				void *dynlib_handle = dlopen(Name.c_str(), Flags);

	if (!dynlib_handle) {			if (!dynlib_handle) {
	// Library does not exist or cannot be found.			// Library does not exist or cannot be found.
	DP("Unable to load library '%s': %s!\n", Name, dlerror());			DP("Unable to load library '%s': %s!\n", Name.c_str(), dlerror());
	continue;			continue;
	}			}

	DP("Successfully loaded library '%s'!\n", Name);			DP("Successfully loaded library '%s'!\n", Name.c_str());

	AllRTLs.emplace_back();			AllRTLs.emplace_back();

	// Retrieve the RTL information from the runtime library.			// Retrieve the RTL information from the runtime library.
	RTLInfoTy &R = AllRTLs.back();			RTLInfoTy &R = AllRTLs.back();

	bool ValidPlugin = true;			bool ValidPlugin = true;

	▲ Show 20 Lines • Show All 393 Lines • Show Last 20 Lines
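
Reviewer note (illustrative): the change above replaces the string comparison with a per-plugin `IsHost` flag and skips host plugins when the VGPU is requested. The sketch below reproduces only that selection logic on a shortened table. `PluginInfo` and `pluginsToLoad` are hypothetical names, and the actual `dlopen` call is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, shortened version of the plugin table from the patch.
struct PluginInfo {
  const char *Name;
  bool IsHost;
};

static const PluginInfo Plugins[] = {
    {"libomptarget.rtl.x86_64.so", true},
    {"libomptarget.rtl.cuda.so", false},
    {"libomptarget.rtl.vgpu.so", false}};

// Returns the plugin names that would be handed to dlopen.
// Host plugins are skipped when the VGPU was requested.
std::vector<std::string> pluginsToLoad(bool UseVGPU) {
  std::vector<std::string> Result;
  for (const PluginInfo &P : Plugins) {
    assert(P.Name && "plugin entry needs a name");
    if (UseVGPU && P.IsHost)
      continue;
    Result.push_back(P.Name);
  }
  return Result;
}
```

In the patch, `UseVGPU` is set whenever `LIBOMPTARGET_USE_VGPU` is present in the environment, regardless of its value.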

openmp/libomptarget/test/CMakeLists.txt

	Show All 12 Lines
	endif()			endif()

	# Replace the space from user's input with ";" in case that CMake add escape			# Replace the space from user's input with ";" in case that CMake add escape
	# char into the lit command.			# char into the lit command.
	string(REPLACE " " ";" LIBOMPTARGET_LIT_ARG_LIST "${LIBOMPTARGET_LIT_ARGS}")			string(REPLACE " " ";" LIBOMPTARGET_LIT_ARG_LIST "${LIBOMPTARGET_LIT_ARGS}")

	string(REGEX MATCHALL "([^\ ]+\ |[^\ ]+$)" SYSTEM_TARGETS "${LIBOMPTARGET_SYSTEM_TARGETS}")			string(REGEX MATCHALL "([^\ ]+\ |[^\ ]+$)" SYSTEM_TARGETS "${LIBOMPTARGET_SYSTEM_TARGETS}")
	foreach(CURRENT_TARGET IN LISTS SYSTEM_TARGETS)			foreach(CURRENT_TARGET IN LISTS SYSTEM_TARGETS)
				IF ("${CURRENT_TARGET}" MATCHES "-vgpu")
				continue()
				ENDIF()
				jdoerfert (Unsubmitted, Done): This is to disable the tests? Not sure this is a good way though. For one, can we check against -vgpu not x86, also openmp-vgpu or something, right?
				atmnpatel (Author, Unsubmitted, Done): Yep
	string(STRIP "${CURRENT_TARGET}" CURRENT_TARGET)			string(STRIP "${CURRENT_TARGET}" CURRENT_TARGET)
	add_openmp_testsuite(check-libomptarget-${CURRENT_TARGET}			add_openmp_testsuite(check-libomptarget-${CURRENT_TARGET}
	"Running libomptarget tests"			"Running libomptarget tests"
	${CMAKE_CURRENT_BINARY_DIR}/${CURRENT_TARGET}			${CMAKE_CURRENT_BINARY_DIR}/${CURRENT_TARGET}
	DEPENDS omptarget omp ${LIBOMPTARGET_TESTED_PLUGINS}			DEPENDS omptarget omp ${LIBOMPTARGET_TESTED_PLUGINS}
	ARGS ${LIBOMPTARGET_LIT_ARG_LIST})			ARGS ${LIBOMPTARGET_LIT_ARG_LIST})
	list(APPEND LIBOMPTARGET_LIT_TESTSUITES ${CMAKE_CURRENT_BINARY_DIR}/${CURRENT_TARGET})			list(APPEND LIBOMPTARGET_LIT_TESTSUITES ${CMAKE_CURRENT_BINARY_DIR}/${CURRENT_TARGET})

	Show All 12 Lines

Revision Contents

Diff 405407

clang/lib/Basic/TargetInfo.cpp

clang/lib/Basic/Targets/X86.h

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp

clang/lib/CodeGen/CodeGenModule.cpp

clang/lib/Driver/ToolChains/Gnu.cpp

clang/lib/Frontend/CompilerInvocation.cpp

llvm/include/llvm/ADT/Triple.h

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h

llvm/lib/Support/Triple.cpp

openmp/CMakeLists.txt

openmp/libomptarget/DeviceRTL/CMakeLists.txt

openmp/libomptarget/DeviceRTL/include/ThreadEnvironment.h

openmp/libomptarget/DeviceRTL/src/Debug.cpp

openmp/libomptarget/DeviceRTL/src/Mapping.cpp

openmp/libomptarget/DeviceRTL/src/Misc.cpp

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

openmp/libomptarget/DeviceRTL/src/Utils.cpp

openmp/libomptarget/plugins/CMakeLists.txt

openmp/libomptarget/plugins/vgpu/CMakeLists.txt

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.h

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.cpp

openmp/libomptarget/plugins/vgpu/src/rtl.cpp

openmp/libomptarget/src/rtl.cpp

openmp/libomptarget/test/CMakeLists.txt
