This is an archive of the discontinued LLVM Phabricator instance.


Differential D70010

[OpenMP][Offloading] Replaced default stream with an actual per-device unblocking stream in NVPTX implementation
Abandoned · Public

Authored by tianshilei1992 on Nov 8 2019, 7:53 AM.


Details

Reviewers

jdoerfert
hfinkel
ABataev
grokos
JonChesterfield

Summary

In this patch, the default stream is replaced with an actual per-device stream for better performance, since the default stream comes with plenty of constraints according to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf. Later I'll enable multiple streams to improve concurrency.

Diff Detail

Repository: rL LLVM

Event Timeline

tianshilei1992 created this revision. Nov 8 2019, 7:53 AM

Herald added a project: Restricted Project. Nov 8 2019, 7:53 AM

Herald added subscribers: llvm-commits, guansong.

tianshilei1992 retitled this revision from [OpenMP][Offlloading] Replaced default stream with an actual per-device unblocking stream in NVPTX implementation to [OpenMP][Offloading] Replaced default stream with an actual per-device unblocking stream in NVPTX implementation. Nov 8 2019, 7:53 AM

ABataev added a subscriber: ABataev. Nov 8 2019, 8:02 AM

ABataev added inline comments.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
260	1. Use real type instead of `auto`. 2. Variables must start with an upper-case letter.

tianshilei1992 marked an inline comment as done. Nov 8 2019, 8:17 AM

tianshilei1992 added inline comments.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
260	Sure, but I wrote it this way just to stay aligned with the existing code. Do I need to update the existing code as well?

Also, the main question: how does it affect the existing execution model? What if we have a target region in a parallel region, will they be executed asynchronously? We need some tests for this if we don't have such tests.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
260	In a separate patch, please

tianshilei1992 marked an inline comment as not done. Nov 8 2019, 8:37 AM

In D70010#1738930, @ABataev wrote:

Also, the main question: how does it affect the existing execution model? What if we have a target region in a parallel region, will they be executed asynchronously? We need some tests for this if we don't have such tests.

According to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf, a non-default stream can improve performance. This is actually the first step toward using multiple streams, which I'm going to implement later.

In D70010#1739049, @tianshilei1992 wrote:

According to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf, a non-default stream can improve performance. This is actually the first step toward using multiple streams, which I'm going to implement later.

My question is different. Does it affect execution of the existing code anyhow?

In D70010#1739061, @ABataev wrote:

My question is different. Does it affect execution of the existing code anyhow?

AFAIK, no. Currently we still have only one stream for each device; it's just not the default stream. Kernels in a stream are executed in order. Asynchronous execution requires multiple streams. I'll check whether existing test cases cover it, and will write one if not.
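
The ordering argument in this reply can be illustrated with a toy model. The sketch below is plain C++ and deliberately does not use the CUDA API; `ToyStream`, `step`, and the round-robin scheduling loop are hypothetical names invented for this illustration. It assumes only what the comment states: work queued on one stream runs in the order it was queued, and any overlap needs more than one stream.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy, single-threaded model of an in-order work queue. Hypothetical: this is
// not the CUDA API, and real streams execute asynchronously on the device.
class ToyStream {
public:
  // Queue a task without running it, like an asynchronous launch.
  void enqueue(std::function<void()> Task) {
    Pending.push_back(std::move(Task));
  }

  // Run the oldest pending task, if any. Returns false if the queue is empty.
  bool step() {
    if (Pending.empty())
      return false;
    std::function<void()> Task = std::move(Pending.front());
    Pending.pop_front();
    Task();
    return true;
  }

  // Drain the queue in FIFO order, like waiting for the stream to finish.
  void synchronize() {
    while (step())
      ;
  }

private:
  std::deque<std::function<void()>> Pending;
};

// One stream: tasks always complete in the order they were enqueued.
std::vector<std::string> runOnOneStream() {
  std::vector<std::string> Log;
  ToyStream Stream;
  Stream.enqueue([&Log] { Log.push_back("A"); });
  Stream.enqueue([&Log] { Log.push_back("B"); });
  Stream.enqueue([&Log] { Log.push_back("C"); });
  Stream.synchronize();
  return Log;
}

// Two streams: each stays in order internally, but the scheduler (here a
// simple round-robin) may interleave work from different streams.
std::vector<std::string> runOnTwoStreams() {
  std::vector<std::string> Log;
  ToyStream First, Second;
  First.enqueue([&Log] { Log.push_back("A1"); });
  First.enqueue([&Log] { Log.push_back("A2"); });
  Second.enqueue([&Log] { Log.push_back("B1"); });
  Second.enqueue([&Log] { Log.push_back("B2"); });
  bool Progress = true;
  while (Progress) {
    bool RanFirst = First.step();
    bool RanSecond = Second.step();
    Progress = RanFirst || RanSecond;
  }
  assert(Log.size() == 4 && "every queued task must have run exactly once");
  return Log;
}
```

In this model `runOnOneStream()` can only ever produce one ordering, which matches the claim that moving from the default stream to a single explicit stream does not reorder existing work. `runOnTwoStreams()` shows where interleaving would first become possible once multiple streams are introduced.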

Can you please write a commit message explaining the change and the plan?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
260	You can go either way, but I would just keep it like the surrounding code for now. Making the code blend in is arguably a good thing, and we have not decided how we handle the plugins in the short term (wrt. coding standard). @JonChesterfield @grokos What do you think? Should we run once over the plugin and adjust the coding style now, or keep it consistent for the time being?

grokos added inline comments. Nov 8 2019, 1:04 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
260	I would say let's keep it consistent. Later on, we can adjust the code style for the whole library but until then I prefer consistency over mixed styles.

In D70010#1739316, @jdoerfert wrote:

Can you please write a commit message explaining the change and the plan?

No problem. Will do.

tianshilei1992 edited the summary of this revision. Nov 8 2019, 1:46 PM

This one can be abandoned now...

Herald added a subscriber: yaxunl. Apr 6 2020, 5:24 PM

Superseded by D74145. You can abandon this one.

In D70010#1966048, @grokos wrote:

Superseded by D74145. You can abandon this one.

I didn't see the option?

tianshilei1992 abandoned this revision. Apr 7 2020, 7:17 AM

Revision Contents

Path	Size
openmp/libomptarget/plugins/cuda/src/rtl.cpp	31 lines

Diff 228459

openmp/libomptarget/plugins/cuda/src/rtl.cpp

@@ 89 lines not shown @@
 /// Class containing all the device information.
 class RTLDeviceInfoTy {
   std::vector<std::list<FuncOrGblEntryTy>> FuncGblEntries;

 public:
   int NumberOfDevices;
   std::vector<CUmodule> Modules;
   std::vector<CUcontext> Contexts;
+  std::vector<CUstream> Streams;

   // Device properties
   std::vector<int> ThreadsPerBlock;
   std::vector<int> BlocksPerGrid;
   std::vector<int> WarpSize;

   // OpenMP properties
   std::vector<int> NumTeams;
@@ 94 lines not shown @@ #endif // OMPTARGET_DEBUG

     if (NumberOfDevices == 0) {
       DP("There are no devices supporting CUDA.\n");
       return;
     }

     FuncGblEntries.resize(NumberOfDevices);
     Contexts.resize(NumberOfDevices);
+    Streams.resize(NumberOfDevices);
     ThreadsPerBlock.resize(NumberOfDevices);
     BlocksPerGrid.resize(NumberOfDevices);
     WarpSize.resize(NumberOfDevices);
     NumTeams.resize(NumberOfDevices);
     NumThreads.resize(NumberOfDevices);

     // Get environment variables regarding teams
     char *envStr = getenv("OMP_TEAM_LIMIT");
@@ 32 lines not shown @@ ~RTLDeviceInfoTy() {
     for (auto &ctx : Contexts)
       if (ctx) {
         CUresult err = cuCtxDestroy(ctx);
         if (err != CUDA_SUCCESS) {
           DP("Error when destroying CUDA context\n");
           CUDA_ERR_STRING(err);
         }
       }
+
+    // Destroy streams
+    for (auto &stream : Streams)
+      if (stream) {
+        CUresult err = cuStreamDestroy(stream);
+        if (err != CUDA_SUCCESS) {
+          DP("Error when destroying CUDA stream\n");
+          CUDA_ERR_STRING(err);
+        }
+      }
   }
 };

 static RTLDeviceInfoTy DeviceInfo;

 #ifdef __cplusplus
 extern "C" {
 #endif
@@ 25 lines not shown @@ int32_t __tgt_rtl_init_device(int32_t device_id) {
   err = cuCtxCreate(&DeviceInfo.Contexts[device_id], CU_CTX_SCHED_BLOCKING_SYNC,
                     cuDevice);
   if (err != CUDA_SUCCESS) {
     DP("Error when creating a CUDA context\n");
     CUDA_ERR_STRING(err);
     return OFFLOAD_FAIL;
   }

+  // Set current context for later creating corresponding stream
+  err = cuCtxSetCurrent(DeviceInfo.Contexts[device_id]);
+  if (err != CUDA_SUCCESS) {
+    DP("Error when setting current CUDA context\n");
+    CUDA_ERR_STRING(err);
+    return OFFLOAD_FAIL;
+  }
+
+  //Create a stream for each device
+  err = cuStreamCreate(&DeviceInfo.Streams[device_id], CU_STREAM_NON_BLOCKING);
+  if (err != CUDA_SUCCESS) {
+    DP("Error when creating CUDA stream\n");
+    CUDA_ERR_STRING(err);
+    return OFFLOAD_FAIL;
+  }
+
   // Query attributes to determine number of threads/block and blocks/grid.
   int maxGridDimX;
   err = cuDeviceGetAttribute(&maxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,
                              cuDevice);
   if (err != CUDA_SUCCESS) {
     DP("Error getting max grid dimension, use default\n");
     DeviceInfo.BlocksPerGrid[device_id] = RTLDeviceInfoTy::DefaultNumTeams;
   } else if (maxGridDimX <= RTLDeviceInfoTy::HardTeamLimit) {
@@ 446 lines not shown @@ if (team_num <= 0) {
     DP("Using requested number of teams %d\n", team_num);
   }

   // Run on the device.
   DP("Launch kernel with %d blocks and %d threads\n", cudaBlocksPerGrid,
      cudaThreadsPerBlock);

   err = cuLaunchKernel(KernelInfo->Func, cudaBlocksPerGrid, 1, 1,
-      cudaThreadsPerBlock, 1, 1, 0 /*bytes of shared memory*/, 0, &args[0], 0);
+      cudaThreadsPerBlock, 1, 1, 0 /*bytes of shared memory*/,
+      DeviceInfo.Streams[device_id], &args[0], 0);
   if (err != CUDA_SUCCESS) {
     DP("Device kernel launch failed!\n");
     CUDA_ERR_STRING(err);
     return OFFLOAD_FAIL;
   }

   DP("Launch of entry point at " DPxMOD " successful!\n",
      DPxPTR(tgt_entry_ptr));
@@ 25 lines not shown @@