This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] [CUDA plugin] Add support for teams reduction via scratchpad
Abandoned · Public

Authored by grokos on Apr 5 2018, 8:21 AM.

Details

Reviewers
Hahnfeld
ABataev
Summary

This patch adds support for teams reductions to the CUDA plugin. The number of variables to be reduced, as well as their sizes, is passed from the compiler to the plugin via a struct of kernel computation properties (which also includes the execution mode). Before a kernel is launched, the plugin allocates space for the scratchpad to be used for the reduction. A pointer to the allocated scratchpad is passed as the last parameter to the kernel at launch.
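For illustration, here is a minimal sketch of the launch path the summary describes, using the CUDA driver API the plugin is built on. The helper name, argument handling, and return values are assumptions for this example, not the patch's exact code:

#include <cstddef>
#include <vector>
#include <cuda.h>

// Hypothetical helper: allocate the reduction scratchpad before launch and
// append its address as the last kernel parameter (names are illustrative).
static int launchWithScratchpad(CUfunction Func, CUstream Stream,
                                std::vector<void *> Args,
                                size_t ScratchpadSize,
                                unsigned Teams, unsigned Threads) {
  CUdeviceptr Scratchpad = 0;
  if (cuMemAlloc(&Scratchpad, ScratchpadSize) != CUDA_SUCCESS)
    return 1; // corresponds to OFFLOAD_FAIL: no scratchpad, no kernel launch

  // The scratchpad pointer becomes the last kernel parameter.
  Args.push_back(&Scratchpad);

  CUresult Err = cuLaunchKernel(Func, Teams, 1, 1, Threads, 1, 1,
                                /*sharedMemBytes=*/0, Stream, Args.data(),
                                /*extra=*/nullptr);
  return Err == CUDA_SUCCESS ? 0 : 1;
}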

Diff Detail

Repository
rOMP OpenMP

Event Timeline

grokos created this revision. Apr 5 2018, 8:21 AM

Some comments inline, mostly minor things.

libomptarget/plugins/cuda/src/rtl.cpp
62–81

Shouldn't you be explicitly assigning values to this enum? Currently, it's not obvious which values they will hold.

(And I think the names should not be all upper case (except SPMD); only the first character should be capitalized...)
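As a sketch of what this comment asks for (the enumerator names and values below are assumptions for illustration, not the patch's actual code), the enum could look like:

// Hypothetical execution-mode enum with explicit values and only the first
// character capitalized (SPMD kept all upper case).
enum KernelExecutionMode {
  SPMD    = 0, // kernel runs in SPMD mode
  Generic = 1, // kernel runs in generic mode
  None    = 2, // no execution mode recorded
};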

496–498

You should be using SPMD and >= None here.

638–641

I think this shouldn't be in this patch?

705–708

Maybe error out completely?

ABataev added inline comments. Apr 5 2018, 9:07 AM
libomptarget/plugins/cuda/src/rtl.cpp
89–90

Why do you need all that data before starting the outlined function? Can we allocate the memory during execution of the outlined function by some runtime function call?
Like this:

__omp_offloading....
<master>
  %Scratchpad = call i8* @__kmpc_allocate_scratchpad(<Size_of_the_reductions>)
  ....
  call void @__kmpc_deallocate_scratchpad(i8* %Scratchpad)
<end_master>
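A rough device-side sketch of what such runtime entries could look like; only the names come from the pseudocode above, the bodies are assumptions based on CUDA device-side malloc/free:

#include <cstddef>
#include <cstdlib>

// Hypothetical device-side implementations of the proposed entry points.
extern "C" __device__ void *__kmpc_allocate_scratchpad(size_t Size) {
  // Device-side malloc allocates from the fixed-size device heap
  // (cudaLimitMallocHeapSize, 8 MB by default).
  return malloc(Size);
}

extern "C" __device__ void __kmpc_deallocate_scratchpad(void *Scratchpad) {
  free(Scratchpad);
}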
grokos updated this revision to Diff 141184. Apr 5 2018, 11:05 AM
grokos marked 3 inline comments as done.
grokos added inline comments.
libomptarget/plugins/cuda/src/rtl.cpp
89–90

We can go down that route if you prefer. I haven't been able to find official documentation on which type of memory allocation is faster (cudaMalloc on the host vs. malloc on the device), so I assume they perform equally fast.

Any thoughts on that?

638–641

Removed.

705–708

Correct: if allocating the scratchpad fails, the kernel cannot be executed, so we'll return OFFLOAD_FAIL.

One caveat regarding Alexey's proposal: According to the CUDA programming guide, malloc on the device allocates space from a fixed-size heap. The default size of this heap is 8MB. If we run into a scenario where more than 8MB will be required for the reduction scratchpad, allocating the scratchpad from the device will fail. The heap size can be user-defined from the host, but for that to happen the host must know how large the scratchpad needs to be, which defeats the purpose of moving scratchpad allocation from the plugin to the nvptx runtime.

But you can change the limit using cudaThreadSetLimit

libomptarget/plugins/cuda/src/rtl.cpp
89–90

I'd prefer this solution rather than the original one.

That's what I'm saying. You can increase the limit, but how large will you set it? How will you know how many bytes are needed for the scratchpad if the compiler doesn't provide this information?

We are already using global memory allocation, so I don't see any reason why we can't use it for the scratchpad. We just need to set some initial amount that is big enough and, probably, add an option that allows increasing this size.
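A possible host-side realization of that suggestion, using cudaDeviceSetLimit (the non-deprecated equivalent of the cudaThreadSetLimit call mentioned above); the environment variable name and the 64 MB default are illustrative assumptions, not part of the patch:

#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical: pick a generous default device heap size and let the user
// override it, so device-side allocation of large scratchpads can succeed.
static void configureDeviceHeap() {
  size_t HeapBytes = 64 * 1024 * 1024; // arbitrary "big enough" default
  if (const char *Env = std::getenv("LIBOMPTARGET_HEAP_SIZE"))
    HeapBytes = std::strtoull(Env, nullptr, 10);
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, HeapBytes);
}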

I think reductions are already implemented differently; can we close this?

grokos abandoned this revision. Jul 9 2019, 3:43 PM

Right, this patch is now obsolete.