Currently we have a hard team limit, set to 65536. No matter whether the device can support more teams, or the user requests more, any value larger than that hard limit is clamped to it before the kernel is launched. That is far below the actual hardware limit. For example, my workstation has a GTX2080, and its hardware limit on the grid size is 2147483647, which is exactly the largest number an int32_t can represent. No such limitation is mentioned in the spec. This patch simply removes it.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I don't think we should hardcode this at all. It seems to me like a device specific value, no? As such we should store it per device and query it in the beginning.
If this is always going to be 2^31-1, ok, but if there is nothing specifying that I'd say we make it dynamic. I can see little use in a static value anyway.
No. This value is the hard limit, not the per-device value. The per-device value is BlocksPerGrid, and it is set by querying the CUDA interface. After that query, the result is compared against this hard limit and capped accordingly. As a result, even if the device can support more blocks, the count is always capped at 65536.
Yes, we do. But after that query, we also cap the value:
// Query attributes to determine number of threads/block and blocks/grid.
int MaxGridDimX;
Err = cuDeviceGetAttribute(&MaxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,
                           Device);
if (Err != CUDA_SUCCESS) {
  DP("Error getting max grid dimension, use default value %d\n",
     DeviceRTLTy::DefaultNumTeams);
  DeviceData[DeviceId].BlocksPerGrid = DeviceRTLTy::DefaultNumTeams;
} else if (MaxGridDimX <= DeviceRTLTy::HardTeamLimit) {
  DP("Using %d CUDA blocks per grid\n", MaxGridDimX);
  DeviceData[DeviceId].BlocksPerGrid = MaxGridDimX;
} else {
  DP("Max CUDA blocks per grid %d exceeds the hard team limit %d, capping "
     "at the hard limit\n",
     MaxGridDimX, DeviceRTLTy::HardTeamLimit);
  DeviceData[DeviceId].BlocksPerGrid = DeviceRTLTy::HardTeamLimit;
}
So actually, I think we don't even need this "hard" limit. There is no such limit in the spec, and I don't know why it was there in the first place.
Actually, the hard limit of 65536 can help performance in some cases. For example, for the BabelStream benchmark, if we don't cap the team number, the launch uses 262144 blocks. After capping to 65536, the performance improved a lot.
Capping to 65536:

  Time(%)      Time  Calls       Avg       Min       Max  Name
   21.96%  361.91ms    100  3.6191ms  3.4761ms  4.2035ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3dotEv_l229
   12.50%  205.93ms    100  2.0593ms  2.0200ms  2.0720ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE5triadEv_l180
   12.40%  204.35ms    100  2.0435ms  2.0084ms  2.0561ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3addEv_l155
    8.57%  141.31ms    100  1.4131ms  1.3905ms  1.4901ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3mulEv_l132
    8.53%  140.61ms    100  1.4061ms  1.3885ms  1.4647ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE4copyEv_l108
    0.20%  3.2532ms      1  3.2532ms  3.2532ms  3.2532ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE11init_arraysEddd_l62

Not capping, grid size 262144:

  Time(%)      Time  Calls       Avg       Min       Max  Name
  GPU activities:
   34.48%  682.98ms    100  6.8298ms  6.6153ms  7.8655ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3dotEv_l229
   10.30%  204.15ms    100  2.0415ms  2.0165ms  2.0479ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE5triadEv_l180
   10.26%  203.31ms    100  2.0331ms  2.0084ms  2.0385ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3addEv_l155
    7.51%  148.83ms    100  1.4883ms  1.4327ms  1.7717ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3mulEv_l132
    7.46%  147.83ms    100  1.4783ms  1.4251ms  1.7499ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE4copyEv_l108
    0.15%  3.0440ms      1  3.0440ms  3.0440ms  3.0440ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE11init_arraysEddd_l62
However, the capping doesn't work without D119311.
LG. Can we improve our heuristic next, e.g., pick a number that is often reasonable, without limiting the user if they picked one?