This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][CUDA] Remove the hard team limit
ClosedPublic

Authored by tianshilei1992 on Feb 8 2022, 10:11 PM.

Details

Summary

Currently we have a hard team limit, set to 65536. No matter whether the device can support more teams or the user requests more, as soon as the number exceeds that hard limit, the kernel is always launched with the hard limit. That is far less than the actual hardware limit. For example, my workstation has a GTX2080, and its hardware limit on grid size is 2147483647, which is exactly the largest number an int32_t can represent. No such limit is mentioned in the spec, so this patch simply removes it.
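For reference, the hardware limit mentioned above can be queried directly through the CUDA driver API. A minimal standalone sketch (error handling trimmed, assumes a single visible device):

#include <cstdio>
#include <cuda.h>

int main() {
  // Initialize the driver API and grab the first device.
  cuInit(0);
  CUdevice Device;
  cuDeviceGet(&Device, 0);

  // CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X is the maximum grid dimension in X,
  // i.e. the hardware cap on the number of blocks (teams) per kernel launch.
  int MaxGridDimX;
  cuDeviceGetAttribute(&MaxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, Device);

  printf("Max grid dim X: %d\n", MaxGridDimX);
  return 0;
}

(Link against libcuda, e.g. g++ query.cpp -lcuda; per the summary above, on the author's GTX2080 this reports 2147483647.)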

Diff Detail

Event Timeline

tianshilei1992 created this revision. · Feb 8 2022, 10:11 PM
tianshilei1992 requested review of this revision. · Feb 8 2022, 10:11 PM
Herald added a project: Restricted Project. · View Herald Transcript · Feb 8 2022, 10:11 PM

I don't think we should hardcode this at all. It seems to me like a device-specific value, no? As such we should store it per device and query it at startup.
If this is always going to be 2^31-1, OK, but if nothing specifies that, I'd say we make it dynamic. I can see little use in a static value anyway.

No. This value is the hard limit, not the per-device value. The per-device value is BlocksPerGrid, and it is set by calling the CUDA interface. After that query, the runtime compares the result with this hard limit and caps it accordingly. As a result, even if the device can support more blocks, the value is always capped to 65536.

We ask the device how many blocks/teams it supports, do we not?

Yes, we do. But after that query, we also cap the value:

// Query attributes to determine number of threads/block and blocks/grid.
int MaxGridDimX;
Err = cuDeviceGetAttribute(&MaxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,
                           Device);
if (Err != CUDA_SUCCESS) {
  DP("Error getting max grid dimension, use default value %d\n",
     DeviceRTLTy::DefaultNumTeams);
  DeviceData[DeviceId].BlocksPerGrid = DeviceRTLTy::DefaultNumTeams;
} else if (MaxGridDimX <= DeviceRTLTy::HardTeamLimit) {
  DP("Using %d CUDA blocks per grid\n", MaxGridDimX);
  DeviceData[DeviceId].BlocksPerGrid = MaxGridDimX;
} else {
  DP("Max CUDA blocks per grid %d exceeds the hard team limit %d, capping "
     "at the hard limit\n",
     MaxGridDimX, DeviceRTLTy::HardTeamLimit);
  DeviceData[DeviceId].BlocksPerGrid = DeviceRTLTy::HardTeamLimit;
}

So actually, I think we don't even need this "hard" limit. There is no such limit in the spec. I don't know why it was there in the first place.
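For illustration, a hedged sketch of what the snippet above could reduce to once the HardTeamLimit cap is dropped (same names as in the plugin code quoted above; the actual committed change may differ in detail):

// Query attributes to determine number of blocks per grid.
int MaxGridDimX;
Err = cuDeviceGetAttribute(&MaxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,
                           Device);
if (Err != CUDA_SUCCESS) {
  DP("Error getting max grid dimension, use default value %d\n",
     DeviceRTLTy::DefaultNumTeams);
  DeviceData[DeviceId].BlocksPerGrid = DeviceRTLTy::DefaultNumTeams;
} else {
  // No more comparison against a hard-coded HardTeamLimit: trust whatever
  // the device reports as its maximum grid dimension.
  DP("Using %d CUDA blocks per grid\n", MaxGridDimX);
  DeviceData[DeviceId].BlocksPerGrid = MaxGridDimX;
}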

tianshilei1992 added a comment. Edited · Feb 10 2022, 10:55 AM

Actually, the hard limit of 65536 can help with performance in some cases. For example, for the BabelStream benchmark, if we don't cap the team number, the kernels launch with 262144 blocks. After capping to 65536, performance improves a lot (the dot kernel drops from about 6.8 ms to about 3.6 ms per call; see the profiles below).

Capping to 65536:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
                   21.96%  361.91ms       100  3.6191ms  3.4761ms  4.2035ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3dotEv_l229
                   12.50%  205.93ms       100  2.0593ms  2.0200ms  2.0720ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE5triadEv_l180
                   12.40%  204.35ms       100  2.0435ms  2.0084ms  2.0561ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3addEv_l155
                    8.57%  141.31ms       100  1.4131ms  1.3905ms  1.4901ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3mulEv_l132
                    8.53%  140.61ms       100  1.4061ms  1.3885ms  1.4647ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE4copyEv_l108
                    0.20%  3.2532ms         1  3.2532ms  3.2532ms  3.2532ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE11init_arraysEddd_l62

Not capping, grid size 262144:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   34.48%  682.98ms       100  6.8298ms  6.6153ms  7.8655ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3dotEv_l229
                   10.30%  204.15ms       100  2.0415ms  2.0165ms  2.0479ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE5triadEv_l180
                   10.26%  203.31ms       100  2.0331ms  2.0084ms  2.0385ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3addEv_l155
                    7.51%  148.83ms       100  1.4883ms  1.4327ms  1.7717ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3mulEv_l132
                    7.46%  147.83ms       100  1.4783ms  1.4251ms  1.7499ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE4copyEv_l108
                    0.15%  3.0440ms         1  3.0440ms  3.0440ms  3.0440ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE11init_arraysEddd_l62

However, the capping doesn't work w/o D119311.
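(For context: even without a hard limit in the plugin, a user can still bound the team count explicitly, e.g. via the num_teams clause or the OMP_NUM_TEAMS environment variable. A minimal hedged example, unrelated to BabelStream itself, that requests at most 65536 teams and counts how many were actually launched:)

#include <cstdio>

int main() {
  int Count = 0;
  // Explicitly request at most 65536 teams instead of relying on a
  // hard-coded limit inside the plugin; OMP_NUM_TEAMS=65536 in the
  // environment has a similar effect without touching the source.
  #pragma omp target teams num_teams(65536) map(tofrom: Count) reduction(+ : Count)
  Count += 1;
  // Count now holds the number of teams the runtime actually created.
  printf("Teams launched: %d\n", Count);
  return 0;
}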

remove the hard limit

tianshilei1992 retitled this revision from [OpenMP][CUDA] Set the hard team limit to 2^31-1 to [OpenMP][CUDA] Remove the hard team limit. · Feb 10 2022, 11:42 AM
tianshilei1992 edited the summary of this revision.
jdoerfert accepted this revision. · Feb 10 2022, 11:57 AM

LG. Can we improve our heuristic next, e.g., pick a number that is often reasonable without limiting the user if they picked one?
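(Not part of this patch, but as a purely hypothetical illustration of such a heuristic: respect an explicit user choice, otherwise derive a default from the device's SM count rather than from a fixed constant. The helper name and the per-SM factor below are made up for the sketch.)

#include <algorithm>

// Hypothetical helper, not from the actual plugin: pick a default team count.
// UserNumTeams would come from num_teams()/OMP_NUM_TEAMS, MaxGridDimX from
// CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, and NumSMs from
// CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
static int chooseNumTeams(int UserNumTeams, int MaxGridDimX, int NumSMs) {
  if (UserNumTeams > 0) // The user picked a value: never limit it artificially.
    return std::min(UserNumTeams, MaxGridDimX);
  int Default = NumSMs * 32; // e.g. a few dozen blocks per SM as a default
  return std::min(Default, MaxGridDimX);
}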

This revision is now accepted and ready to land. · Feb 10 2022, 11:57 AM
This revision was automatically updated to reflect the committed changes.