Currently we have a hard team limit, set to 65536. No matter whether the device can support more teams, or the user requests more, any value larger than that hard limit is clamped to it before the kernel is launched. That is far below the actual hardware limit. For example, my workstation has a GTX2080, and its hardware limit on the grid size is 2147483647, which is exactly the largest number an int32_t can represent. No such limitation is mentioned in the spec. This patch simply removes it.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I don't think we should hardcode this at all. It seems to me like a device specific value, no? As such we should store it per device and query it in the beginning.
If this is always going to be 2^31-1, ok, but if there is nothing specifying that I'd say we make it dynamic. I can see little use in a static value anyway.
No. This value is the hard limit, not the per-device value. The per-device value is BlocksPerGrid, and it is set by querying the CUDA interface. After that query, the result is compared against this hard limit and capped accordingly. As a result, even if the device can support more blocks, the count is always capped at 65536.
Yes, we do. But after that query, we also cap the value:
// Query attributes to determine number of threads/block and blocks/grid.
int MaxGridDimX;
Err = cuDeviceGetAttribute(&MaxGridDimX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,
                           Device);
if (Err != CUDA_SUCCESS) {
  DP("Error getting max grid dimension, use default value %d\n",
     DeviceRTLTy::DefaultNumTeams);
  DeviceData[DeviceId].BlocksPerGrid = DeviceRTLTy::DefaultNumTeams;
} else if (MaxGridDimX <= DeviceRTLTy::HardTeamLimit) {
  DP("Using %d CUDA blocks per grid\n", MaxGridDimX);
  DeviceData[DeviceId].BlocksPerGrid = MaxGridDimX;
} else {
  DP("Max CUDA blocks per grid %d exceeds the hard team limit %d, capping "
     "at the hard limit\n",
     MaxGridDimX, DeviceRTLTy::HardTeamLimit);
  DeviceData[DeviceId].BlocksPerGrid = DeviceRTLTy::HardTeamLimit;
}
So actually, I think we don't even need this "hard" limit. There is no such limit in the spec, and I don't know why it was there in the first place.
Actually, the hard limit of 65536 can help performance in some cases. For example, for the BabelStream benchmark, if we don't cap the team number, the launch uses 262144 blocks. After capping to 65536, the performance improved a lot.
Capping to 65536:

  Time(%)      Time  Calls       Avg       Min       Max  Name
   21.96%  361.91ms    100  3.6191ms  3.4761ms  4.2035ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3dotEv_l229
   12.50%  205.93ms    100  2.0593ms  2.0200ms  2.0720ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE5triadEv_l180
   12.40%  204.35ms    100  2.0435ms  2.0084ms  2.0561ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3addEv_l155
    8.57%  141.31ms    100  1.4131ms  1.3905ms  1.4901ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3mulEv_l132
    8.53%  140.61ms    100  1.4061ms  1.3885ms  1.4647ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE4copyEv_l108
    0.20%  3.2532ms      1  3.2532ms  3.2532ms  3.2532ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE11init_arraysEddd_l62

Not capping, grid size 262144:

  Time(%)      Time  Calls       Avg       Min       Max  Name
  GPU activities:
   34.48%  682.98ms    100  6.8298ms  6.6153ms  7.8655ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3dotEv_l229
   10.30%  204.15ms    100  2.0415ms  2.0165ms  2.0479ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE5triadEv_l180
   10.26%  203.31ms    100  2.0331ms  2.0084ms  2.0385ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3addEv_l155
    7.51%  148.83ms    100  1.4883ms  1.4327ms  1.7717ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE3mulEv_l132
    7.46%  147.83ms    100  1.4783ms  1.4251ms  1.7499ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE4copyEv_l108
    0.15%  3.0440ms      1  3.0440ms  3.0440ms  3.0440ms  __omp_offloading_fd02_c612a6__ZN9OMPStreamIdE11init_arraysEddd_l62
However, the capping doesn't work without D119311.
LG. Can we improve our heuristic next, e.g., pick a number that is often reasonable, without limiting the user if they picked one?