Infers block/grid dimensions/indices or ranges of such dimensions/indices.
Details
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
Two initial comments:
- Do we actually know that the number of blocks/threads is bounded above by 2^31? For one thing, the HIP (and presumably CUDA) launch APIs take uint32_t and so the bound could be 2^32
- Could you find a way to propagate these into (outlined) kernels, so that we get bounds on gpu.block_id and friends?
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp:695 (On Diff #441873): The ranges already specify a <= %v <= b, so there should be no need to subtract 1?
Regarding the upper bound: CUDA's gridDim.x limit is 2^31 - 1, and everything else is well below that. For ROCm, the limits are 2^16 - 1 and below.
> Could you find a way to propagate these into (outlined) kernels, so that we get bounds on gpu.block_id and friends?
It should be possible to infer gpu.block/thread_dim/id int ranges, but it would require combining ranges of the gpu.launch_func ops referencing the parent kernel.
IIUC, this would require building a symbol table for every op's inferResultRanges() call. Compared to that, inferring gpu.launch region arg int ranges is much simpler.
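For illustration, here is roughly what the gpu.launch inference boils down to. This is a simplified sketch, not the patch's code verbatim: the accessor names (getGridSize/getBlockSize/getBlockIds/getThreadIds returning the region arguments), the kMaxDim constant, and the assumption that the six size operands come first are all mine.

```cpp
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Interfaces/InferIntRangeInterface.h"
#include <algorithm>
#include <limits>

using namespace mlir;

// Conservative cap on any single dimension; see the bounds discussion below.
static constexpr uint64_t kMaxDim = std::numeric_limits<uint32_t>::max();

static ConstantIntRanges getIndexRange(uint64_t umin, uint64_t umax) {
  unsigned width = IndexType::kInternalStorageBitWidth;
  return ConstantIntRanges::fromUnsigned(APInt(width, umin),
                                         APInt(width, umax));
}

void gpu::LaunchOp::inferResultRanges(ArrayRef<ConstantIntRanges> argRanges,
                                      SetIntRangeFn setResultRanges) {
  // Each grid/block-size operand feeds two region arguments: the *_dim
  // argument takes the operand's range (clamped to [1, kMaxDim]) and the
  // *_id argument is zero-based and strictly below the dim.
  auto setRange = [&](const ConstantIntRanges &sizeRange, Value dimArg,
                      Value idArg) {
    uint64_t dimMin = std::max<uint64_t>(sizeRange.umin().getZExtValue(), 1);
    uint64_t dimMax =
        std::max(dimMin, std::min(sizeRange.umax().getZExtValue(), kMaxDim));
    setResultRanges(dimArg, getIndexRange(dimMin, dimMax));
    setResultRanges(idArg, getIndexRange(0, dimMax - 1));
  };
  // Assumes the grid-size operands come first, then the block-size operands.
  KernelDim3 gridDims = getGridSize(), blockIds = getBlockIds();
  setRange(argRanges[0], gridDims.x, blockIds.x);
  setRange(argRanges[1], gridDims.y, blockIds.y);
  setRange(argRanges[2], gridDims.z, blockIds.z);
  KernelDim3 blockDims = getBlockSize(), threadIds = getThreadIds();
  setRange(argRanges[3], blockDims.x, threadIds.x);
  setRange(argRanges[4], blockDims.y, threadIds.y);
  setRange(argRanges[5], blockDims.z, threadIds.z);
}
```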
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp:695 (On Diff #441873): threadIdx is in [0, blockDim-1] (inclusive), blockIdx is in [0, gridDim-1] (inclusive).
Thanks for digging up the sources on those bounds!
Would you be willing to go ahead and give the gpu.*_{dim,id} ops conservative bounds of {[1, 2^31], [0, 2^31 - 1]} in a future revision? (Or I can do it)
Could you add a test before we land this? It looks fine to me, but I'd like to make sure it doesn't break in the future
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp:695 (On Diff #441873): Ok, right, now I see what's going on here. That is, if I set a dimension to, say, 256, I'll get the dimension bounds being [256, 256] and the index bounds being [0, 255] like I'd expect.
Except they were not conservative enough. ;-)
The Vulkan device registry is the better source of what values are out in the wild, e.g.
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupCount[0]
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupSize[0]
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations
https://vulkan.gpuinfo.org/displaycoreproperty.php?name=subgroupProperties.subgroupSize
I adjusted the values accordingly.
> Would you be willing to go ahead and give the gpu.*_{dim,id} ops conservative bounds of {[1, 2^31], [0, 2^31 - 1]} in a future revision? (Or I can do it)
Done with the conservative bounds for now.
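Concretely, the standalone ops fall back to hard bounds along these lines (sketch only, reusing the getIndexRange/kMaxDim helpers from the snippet above; the exact constants follow the Vulkan-registry numbers rather than anything the hardware guarantees):

```cpp
// Conservative fallback ranges when nothing is known about the launch site.
void gpu::BlockDimOp::inferResultRanges(ArrayRef<ConstantIntRanges>,
                                        SetIntRangeFn setResultRanges) {
  // A dimension is at least 1 and at most kMaxDim.
  setResultRanges(getResult(), getIndexRange(1, kMaxDim));
}

void gpu::BlockIdOp::inferResultRanges(ArrayRef<ConstantIntRanges>,
                                       SetIntRangeFn setResultRanges) {
  // An id is zero-based and strictly smaller than the corresponding dim.
  setResultRanges(getResult(), getIndexRange(0, kMaxDim - 1));
}
```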
As a follow-up, how about adding an attribute next to gpu.kernel that encodes the launch bounds?
They can be used to provide ranges for gpu.block_id and friends, and can be lowered to, e.g.:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#performance-tuning-directives
If SCCP supports function calls (not sure if it does), it should be possible to also derive the launch bounds from all gpu.launch_func during analysis.
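To make that follow-up concrete, one possible shape, purely hypothetical (the "gpu.launch_bounds" attribute name and the helper below are assumptions, not part of this patch): the outlined kernel carries per-dimension upper bounds that the id/dim inference consults before falling back to the conservative defaults above.

```cpp
#include <optional>

// Hypothetical: read a per-dimension upper bound off the enclosing kernel and
// use it to tighten the range of gpu.block_id. The attribute name and the
// getDimension() accessor are assumed.
static std::optional<uint64_t> getAnnotatedUpperBound(Operation *op,
                                                      unsigned dim) {
  auto func = op->getParentOfType<gpu::GPUFuncOp>();
  if (!func)
    return std::nullopt;
  auto bounds = func->getAttrOfType<DenseI64ArrayAttr>("gpu.launch_bounds");
  if (!bounds || dim >= bounds.asArrayRef().size())
    return std::nullopt;
  return static_cast<uint64_t>(bounds.asArrayRef()[dim]);
}

void gpu::BlockIdOp::inferResultRanges(ArrayRef<ConstantIntRanges>,
                                       SetIntRangeFn setResultRanges) {
  uint64_t upper = kMaxDim;
  if (std::optional<uint64_t> bound = getAnnotatedUpperBound(
          getOperation(), static_cast<unsigned>(getDimension())))
    upper = *bound;
  setResultRanges(getResult(), getIndexRange(0, upper - 1));
}
```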
> Could you add a test before we land this? It looks fine to me, but I'd like to make sure it doesn't break in the future
Oops, sorry. Test added.
This sounds like a reasonable approach to me. Outlining seems the ideal place to add these. We could also add a verifier that ensures that the annotations coincide with the call-sites. For example, we could have every gpu.launch_func check whether its parameters fit the annotation.
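A sketch of what such a check could look like (hypothetical: it assumes the same "gpu.launch_bounds"-style annotation, assumes the getBlockSize* accessor names, and only inspects block sizes that are compile-time constants):

```cpp
#include "mlir/IR/Matchers.h"

// Hypothetical check: a gpu.launch_func with constant block sizes must stay
// within the kernel's annotated launch bounds.
static LogicalResult verifyLaunchFitsBounds(gpu::LaunchFuncOp launch,
                                            ArrayRef<int64_t> bounds) {
  Value blockSizes[3] = {launch.getBlockSizeX(), launch.getBlockSizeY(),
                         launch.getBlockSizeZ()};
  for (unsigned dim = 0; dim < 3 && dim < bounds.size(); ++dim) {
    APInt size;
    if (!matchPattern(blockSizes[dim], m_ConstantInt(&size)))
      continue; // Non-constant sizes can only be checked dynamically.
    if (size.ugt(static_cast<uint64_t>(bounds[dim])))
      return launch.emitOpError()
             << "block size in dimension " << dim
             << " exceeds the kernel's annotated launch bound " << bounds[dim];
  }
  return success();
}
```

Grid sizes could be checked the same way; whether this lives in the verifier or in a separate pass is the open question.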
> If SCCP supports function calls (not sure if it does), it should be possible to also derive the launch bounds from all gpu.launch_func during analysis.
I see kernel outlining as a relatively late step in processing and would expect that there is no further optimization after it. After all, the inline representation of gpu.launch is meant for optimization. The only use I could see is if we want to reuse kernels but then the optimization that attaches kernels to multiple call-sites could take care of fixing up annotations. This would only be legal if the new annotations are a superset, anyway, as we might have used the static knowledge in optimization earlier.
I agree that getting anything more than very conservative bounds in an outlined kernel will be tricky and so it might not be worth doing (right now).
One note: that bound of 128 on the workitem ID is definitely wrong: from what I can tell, those Vulkan numbers are counted in units of warps/waves/..., not actual workitems. See, for example, https://www.llvm.org/docs/AMDGPUUsage.html#llvm-ir-attributes (where the default upper bound on workgroup size is 1024) and that, downstream, we routinely use workgroups with 256 items in them without issue.
I'd also *prefer* that we stick with what the API has told us the constraints on maximum workgroup/workitem dimensions are so we don't get suddenly bit by optimizations breaking on future hardware.
Yes, but we can't really query the target hardware in the compiler. At most, it could be a pass option. My feeling is that the current conservative bounds are large enough that this won't break in the foreseeable future: kernel dispatch (cu/hipLaunchKernel, vkCmdDispatch) uses 32-bit parameters for block/grid dimensions.
> One note: that bound of 128 on the workitem ID is definitely wrong: from what I can tell, those Vulkan numbers are counted in units of warps/waves/..., not actual workitems. See, for example, https://www.llvm.org/docs/AMDGPUUsage.html#llvm-ir-attributes (where the default upper bound on workgroup size is 1024) and that, downstream, we routinely use workgroups with 256 items in them without issue.
128 is the limit for subgroup (Vulkan-speak for warps/wavefronts), not workgroup. I was quite surprised to see that Turnip Adreno seems to use a subgroup size of 128, and I don't think there is much value in going higher. But if you prefer, we could set this to uint32_max as well.
Ah, I missed that the one with 128 was laneId, not any of the other ones - my bad for not reading carefully.
Also, wrt the value of kMaxDim ... didn't we establish that std::numeric_limits<int32_t>::max would be enough, or did the Vulkan info contradict that?
Mali GPUs report 2^32-1 for max grid dim:
https://vulkan.gpuinfo.org/listreports.php?limit=maxComputeWorkGroupCount[0]
I doubt that the values for max block dim above Intel's 2240 limit are correct, but I didn't want to do too much interpretation:
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupSize[0]
This infers block/grid dimensions/indices from launch operands.
Infers block/grid dimensions/indices or ranges of such dimensions/indices?