Page MenuHomePhabricator

[mlir] Add InferIntRangeInterface to gpu.launch
ClosedPublic

Authored by csigg on Jul 2 2022, 1:30 AM.

Details

Summary

Infers block/grid dimensions/indices or ranges of such dimensions/indices.

Diff Detail

Event Timeline

csigg created this revision.Jul 2 2022, 1:30 AM
csigg requested review of this revision.Jul 2 2022, 1:30 AM

Two initial comments;

  • Do we actually know that the number of blocks/threads is bounded above by 2^31? For one thing, the HIP (and presumably CUDA) launch APIs take uint32_t and so the bound could be 2^32
  • Could you find a way to propagate these into (outlined) kernels, so that we get bounds on gpu.block_id and friends?
krzysz00 added inline comments.Jul 2 2022, 7:14 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
695 ↗(On Diff #441873)

The ranges already specify a <= %v <= b, so there should be no need to subtract 1?

csigg added a comment.Jul 2 2022, 8:40 AM
  • Do we actually know that the number of blocks/threads is bounded above by 2^31? For one thing, the HIP (and presumably CUDA) launch APIs take uint32_t and so the bound could be 2^32

CUDA gridDim.x's limit is 2^31-1, everything is well below.
For ROCm, they are 2^16-1 and below.

  • Could you find a way to propagate these into (outlined) kernels, so that we get bounds on gpu.block_id and friends?

It should be possible to infer gpu.block/thread_dim/id int ranges, but it would require combining ranges of the gpu.launch_func ops referencing the parent kernel.
IIUC, this would require building a symbol table for every op's inferResultRanges() call. Compared to that, inferring gpu.launch region arg int ranges is much simpler.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
695 ↗(On Diff #441873)

threadIdx is in [0, blockDim-1] (inclusive), blockIdx is in [0, gridDim-1] (inclusive).

Thanks for digging up the sources on those bounds!

Would you be willing to go ahead and give the gpu.*_{dim,id} ops conservative bounds of {[1, 2^31], [0, 2^31 - 1]} in a future revision? (Or I can do it)

Could you add a test before we land this? It looks fine to me, but I'd like to make sure it doesn't break in the future

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
695 ↗(On Diff #441873)

Ok, right, now I see what's going on here. That is, if I set a dimension to, say, 256, I'll get the dimension bounds being [256, 256] and the index bounds being [0, 255] like I'd expect.

csigg updated this revision to Diff 442013.Jul 3 2022, 11:12 PM

Add kernel ops bounds and test.

csigg added a comment.EditedJul 3 2022, 11:25 PM

Thanks for digging up the sources on those bounds!

Except they were not conservative enough. ;-)
The Vulkan device registry is the better source of what values are out in the wild, e.g.
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupCount[0]
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupSize[0]
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations
https://vulkan.gpuinfo.org/displaycoreproperty.php?name=subgroupProperties.subgroupSize
I adjusted the values accordingly.

Would you be willing to go ahead and give the gpu.*_{dim,id} ops conservative bounds of {[1, 2^31], [0, 2^31 - 1]} in a future revision? (Or I can do it)

Done with the conservative bounds for now.

As a follow-up, how about adding an attribute next to gpu.kernel that encodes the launch bounds?
They can be used to provide ranges to gpu.block_id etc and can be lowered to e.g.:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#performance-tuning-directives

If SCCP supports function calls (not sure if it does), it should be possible to also derive the launch bounds from all gpu.launch_func during analysis.

Could you add a test before we land this? It looks fine to me, but I'd like to make sure it doesn't break in the future

Oops, sorry. Test added.

herhut added a comment.Jul 4 2022, 5:44 AM

As a follow-up, how about adding an attribute next to gpu.kernel that encodes the launch bounds?
They can be used to provide ranges to gpu.block_id etc and can be lowered to e.g.:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#performance-tuning-directives

This sounds like a reasonable approach to me. Outlining seems the ideal place to add these. We could also add a verifier that ensures that the annotations coincide with the call-sites. For example, we could have every gpu.launch_func check whether its parameters fit the annotation.

If SCCP supports function calls (not sure if it does), it should be possible to also derive the launch bounds from all gpu.launch_func during analysis.

I see kernel outlining as a relatively late step in processing and would expect that there is no further optimization after it. After all, the inline representation of gpu.launch is meant for optimization. The only use I could see is if we want to reuse kernels but then the optimization that attaches kernels to multiple call-sites could take care of fixing up annotations. This would only be legal if the new annotations are a superset, anyway, as we might have used the static knowledge in optimization earlier.

krzysz00 requested changes to this revision.Jul 4 2022, 7:57 AM

I agree that getting anything more than very conservative bounds in an outlined kernel will be tricky and so it might not be worth doing (right now).

One note: that bound of 128 on the workitem ID is definitely wrong: from what I can tell, those Vulkan numbers are counted in units of warps/waves/..., not actual workitems. See, for example, https://www.llvm.org/docs/AMDGPUUsage.html#llvm-ir-attributes (where the default upper bound on workgroup size is 1024) and that, downstream, we routinely use workgroups with 256 items in them without issue.

I'd also *prefer* that we stick with what the API has told us the constraints on maximum workgroup/workitem dimensions are so we don't get suddenly bit by optimizations breaking on future hardware.

This revision now requires changes to proceed.Jul 4 2022, 7:57 AM
csigg added a comment.Jul 4 2022, 8:59 AM

I'd also *prefer* that we stick with what the API has told us the constraints on maximum workgroup/workitem dimensions are so we don't get suddenly bit by optimizations breaking on future hardware.

Yes, but we can't really query the target hardware in the compiler. At most, it could be a pass option. My feeling is that the current conservative bounds are large enough that this won't break in the foreseeable future: Kernel dispatch (cu/hipLaunchKernel, vkCmdDispatch) use 32-bit parameters for block/grid dimensions.

One note: that bound of 128 on the workitem ID is definitely wrong: from what I can tell, those Vulkan numbers are counted in units of warps/waves/..., not actual workitems. See, for example, https://www.llvm.org/docs/AMDGPUUsage.html#llvm-ir-attributes (where the default upper bound on workgroup size is 1024) and that, downstream, we routinely use workgroups with 256 items in them without issue.

128 is the limit for subgroup (Vulkan-speak for warps/wavefronts), not workgroup. I was quite surprised to see that Turnip Adreno seems to use a subgroup size of 128, and I don't think there is much value in going higher. But if you prefer, we could set this to uint32_max as well.

Ah, I missed that the one with 128 was laneId, not any of the other ones - my bad for not reading carefully.

Also, wrt the value of kMaxDim ... didn't we establish that std::numeric_limits<int32_t>::max would be enough, or did the Vulkan info contradict that?

csigg added a comment.Jul 4 2022, 10:08 AM

Also, wrt the value of kMaxDim ... didn't we establish that std::numeric_limits<int32_t>::max would be enough, or did the Vulkan info contradict that?

Mali GPUs report 2^32-1 for max grid dim:
https://vulkan.gpuinfo.org/listreports.php?limit=maxComputeWorkGroupCount[0]

I doubt that the values for max block dim above Intel's 2240 limit are correct, but I didn't want to do too much interpretation:
https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupSize[0]

krzysz00 accepted this revision.Jul 4 2022, 10:37 AM

Thanks for doing all the research on this!

LGTM, approved

This revision is now accepted and ready to land.Jul 4 2022, 10:37 AM

This infers block/grid dimensions/indices from launch operands.

Infers block/grid dimensions/indices or ranges of such dimensions/indices?

csigg edited the summary of this revision. (Show Details)Jul 4 2022, 10:13 PM
This revision was landed with ongoing or failed builds.Jul 4 2022, 10:15 PM
This revision was automatically updated to reflect the committed changes.