When the default flat work group size is 256, it should also apply to
callable functions.
Details
Diff Detail
Event Timeline
What if a device function is called by kernels with different work group sizes, will caller's work group size override callee's work group size?
The problem is that user can override default on a kernel with the attribute, but cannot do so on function. So a module can be compiled with a default smaller than requested on one of the kernels.
Then if default is maximum 1024 and can only be overridden with the --gpu-max-threads-per-block option it would not be problem, if not the description of the option:
LANGOPT(GPUMaxThreadsPerBlock, 32, 256, "default max threads per block for kernel launch bounds for HIP")
I.e. it says about the "default", so it should be perfectly legal to set a higher limits on a specific kernel. Should the option say it restricts the maximum it would be legal to apply it to functions as well.
Then if default is maximum 1024 and can only be overridden with the --gpu-max-threads-per-block option it would not be problem, if not the description of the option:
LANGOPT(GPUMaxThreadsPerBlock, 32, 256, "default max threads per block for kernel launch bounds for HIP")I.e. it says about the "default", so it should be perfectly legal to set a higher limits on a specific kernel. Should the option say it restricts the maximum it would be legal to apply it to functions as well.
The current backend default ends up greatly restricting the registers used in the functions, and increasing the spilling.
I know the problem, but it should be better to use AMDGPUPropagateAttributes for this. It will clone functions if needed.