clang/AMDGPU: Apply workgroup related attributes to all functions
Needs Review (Public)

Authored by arsenm on Fri, Oct 16, 11:51 AM.

Details

Summary

When the default flat work group size is 256, it should also apply to
callable functions.

Event Timeline

arsenm created this revision. Fri, Oct 16, 11:51 AM
arsenm requested review of this revision. Fri, Oct 16, 11:51 AM

What if a device function is called by kernels with different work group sizes? Will the caller's work group size override the callee's work group size?

It's user error to call a function with a larger range than the caller.

The problem is that a user can override the default on a kernel with the attribute, but cannot do so on a function. So a module can be compiled with a default smaller than the size requested on one of the kernels.

If the default were the maximum of 1024 and could only be overridden with the --gpu-max-threads-per-block option, this would not be a problem, were it not for the description of the option:

LANGOPT(GPUMaxThreadsPerBlock, 32, 256, "default max threads per block for kernel launch bounds for HIP")

I.e., it talks about the "default", so it should be perfectly legal to set a higher limit on a specific kernel. If the option said that it restricts the maximum, it would be legal to apply it to functions as well.
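To make the concern concrete, here is a minimal C++ sketch of the mismatch described above. It is a hypothetical model, not Clang's implementation: the `Func` struct and the `effectiveMax` and `callExceedsCalleeBound` helpers are names invented for illustration. It assumes the module default is applied to every function, and that only kernels can carry an explicit override.

```cpp
#include <cassert>
#include <optional>

// Hypothetical model of a function in a module; not a Clang data structure.
struct Func {
  bool isKernel;
  // Explicit per-kernel maximum set by an attribute. In this model, plain
  // (non-kernel) functions have no way to set it.
  std::optional<unsigned> explicitMax;
};

// Effective maximum flat work group size when the module default is applied
// to all functions and only kernels may override it.
unsigned effectiveMax(const Func &f, unsigned moduleDefault) {
  assert(moduleDefault > 0 && "default must be positive");
  if (f.isKernel && f.explicitMax)
    return *f.explicitMax;
  return moduleDefault;
}

// True when the caller may run with more work items than the callee was
// compiled for, i.e. the case the comment above is worried about.
bool callExceedsCalleeBound(const Func &caller, const Func &callee,
                            unsigned moduleDefault) {
  return effectiveMax(caller, moduleDefault) >
         effectiveMax(callee, moduleDefault);
}
```

With a default of 256, a kernel overridden to 1024 that calls a plain function trips the check. With a default of 1024 it does not, which matches the observation that a maximum-valued default would be unproblematic.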

The current backend default ends up greatly restricting the registers used in the functions, and increasing the spilling.

I know the problem, but it would be better to use AMDGPUPropagateAttributes for this. It will clone functions if needed.
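The propagate-and-clone idea can be sketched with a toy C++ model. This is not the code of AMDGPUPropagateAttributes: the `KernelBounds` and `CallEdges` types and the `requiredVariants` function are hypothetical, and the model only looks at functions called directly from kernels. It computes, for each callee, the set of distinct bounds its callers use. More than one entry means one clone per entry would be needed.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model: kernel name -> maximum flat work group size for that kernel.
using KernelBounds = std::map<std::string, unsigned>;
// Call edges (caller kernel, directly called device function).
using CallEdges = std::vector<std::pair<std::string, std::string>>;

// For every called function, collect the distinct bounds it would have to be
// compiled for. A set with more than one element stands for "clone needed".
std::map<std::string, std::set<unsigned>>
requiredVariants(const KernelBounds &kernels, const CallEdges &calls) {
  std::map<std::string, std::set<unsigned>> variants;
  for (const auto &[caller, callee] : calls) {
    auto it = kernels.find(caller);
    if (it == kernels.end())
      continue; // caller is not a known kernel; ignored in this sketch
    assert(it->second > 0 && "bounds are positive");
    variants[callee].insert(it->second);
  }
  return variants;
}
```

In this model, a helper called from kernels bounded at 256 and 1024 gets two variants, while a helper called only from 256-bounded kernels gets one and needs no clone.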