If a combined loop has insufficient parallelism (i.e., a low trip count),
we might end up with too few teams/blocks to occupy the device. To counter
that, we can reduce the number of threads per team, which spreads the same
iterations across more teams. This patch implements such a heuristic and
exposes a new environment variable to control the minimum number of
threads per team to be used in this case.
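
A minimal sketch of how such a heuristic could look, for illustration only.
The function name, the halving policy, and the use of the compute unit count
as the target number of teams are assumptions, not the code in this patch;
minThreads stands in for the value read from the new environment variable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the patch's implementation): choose a thread
// count per team for a combined loop with the given trip count.
uint32_t pickThreadsPerTeam(uint64_t tripCount, uint32_t defaultThreads,
                            uint32_t numComputeUnits, uint32_t minThreads) {
  assert(defaultThreads > 0 && "need at least one thread per team");
  // Unknown trip count: keep the default.
  if (tripCount == 0)
    return defaultThreads;
  // Never go below one thread, even if the user-supplied floor is 0.
  minThreads = std::max<uint32_t>(1, minThreads);

  uint32_t threads = defaultThreads;
  // Halve the team size while that still respects the floor and the
  // resulting number of teams would otherwise leave compute units idle.
  while (threads / 2 >= minThreads &&
         (tripCount + threads - 1) / threads < numComputeUnits)
    threads /= 2;
  return threads;
}
```

With a trip count of 4096, a default of 256 threads, 64 compute units, and a
floor of 32, this sketch settles on 64 threads per team (64 teams) instead of
256 threads per team (16 teams).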
Issue reported by:
Felipe Cabarcas Jaramillo <cabarcas@udel.edu> (@fel-cab).
Shouldn't this correspond to the warp/wavefront size? On NVPTX it's 32, but on AMDGPU it could be 32 or 64. You can query it using HSA.