The GPU Transform dialect provides the following ops for mapping:
  gpu.map_nested_foreach_to_threads --> gpu threads
  gpu.map_foreach_to_blocks --> gpu blocks
D137413 implements clear loop mapping attributes, so there is no need to have two different ops for mapping.
In addition, parallelizing the entire kernel with a single op comes with advantages. First, the compiler can calculate gridDim once blockDim is known. This is necessary for global mapping, where the global index is blockIdx.x * blockDim.x + threadIdx.x.
Reference to mapping missing here.
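
As a rough illustration of the arithmetic only (not of how the transform op is implemented), the sketch below derives gridDim from a known blockDim by ceiling division and then enumerates the global indices blockIdx.x * blockDim.x + threadIdx.x. It is one-dimensional, and the names grid_dim, global_ids, and problem_size are hypothetical, introduced just for this example.

```python
def grid_dim(problem_size: int, block_dim: int) -> int:
    """Blocks needed to cover problem_size items with block_dim threads per block."""
    if problem_size < 0 or block_dim <= 0:
        raise ValueError("problem_size must be >= 0 and block_dim must be > 0")
    # Ceiling division: the last block may be only partially used.
    return (problem_size + block_dim - 1) // block_dim


def global_ids(problem_size: int, block_dim: int) -> list[int]:
    """Simulate the 1-D global mapping over every (block, thread) pair."""
    ids = []
    for block_idx in range(grid_dim(problem_size, block_dim)):
        for thread_idx in range(block_dim):
            gid = block_idx * block_dim + thread_idx
            # Guard out-of-range threads in the final, partially filled block.
            if gid < problem_size:
                ids.append(gid)
    return ids
```

The point of the sketch is the dependency: the global index cannot be formed until blockDim is fixed, and once it is, gridDim follows from the problem size.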