This revision adds a new op, map_nested_foreach_thread_to_gpu_threads, to the transform dialect. The op searches for scf.foreach_thread ops nested inside a gpu.launch op and distributes them onto GPU threads by rewriting their induction variables to gpu.thread_id ops.
Loop mapping is explicit and given by the map_nested_foreach_thread_to_gpu_threads op. The mapping is one-to-one (one loop iteration per thread), therefore the loops disappear.
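A minimal sketch of the intended usage follows; the transform.gpu namespace and the blockDim / thread_dim_mapping attribute spellings are assumptions here and the exact names may differ:

  // Inside a transform sequence; %arg0 is the payload root.
  %launch = transform.structured.match ops{["gpu.launch"]} in %arg0
  transform.gpu.map_nested_foreach_thread_to_gpu_threads %launch { blockDim = [32, 4, 1] }

Given a loop such as

  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%tdx = %c32, %tdy = %c4, %tdz = %c1) {
    scf.foreach_thread (%i, %j) in (%c32, %c4) {
      %v = memref.load %src[%i, %j] : memref<32x4xf32>
      memref.store %v, %dst[%i, %j] : memref<32x4xf32>
    } {thread_dim_mapping = [0, 1]}
    gpu.terminator
  }

the loop body is inlined into the gpu.launch region and the induction variables are replaced by thread ids:

  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%tdx = %c32, %tdy = %c4, %tdz = %c1) {
    %tidx = gpu.thread_id x
    %tidy = gpu.thread_id y
    %v = memref.load %src[%tidx, %tidy] : memref<32x4xf32>
    memref.store %v, %dst[%tidx, %tidy] : memref<32x4xf32>
    gpu.terminator
  }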
For the time being, trip counts that are dynamic or larger than the thread sizes are not supported. However, the compiler can support these cases in the future by generating a loop inside the kernel with static cyclic scheduling.
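For illustration, such a cyclic schedule could look like the sketch below, where each thread strides over the trip count by the block size. This is hypothetical output, not something this revision generates:

  %tidx = gpu.thread_id x
  %bdimx = gpu.block_dim x
  // Thread t would handle iterations t, t + blockDim.x, t + 2*blockDim.x, ...
  scf.for %i = %tidx to %trip_count step %bdimx {
    %v = memref.load %src[%i] : memref<?xf32>
    memref.store %v, %dst[%i] : memref<?xf32>
  }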
The current mechanism allows scf.foreach_thread ops to be siblings or nested. When they are nested, there cannot be any interleaving code between the loops, as shown in the sketch below.
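For example (attribute spelling assumed as above):

  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%tdx = %c32, %tdy = %c4, %tdz = %c1) {
    // Siblings: each loop is distributed independently.
    scf.foreach_thread (%i, %j) in (%c7, %c9) {
      %s = arith.addi %i, %j : index
    } {thread_dim_mapping = [1, 0]}
    // Nested: the inner loop must immediately follow the outer one,
    // with no other code in between.
    scf.foreach_thread (%i) in (%c4) {
      scf.foreach_thread (%j) in (%c12) {
        %p = arith.muli %i, %j : index
      } {thread_dim_mapping = [0]}
    } {thread_dim_mapping = [1]}
    gpu.terminator
  }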
Note: there is no translation_info attribute in MLIR.