Currently, the map_nested_foreach_to_threads op requires the blockDim argument
to be present, which is tedious to provide.
This revision sets blockDim automatically when all the 'scf.foreach_thread'
trip counts are known at compile time. It traverses all sibling
scf.foreach_thread ops, finds the largest trip count in each dimension,
and assigns those per-dimension maxima to blockDim.
For the example shown below, it sets blockDim = [12, 9, 1], i.e. the x, y,
and z dimensions respectively.
gpu.launch() {
  scf.foreach_thread (%i, %j) in (7, 9)  // parallelized to threadIdx.x and threadIdx.y
  scf.foreach_thread (%i) in (12)        // parallelized to threadIdx.x
}
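The per-dimension maximum described above can be sketched as follows. This is
a minimal illustration of the selection logic in Python, not the actual MLIR
implementation; the helper name and its input representation (a list of static
trip-count tuples, one per sibling scf.foreach_thread op) are assumptions.

```python
def compute_block_dim(trip_counts_per_op, num_dims=3):
    """Given the static trip counts of sibling scf.foreach_thread ops,
    take the maximum trip count per dimension and pad the result to the
    three GPU dimensions (x, y, z), defaulting unused dimensions to 1."""
    block_dim = [1] * num_dims
    for trips in trip_counts_per_op:
        for dim, count in enumerate(trips):
            block_dim[dim] = max(block_dim[dim], count)
    return block_dim

# For the example above: (7, 9) and (12,) yield blockDim = [12, 9, 1].
print(compute_block_dim([(7, 9), (12,)]))  # -> [12, 9, 1]
```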
Note: the default blockDim computation needs to happen here, not hidden
under the map function.