Depends On D104780
Recursive work splitting, instead of submitting async tasks sequentially, gives a ~20%-30% speedup in microbenchmarks.
Algorithm outline:
- Collapse scf.parallel dimensions into a single dimension
- Compute the block size for the parallel operations from the 1d problem size
- Launch parallel tasks
- Each parallel task reconstructs its own bounds in the original multi-dimensional iteration space
- Each parallel task computes the original parallel operation body using an scf.for loop nest
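
The outline above can be illustrated with a small Python sketch. This is not the pass itself, only a model of the same steps under stated assumptions: a thread pool stands in for the async runtime, all names (`parallel_for`, `delinearize`, `target_blocks`, `max_workers`) are hypothetical, and each element is delinearized individually rather than reconstructing per-block bounds and emitting a loop nest.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from math import prod


def delinearize(index, dims):
    """Map a 1-D index back to row-major multi-dimensional coordinates."""
    coords = []
    for size in reversed(dims):
        coords.append(index % size)
        index //= size
    return tuple(reversed(coords))


def parallel_for(dims, body, target_blocks=8, max_workers=4):
    """Run body(*coords) for every point of the iteration space `dims`."""
    total = prod(dims)  # collapse all dimensions into one
    if total == 0:
        return
    block_size = -(-total // target_blocks)  # ceiling division
    block_count = -(-total // block_size)

    remaining = block_count
    lock = threading.Lock()
    done = threading.Event()

    def compute_block(block):
        nonlocal remaining
        start = block * block_size
        end = min(start + block_size, total)
        try:
            for linear in range(start, end):
                body(*delinearize(linear, dims))
        finally:
            # Count finished blocks so the caller knows when to stop waiting.
            with lock:
                remaining -= 1
                if remaining == 0:
                    done.set()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        def dispatch(first, last):
            # Recursive splitting: hand the upper half of [first, last) to
            # another task, keep halving the lower half, then compute `first`.
            while last - first > 1:
                mid = first + (last - first) // 2
                pool.submit(dispatch, mid, last)
                last = mid
            compute_block(first)

        dispatch(0, block_count)
        done.wait()  # tasks never block on each other, only the caller waits
```

With sequential submission, a single task would enqueue all `block_count` blocks one by one; here each task enqueues only about log2 of its range, so submission itself is spread across workers.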
Longer term, this could become an op, since the functionality is needed frequently, but that is out of scope here.