Depends On D104780

Splitting the work recursively, instead of submitting async tasks sequentially, gives a ~20%-30% speedup in microbenchmarks.
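
To illustrate the difference, here is a minimal Python sketch of recursive work splitting. It is not the code from this patch: `run_blocks_recursive`, `compute_block`, and the thread pool are hypothetical stand-ins for the async runtime. Each task hands the upper half of its block range to a new task and keeps the lower half, so task submission is spread across workers instead of being done by a single caller.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_blocks_recursive(pool, num_blocks, compute_block):
    """Run compute_block(b) for every b in [0, num_blocks) via recursive splitting."""
    if num_blocks <= 0:
        return

    remaining = [num_blocks]      # blocks that have not finished yet
    lock = threading.Lock()
    done = threading.Event()

    def dispatch(begin, end):
        # Repeatedly hand the upper half of [begin, end) to a new task
        # and keep the lower half, until a single block is left.
        while end - begin > 1:
            mid = begin + (end - begin) // 2
            pool.submit(dispatch, mid, end)
            end = mid
        compute_block(begin)
        # Tasks never wait on each other; only the caller waits for the count
        # to reach zero. An exception in compute_block is not handled here.
        with lock:
            remaining[0] -= 1
            if remaining[0] == 0:
                done.set()

    dispatch(0, num_blocks)
    done.wait()
```

With sequential submission the caller would issue all `num_blocks` submissions itself; here the depth of the submission tree is logarithmic in the number of blocks.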

Algorithm outline:

- Collapse scf.parallel dimensions into a single dimension
- Compute the block size for the parallel operations from the 1-D problem size
- Launch parallel tasks
- Each parallel task reconstructs its own bounds in the original multi-dimensional iteration space
- Each parallel task computes the original parallel operation body using an scf.for loop nest
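
The index arithmetic behind these steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the lowering itself: all function names are hypothetical, the block-size rule (ceiling division by a `target_blocks` hint) is an assumed heuristic, and the sketch delinearizes every index instead of emitting a loop nest.

```python
from math import prod


def trip_counts(lbs, ubs, steps):
    """Number of iterations along each dimension of the original loop."""
    return [max(0, -(-(ub - lb) // step)) for lb, ub, step in zip(lbs, ubs, steps)]


def compute_block_size(total, target_blocks):
    """Assumed heuristic: spread the 1-D problem over roughly target_blocks blocks."""
    return max(1, -(-total // target_blocks))


def delinearize(index, counts):
    """Map a 1-D index back to multi-dimensional coordinates (last dim fastest)."""
    coords = []
    for count in reversed(counts):
        coords.append(index % count)
        index //= count
    return coords[::-1]


def block_iterations(block, block_size, lbs, steps, counts):
    """Yield the original induction variable values covered by one block."""
    total = prod(counts)
    begin = block * block_size
    end = min(begin + block_size, total)
    for linear in range(begin, end):
        coords = delinearize(linear, counts)
        yield tuple(lb + c * step for lb, c, step in zip(lbs, coords, steps))
```

For example, a 2-D loop with lower bounds `[0, 2]`, upper bounds `[3, 8]`, and steps `[1, 2]` collapses to 9 iterations; with `target_blocks = 4` the block size is 3, and concatenating the blocks in order visits the same induction variable values as the original nest.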