This removes roughly 20% of the memory allocated statically by the
runtime (1010592744 vs 801186888 bytes). It will make loops without a
statically known static schedule more expensive (due to a one-time
allocation and free). It will make new data environments, e.g., nested
parallel regions, cheaper, as we no longer store/reload the "loop
data". Overall, it's a trade-off that ensures you only pay for what
you use.
A better solution would be to extend the _dispatch_ API such that we
can pass in "stack allocations" to hold the data. Since that requires
more work, this seemed like a good first step.
Probably needs more testing.