Currently, the runtime implementation of __kmpc_alloc_shared is extremely slow because it allocates memory for each thread individually. This patch adds a small buffer for the threads to share data, which will greatly improve performance for builds where not all globalization could be optimized out. If the shared buffer is full, memory will be allocated per-warp rather than per-thread.
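To illustrate the idea, here is a rough host-side sketch of a fixed buffer with a slower fallback path. It is not the code from this patch: `SharedBuffer`, `allocShared`, `freeShared`, the 256-byte size, and the 8-byte alignment are all hypothetical, and `std::malloc` only stands in for the slower fallback allocation (per-warp in the patch).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical size; the real buffer size is whatever the patch chooses.
constexpr std::size_t kBufferSize = 256;

// Stand-in for the small buffer the threads share.
struct SharedBuffer {
  alignas(8) unsigned char data[kBufferSize];
  std::size_t used = 0; // bytes currently handed out
};

// Round a request up to a multiple of 8 bytes (assumed alignment).
inline std::size_t alignUp(std::size_t n) {
  return (n + 7) & ~std::size_t(7);
}

// Fast path: bump-allocate from the buffer while it has room.
// Slow path: fall back to a separate allocation once it is full.
void *allocShared(SharedBuffer &buf, std::size_t bytes, bool &fromBuffer) {
  const std::size_t need = alignUp(bytes);
  if (need <= kBufferSize - buf.used) {
    void *ptr = buf.data + buf.used;
    buf.used += need;
    fromBuffer = true;
    return ptr;
  }
  fromBuffer = false;
  return std::malloc(need); // placeholder for the slower fallback
}

// Frees are assumed to arrive in reverse order of allocation (stack order).
void freeShared(SharedBuffer &buf, void *ptr, std::size_t bytes,
                bool fromBuffer) {
  if (fromBuffer) {
    const std::size_t need = alignUp(bytes);
    assert(buf.used >= need);
    buf.used -= need;
  } else {
    std::free(ptr);
  }
}
```

The sketch only shows the control flow: requests are served from the buffer until it runs out, and only then take the expensive path.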
Depends on D97680
IIRC, in the new deviceRTLs, we only have one stack where the first chunk, which is bigger than the rest, is for the main thread in non-SPMD mode. Why do we want to have two here?