map_nested_foreach_to_threads applies an implicit loop mapping, shown below. The mapping is correct, but it does not take advantage of warp-level programming.
  for(i) --> threadIdx.x
  for(j) --> threadIdx.y
  for(k) --> threadIdx.z
In some cases the compiler can determine the optimal loop mapping automatically, but not always. Until we have a fully performant solution, we expect the user to specify the mapping explicitly. This revision therefore adds a failure when thread_dim_mapping is not specified.
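
For illustration only, here is a minimal standalone C++ sketch of the behavior change described above. It is not the MLIR implementation: ThreadDim, implicitMapping, resolveMapping, and the error string are hypothetical names invented for this example. It contrasts the previous implicit default (loop i to the i-th thread dimension) with a check that rejects a missing mapping.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a GPU thread dimension (not an MLIR type).
enum class ThreadDim { X = 0, Y = 1, Z = 2 };

// The implicit default described above:
// loop 0 -> threadIdx.x, loop 1 -> threadIdx.y, loop 2 -> threadIdx.z.
std::vector<ThreadDim> implicitMapping(std::size_t numLoops) {
  std::vector<ThreadDim> mapping;
  for (std::size_t i = 0; i < numLoops && i < 3; ++i)
    mapping.push_back(static_cast<ThreadDim>(i));
  return mapping;
}

// Sketch of the new behavior: an empty mapping is an error instead of
// silently falling back to implicitMapping(). The message text is made up.
std::optional<std::vector<ThreadDim>>
resolveMapping(const std::vector<ThreadDim> &threadDimMapping,
               std::string &error) {
  if (threadDimMapping.empty()) {
    error = "thread_dim_mapping must be specified";
    return std::nullopt;
  }
  return threadDimMapping;
}
```

Under these assumptions, resolveMapping({}, err) returns an empty optional and sets err, while an explicit mapping such as {ThreadDim::Y, ThreadDim::X} is passed through unchanged.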