The OpenMP specification leaves the default schedule type of a
worksharing loop implementation defined. Currently, the compiler
generates a doubly nested loop that effectively implements a
schedule of type (static). This is ideal for threads on CPUs.
On NVPTX and other SIMT GPUs, this schedule performs very poorly
because consecutive threads in a warp access array elements in a
non-coalesced manner. To achieve coalescing, and therefore good
performance, the best schedule is static with a chunk size of 1.
This patch lets target devices select the best default schedule for
their architecture. It also modifies loop codegen to generate
optimized code for (static,1) on the NVPTX device, i.e., a single
loop instead of the doubly nested loop that is generated today.
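
To illustrate the difference, here is a minimal host-side sketch of which
iterations each thread visits under the two schedules. It is not the code
the compiler emits. The function names are hypothetical, and the
ceiling-division chunk size used for (static) is an assumption about how
an implementation might split the iteration space.

```cpp
#include <algorithm>
#include <vector>

// schedule(static), no chunk size: each thread gets one contiguous block.
// Modeled as a doubly nested loop: the outer loop walks the chunks owned by
// the thread (only one here), the inner loop walks the iterations of a chunk.
// ASSUMPTION: the block size is ceil(n / nthreads).
std::vector<int> static_unchunked(int tid, int nthreads, int n) {
  std::vector<int> iterations;
  const int chunk = (n + nthreads - 1) / nthreads;
  for (int lb = tid * chunk; lb < n; lb += nthreads * chunk) {
    const int ub = std::min(lb + chunk, n);
    for (int i = lb; i < ub; ++i)
      iterations.push_back(i);
  }
  return iterations;
}

// schedule(static,1): iterations are dealt out round-robin, so a single
// strided loop is enough and no inner loop is needed.
std::vector<int> static_chunk1(int tid, int nthreads, int n) {
  std::vector<int> iterations;
  for (int i = tid; i < n; i += nthreads)
    iterations.push_back(i);
  return iterations;
}
```

With 4 threads and 8 iterations, the first step of (static) has threads
0..3 touching elements 0, 2, 4, 6, whereas (static,1) has them touching
0, 1, 2, 3. The second pattern is the one that lets adjacent threads in a
warp access adjacent memory locations.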
No way, these classes should not be used here