This patch adds a more sophisticated team reduction scheme to the OpenMP libomptarget-nvptx runtime.
The scheme uses a fixed size global memory buffer whose length can be adjusted via compiler flag:
-fopenmp-cuda-teams-reduction-recs-num=1024
The global buffer is a structure of arrays (with default size of 1024 each and controlled by the above flag), one array for each reduction variable.
Values in the buffer are processed by the last team to finish executing the body of the target region.
In addition to adding support for the new flag, the compiler also emits special functions used for the reduction of the intermediate reduction values. These changes will be added in a separate compiler patch following this one.