This draft commit adds new DeviceRTL functions that support OpenMP reductions running 50 to over 100 times faster than current OpenMP reductions. The Clang codegen that calls these functions is not yet complete. However, this commit includes an extensive test that exercises all of the functions by simulating the reduction in OpenMP without the reduction clause. The test also runs the equivalent OpenMP reduction to check correctness and compare performance.
EXAMPLE: Given this OpenMP code:
```
double sum = 0.0;
#pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+:sum)
for (int64_t i = 0; i < array_size; i++)
  sum += a[i] * b[i];
```
Clang should generate code equivalent to the following OpenMP target offload region, which has no reduction clause:
```
#define _NUM_THREADS 1024
devid = 0; // or the default device, or whichever device the target construct names
TEAM_PROCS = ompx_get_team_procs(devid);
team_vals0 = (double *)omp_target_alloc(sizeof(double) * TEAM_PROCS, devid);
teams_done_ptr0 = (uint32_t *)omp_target_alloc(sizeof(uint32_t), devid);
uint32_t zero = 0;
// Zero the teams-done counter on the device.
omp_target_memcpy(teams_done_ptr0, &zero, sizeof(uint32_t), 0,
                  0, devid, omp_get_initial_device());
#pragma omp target teams distribute parallel for \
    num_teams(TEAM_PROCS) num_threads(_NUM_THREADS) \
    map(tofrom:sum) is_device_ptr(team_vals0,teams_done_ptr0)
for (unsigned int k = 0; k < (TEAM_PROCS * _NUM_THREADS); k++) {
  double val0 = 0.0; // thread-local partial result
  LOOP_STRIDE = 1; LOOP_START = 0; LOOP_SIZE = array_size;
  // Each thread strides over the whole iteration space.
  for (int64_t i = ((k * LOOP_STRIDE) + LOOP_START); i < LOOP_SIZE;
       i += (TEAM_PROCS * _NUM_THREADS * LOOP_STRIDE))
  {
    val0 += a[i] * b[i]; // This is the outlined function.
  }
  // Call the correct xteamr helper function here.
  __kmpc_xteamr_d_16x64(val0, &sum, team_vals0, teams_done_ptr0,
                        __kmpc_rfun_sum_d, __kmpc_rfun_sum_lds_d, 0.0);
} // end for loop and target region
```
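To make the data flow easier to follow, here is a minimal host-only sketch of the cross-team pattern the generated code relies on. It is a serial model, not the DeviceRTL implementation: the function name `xteam_dot_model` is hypothetical, and the way partial values are combined (per-team slot in `team_vals`, a `teams_done` counter, and a final pass by the last team to finish) is an assumption inferred from the argument names of `__kmpc_xteamr_d_16x64`, not taken from its source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical serial model of a cross-team sum reduction.
 * Teams and threads are simulated with ordinary loops. */
double xteam_dot_model(const double *a, const double *b, int64_t n,
                       int num_teams, int num_threads) {
  assert(num_teams > 0 && num_threads > 0);
  double *team_vals = calloc((size_t)num_teams, sizeof(double));
  assert(team_vals != NULL);
  uint32_t teams_done = 0; /* models *teams_done_ptr0 */
  double sum = 0.0;
  int64_t total = (int64_t)num_teams * num_threads;

  for (int team = 0; team < num_teams; team++) {
    double team_val = 0.0;
    for (int t = 0; t < num_threads; t++) {
      int64_t k = (int64_t)team * num_threads + t; /* global thread id */
      double val0 = 0.0;
      /* Same strided loop as the generated code (stride 1, start 0). */
      for (int64_t i = k; i < n; i += total)
        val0 += a[i] * b[i];
      team_val += val0; /* stands in for the intra-team reduction */
    }
    team_vals[team] = team_val; /* publish this team's partial value */
    teams_done++;
    if (teams_done == (uint32_t)num_teams) {
      /* Assumed: the last team to finish combines all team values. */
      for (int j = 0; j < num_teams; j++)
        sum += team_vals[j];
    }
  }
  free(team_vals);
  return sum;
}
```

Calling `xteam_dot_model(a, b, array_size, teams, threads)` with any positive team and thread counts should give the same result as the sequential loop, up to floating-point reassociation. On a real device the counter update would have to be atomic; the serial model sidesteps that.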