This draft commit adds new DeviceRTL functions to support reductions in OpenMP that are 50 to over 100 times faster than current openmp reductions. The clang codegen to call these functions is not complete. However, this commit contains an extensive test to test all the functions by simulating the reduction with OpenMP without the reduction clause. The test does the equivalent reduction with OpenMP to show correctness and performance.
To run the test, this revision is needed https://reviews.llvm.org/D135162 to execute ompx_get_team_procs(devid)
EXAMPLE: Given this OpenMP reduction code, a classic dot product
double sum = 0.0; #pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+:sum) for (int64_t i = 0; i < array_size; i++) sum += a[i] * b[i];
Clang should generate code equivalent to this OpenMP target offload without a reduction.
#define _NUM_THREADS 1024 devid = 0 ; // or default or whatever is on target construct TEAM_PROCS = ompx_get_team_procs(devid); team_vals0 = (double *) omp_target_alloc(sizeof(double) * TEAM_PROCS, devid); teams_done_ptr0 = (uint32_t *) omp_target_alloc(sizeof(uint32_t),devid); uint32_t zero = 0; omp_target_memcpy(teams_done_ptr0, &zero , sizeof(uint32_t), 0, 0, devid, omp_get_initial_device()); #pragma omp target teams distribute parallel for \ num_teams(TEAM_PROCS) num_threads(_NUM_THREADS) \ map(tofrom:sum) is_device_ptr(team_vals0,teams_done_ptr0) for (unsigned int k=0; k<(LOOP_TEAMS*_NUM_THREADS) ; k++) { T val0 = 0.0; LOOP_STRIDE=1; LOOP_START=0; LOOP_SIZE = array_size; for (int64_t i = ((k * LOOP_STRIDE) + LOOP_START); i < LOOP_SIZE; i += (TEAM_PROCS * _NUM_THREADS * LOOP_STRIDE)) { val0 += a[i] * b[i]; // This is the outlined function. } // Call the correct xteamr helper function here. __kmpc_xteamr_d_16x64(val0, &sum, team_vals0, teams_done_ptr0, __kmpc_rfun_sum_d, __kmpc_rfun_sum_lds_d,0.0); } // end for loop and target region