Page MenuHomePhabricator

[OPENMP] Xteamr Helper functions for high performance reductions
AbandonedPublicDraft

Authored by gregrodgers on Oct 5 2022, 11:46 AM.

Details

Reviewers
jdoerfert
Summary

This draft commit adds new DeviceRTL functions to support reductions in OpenMP that are 50 to over 100 times faster than current openmp reductions. The clang codegen to call these functions is not complete. However, this commit contains an extensive test to test all the functions by simulating the reduction with OpenMP without the reduction clause. The test does the equivalent reduction with OpenMP to show correctness and performance.

To run the test, this revision is needed https://reviews.llvm.org/D135162 to execute ompx_get_team_procs(devid)

EXAMPLE: Given this OpenMP reduction code, a classic dot product

double sum = 0.0;
 #pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+:sum)
 for (int64_t i = 0; i < array_size; i++)
   sum += a[i] * b[i];

Clang should generate code equivalent to this OpenMP target offload without a reduction.

  #define _NUM_THREADS 1024
  devid = 0 ; // or default or whatever is on target construct
  TEAM_PROCS = ompx_get_team_procs(devid);
  team_vals0  = (double *) omp_target_alloc(sizeof(double) * TEAM_PROCS, devid);
  teams_done_ptr0 = (uint32_t *) omp_target_alloc(sizeof(uint32_t),devid);
  uint32_t zero = 0;
  omp_target_memcpy(teams_done_ptr0, &zero , sizeof(uint32_t), 0,
                  0, devid, omp_get_initial_device());
  #pragma omp target teams distribute parallel for \
     num_teams(TEAM_PROCS) num_threads(_NUM_THREADS) \
     map(tofrom:sum) is_device_ptr(team_vals0,teams_done_ptr0)
  for (unsigned int k=0; k<(LOOP_TEAMS*_NUM_THREADS) ; k++) {
     T val0 = 0.0;
     LOOP_STRIDE=1; LOOP_START=0; LOOP_SIZE = array_size;
     for (int64_t i = ((k * LOOP_STRIDE) + LOOP_START); i < LOOP_SIZE;
           i += (TEAM_PROCS * _NUM_THREADS * LOOP_STRIDE))
     {
         val0 += a[i] * b[i];   // This is the outlined function.
     }
     //   Call the correct xteamr helper function here.
     __kmpc_xteamr_d_16x64(val0, &sum, team_vals0, teams_done_ptr0,
         __kmpc_rfun_sum_d, __kmpc_rfun_sum_lds_d,0.0);
}  // end for loop and target region

Diff Detail

Event Timeline

gregrodgers created this revision.Oct 5 2022, 11:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 5 2022, 11:46 AM
gregrodgers edited the summary of this revision. (Show Details)Oct 5 2022, 11:56 AM
gregrodgers added a reviewer: jdoerfert.
gregrodgers edited the summary of this revision. (Show Details)Oct 5 2022, 12:02 PM
gregrodgers abandoned this revision.Oct 24 2022, 2:37 PM

I will submit another more detailed revision for review.

Herald added a project: Restricted Project. · View Herald TranscriptOct 24 2022, 2:37 PM