This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP] Fast cross-team reduction (xteamr) helper functions.
Needs Review · Public

Authored by gregrodgers on Oct 24 2022, 2:59 PM.

Details

Summary

This review creates new DeviceRTL helper functions to support reductions in
OpenMP that are 50 to over 100 times faster than current OpenMP reductions.
The clang codegen that calls these functions is not yet complete.
However, this review contains an extensive test of all the functions that
simulates the reduction using OpenMP without the reduction clause.
The test also performs the equivalent reduction with current OpenMP to
demonstrate correctness and compare performance.

EXAMPLE: Consider this OpenMP reduction code, a classic dot product of two
double-precision vectors.

double sum = 0.0;
#pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+:sum)
for (int64_t i = 0; i < array_size; i++)
  sum += a[i] * b[i];

A reduction is defined by two pair reduction functions and the reduction null
value (rnv). In the above example, the reduction is defined by a pair
reduction function that sums two double values, a pair reduction function
that sums two LDS doubles, and rnv = (double) 0. The pair reduction functions
for the built-in sum reduction are __kmpc_rfun_sum_d and __kmpc_rfun_sum_lds_d.
See the Xteamr.cpp file for the definitions of all xteamr functions.
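For illustration, a pair reduction function simply folds one value into
another. A minimal sketch of what the two sum pairs for doubles could look
like follows; the names and signatures here are assumptions for illustration
only, and the authoritative definitions are in Xteamr.cpp in this diff.

// Sketch only: assumed shapes of the two pair reduction functions for the
// built-in double sum. The "lds" variant operates on values that live in
// LDS (GPU shared memory) during the intra-team phase of the reduction.
static void rfun_sum_d(double *val, double otherval) {
  *val += otherval;       // fold a register value into the accumulator
}
static void rfun_sum_lds_d(double *val, double *otherval) {
  *val += *otherval;      // fold one LDS-resident value into another
}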

Currently, the xteamr helper functions support 8 data types and 6 thread
configurations: 3 thread configurations for warp size 32 and 3 for warp size 64.
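The helper name appears to encode the element type and the waves-by-lanes
thread configuration (for example, _d_16x32 for doubles with 16 waves of 32
lanes, i.e. 512 threads on a warp-size-32 device). As a hedged illustration,
a declaration consistent with the call used in the simulation below might
look as follows; the exact prototypes are in Xteamr.cpp, and the parameter
names here are assumptions.

#include <cstdint>

// Assumed shape of the double, 16-waves-by-32-lanes (512-thread) variant,
// inferred from the call site in the simulation below: thread-local value,
// pointer to the final result, per-team scratch array, atomic team counter,
// the two pair reduction functions, and the reduction null value.
extern "C" void __kmpc_xteamr_d_16x32(double val, double *result_ptr,
                                      double *team_vals,
                                      uint32_t *teams_done_ptr,
                                      void (*rf)(double *, double),
                                      void (*rf_lds)(double *, double *),
                                      double rnv);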

Clang will generate code equivalent to the following OpenMP simulation of
the reduction, in which the target offload pragma does NOT contain a
reduction clause.

#define _NUM_THREADS 512 // or 1024 or 256, number of waves must be power of 2
#define _NUM_TEAMS 80    // or get this value from ompx_get_device_num_units(devid)
int devid = 0; // default device or device from target construct
uint32_t zero = 0;
// T below is the reduction element type (double in this example).
struct loop_ctl_t {
  uint32_t *td_ptr;                 // Atomic counter accessed on device
  uint32_t reserved;                // reserved
  const int64_t stride = 1;         // stride to process input vectors
  const int64_t offset = 0;         // Offset to index of input vectors
  const int64_t size = _ARRAY_SIZE; // Size of input vectors
  const T rnv = T(0);               // reduction null value
  T *team_vals;                     // array of global team values
};
loop_ctl_t lc0; // Create and initialize a loop control structure.
lc0.team_vals = (T *)omp_target_alloc(sizeof(T) * _NUM_TEAMS, devid);
lc0.td_ptr    = (uint32_t *)omp_target_alloc(sizeof(uint32_t), devid);
omp_target_memcpy(lc0.td_ptr, &zero, sizeof(uint32_t), 0, 0, devid,
                    omp_get_initial_device());
#pragma omp target teams distribute parallel for num_teams(_NUM_TEAMS) \
   num_threads(_NUM_THREADS) map(tofrom:sum) map(to:lc0)
for (unsigned int k=0; k<(_NUM_TEAMS*_NUM_THREADS) ; k++) {
   T val0 = lc0.rnv;
   // This is the "BIGJUMP" loop code-gened by clang. A more complex form
   // is actually generated using lc0 when a non-zero offset and/or a
   // non-unit stride is required.
   for (int64_t i = k; i < _ARRAY_SIZE; i += (_NUM_TEAMS * _NUM_THREADS))
      val0 += a[i] * b[i]; // The body is the outlined reduction statement.
   // Each k iteration calls the helper function whose name is based on the
   // data type and wave configuration. The arguments include the function
   // pointers for the pair reduction functions.
   __kmpc_xteamr_d_16x32(val0, &sum, lc0.team_vals, lc0.td_ptr,
       __kmpc_rfun_sum_d, __kmpc_rfun_sum_lds_d, lc0.rnv);
}
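After the target region completes, the reduced value is in sum; the
device-side scratch buffers allocated for the simulation can then be
released. A minimal cleanup sketch, continuing the snippet above:

// Release the per-team scratch array and the atomic team counter that were
// allocated with omp_target_alloc before the target region.
omp_target_free(lc0.team_vals, devid);
omp_target_free(lc0.td_ptr, devid);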

In openmp/libomptarget/test/xteamr there is a comprehensive test of the
xteamr helper functions defined in this review.

The test_xteamr.cpp code in this review generates the following output on
an NVIDIA Volta with 80 teams and 512 threads. This shows a significant
performance improvement over current OpenMP reductions. For example, finding
the minimum float value in a vector of floats was over 140 times faster than
current OpenMP, and summing doubles was over 50 times faster.

TEST DOUBLE 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Precision: double
Warp size:32
Array elements: 41943040
Array size:     320 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot          15517   0.043249  0.043364   0.043276       15507
  simdot         847885   0.000791  0.000820   0.000803      835647
  ompmax           7756   0.043260  0.043370   0.043289        7751
  simmax         775913   0.000432  0.000451   0.000438      765812
  ompmin           7755   0.043266  0.043312   0.043285        7752
  simmin         773780   0.000434  0.000447   0.000439      765161

TEST FLOAT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Precision: float
Warp size:32
Array elements: 41943040
Array size:     160 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot           7816   0.042930  0.043042   0.042965        7810
  simdot         744750   0.000451  0.000468   0.000459      730648
  ompmax           3901   0.043010  0.043147   0.043042        3898
  simmax         557542   0.000301  0.000309   0.000307      546905
  ompmin           3898   0.043041  0.043245   0.043094        3893
  simmin         562826   0.000298  0.000337   0.000307      546866

TEST INT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Integer Size: 4
Warp size:32
Array elements: 41943040
Array size:     160 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot           7811   0.042957  0.042976   0.042966        7810
  simdot         755215   0.000444  0.000460   0.000451      743335
  ompmax           3900   0.043020  0.043047   0.043032        3899
  simmax         561618   0.000299  0.000309   0.000304      551575
  ompmin           3897   0.043057  0.043101   0.043071        3895
  simmin         552518   0.000304  0.000314   0.000309      543749

TEST UNSIGNED INT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Integer Size: 4
Warp size:32
Array elements: 41943040
Array size:     160 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot           7815   0.042936  0.042965   0.042949        7813
  simdot         750056   0.000447  0.000460   0.000453      741305
  ompmax           3900   0.043023  0.043048   0.043032        3899
  simmax         559142   0.000300  0.000309   0.000305      550425
  ompmin           3897   0.043049  0.043074   0.043064        3896
  simmin         555710   0.000302  0.000311   0.000307      547178

TEST LONG 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Integer Size: 8
Warp size:32
Array elements: 41943040
Array size:     320 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot          15501   0.043294  0.043331   0.043309       15495
  simdot         811168   0.000827  0.000850   0.000838      800531
  ompmax           7763   0.043222  0.043256   0.043241        7760
  simmax         721225   0.000465  0.000482   0.000472      710672
  ompmin           7764   0.043217  0.043251   0.043234        7761
  simmin         728138   0.000461  0.000471   0.000467      718363

TEST UNSIGNED LONG 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Integer Size: 8
Warp size:32
Array elements: 41943040
Array size:     320 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot          15485   0.043339  0.043378   0.043359       15478
  simdot         817903   0.000820  0.000835   0.000829      809555
  ompmax           8004   0.041921  0.042024   0.041938        8001
  simmax         722079   0.000465  0.000479   0.000473      709582
  ompmin           7754   0.043272  0.043298   0.043285        7752
  simmin         721427   0.000465  0.000483   0.000474      707325

TEST DOUBLE COMPLEX 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Precision: double _Complex
Warp size:32
Array elements: 41943040
Array size:     640 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot          30727   0.043680  0.043714   0.043700       30714
  simdot         861368   0.001558  0.001579   0.001566      857175

TEST FLOAT COMPLEX 512 THREADS
Running kernels 12 times
Ignoring timing of first 2  runs
Precision: float _Complex
Warp size:32
Array elements: 41943040
Array size:     320 MB
Function    Best-MB/sec  Min (sec)     Max      Average     Avg-MB/sec
  ompdot          15453   0.043428  0.043452   0.043440       15449
  simdot         797626   0.000841  0.000861   0.000849      790188

Diff Detail

Event Timeline

gregrodgers created this revision. Oct 24 2022, 2:59 PM
Herald added a project: Restricted Project. Oct 24 2022, 2:59 PM
gregrodgers requested review of this revision. Oct 24 2022, 2:59 PM
Herald added a project: Restricted Project.
gregrodgers edited the summary of this revision. Oct 24 2022, 3:05 PM
gregrodgers edited the summary of this revision. Oct 24 2022, 3:12 PM
  • fix for make check-openmp
  • fix lit test of xteamr/test_xteamr.cpp
  • improve performance and lower register utilization by deriving reduction constants from k and passing numteams from codegen to the xteamr function; this changes the interface to the xteamr functions and removes dependencies on DeviceRTL mapping functions
  • removed duplicate variants of shfl that were identical between amdgcn and nvptx, and simplified the nvptx variant of shfl_xor_int

4th update to original D136631 review. Changes:

  • Removed an unnecessary thread sync following the LDS reduction.
  • Replaced the LDS bool __is_last_team with an LDS copy of the team counter value returned from the atomic increment. This removes a few more scalar registers. Performance and resource utilization noticeably improved across all updates to this review.
  • Changed the parameter name inival to "rnv" (Reduction Null Value) for consistency.
  • Improved the header docs and the comment block for the main function _xteamr_reduction.