This review adds new DeviceRTL helper functions to support OpenMP reductions
that are 50 to more than 100 times faster than current OpenMP reductions.
The clang codegen to call these functions is not yet complete.
However, this review contains an extensive test of all the functions that
simulates the reduction in OpenMP without the reduction clause.
The test also performs the equivalent reduction with current OpenMP to
demonstrate correctness and compare performance.
EXAMPLE: Consider this OpenMP reduction code, a classic dot product of
double-precision vectors:
```cpp
double sum = 0.0;
#pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+:sum)
for (int64_t i = 0; i < array_size; i++)
  sum += a[i] * b[i];
```
A reduction is defined by two pair reduction functions and the reduction null
value (rnv). In the above example, the reduction is defined by a pair
reduction function that sums two double values, a pair reduction function
that sums two LDS doubles, and rnv = (double) 0. The pair reduction functions
for the built-in sum reduction are __kmpc_rfun_sum_d and __kmpc_rfun_sum_lds_d.
See Xteamr.cpp file for definitions of all xteamr functions.
Currently, the xteamr helper functions support 8 data types and 6 thread
configurations: 3 thread configs for warp size 32 and 3 for warp size 64.
Clang will generate code equivalent to the following OpenMP simulation of
the reduction, with a target offload pragma that does NOT contain a
reduction clause:
```cpp
#define _NUM_THREADS 512 // or 1024 or 256, number of waves must be power of 2
#define _NUM_TEAMS 80    // or get this value from ompx_get_device_num_units(devid)

devid = 0; // default device or device from target construct
uint32_t zero = 0;
struct loop_ctl_t {
  uint32_t *td_ptr;                 // Atomic counter accessed on device
  uint32_t reserved;                // reserved
  const int64_t stride = 1;         // stride to process input vectors
  const int64_t offset = 0;         // Offset to index of input vectors
  const int64_t size = _ARRAY_SIZE; // Size of input vectors
  const T rnv = T(0);               // reduction null value
  T *team_vals;                     // array of global team values
};
loop_ctl_t lc0; // Create and initialize a loop control structure.
lc0.team_vals = (T *)omp_target_alloc(sizeof(T) * _NUM_TEAMS, devid);
lc0.td_ptr = (uint32_t *)omp_target_alloc(sizeof(uint32_t), devid);
omp_target_memcpy(lc0.td_ptr, &zero, sizeof(uint32_t), 0, 0, devid,
                  omp_get_initial_device());
#pragma omp target teams distribute parallel for num_teams(_NUM_TEAMS) \
    num_threads(_NUM_THREADS) map(tofrom:sum) map(to:lc0)
for (unsigned int k = 0; k < (_NUM_TEAMS * _NUM_THREADS); k++) {
  T val0 = lc0.rnv;
  // This is the "BIGJUMP" loop code-gened by clang. A more complex form
  // is actually generated using lc0 when a non-zero offset and/or
  // non-one stride loop is required.
  for (int64_t i = 0; i < VSIZE; i += (TEAM_PROCS * _NUM_THREADS))
    val0 += a[i] * b[i];
  // This is the outlined function.
  // Each k iteration calls the helper function whose name is based on
  // data type and wave config. Args include the function pointers
  // for the pair reduction functions.
  __kmpc_xteamr_d_16x32(val0, &sum, lc0.team_vals, lc0.td_ptr,
                        __kmpc_rfun_sum_d, __kmpc_rfun_sum_lds_d, lc0.rnv);
}
```
In openmp/libomptarget/test/xteamr there is a comprehensive test of the
xteamr helper functions defined in this review.
The test_xteamr.cpp code in this review generates the following
output on an Nvidia Volta with 80 teams and 512 threads. It shows a
significant performance improvement over current OpenMP reductions.
For example, finding the minimum float value in a vector of floats was
over 140 times faster than current OpenMP, and summing doubles
was over 50 times faster.
```
TEST DOUBLE 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: double
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    15517        0.043249   0.043364  0.043276  15507
simdot    847885       0.000791   0.000820  0.000803  835647
ompmax    7756         0.043260   0.043370  0.043289  7751
simmax    775913       0.000432   0.000451  0.000438  765812
ompmin    7755         0.043266   0.043312  0.043285  7752
simmin    773780       0.000434   0.000447  0.000439  765161

TEST FLOAT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: float
Warp size:32
Array elements: 41943040
Array size: 160 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    7816         0.042930   0.043042  0.042965  7810
simdot    744750       0.000451   0.000468  0.000459  730648
ompmax    3901         0.043010   0.043147  0.043042  3898
simmax    557542       0.000301   0.000309  0.000307  546905
ompmin    3898         0.043041   0.043245  0.043094  3893
simmin    562826       0.000298   0.000337  0.000307  546866

TEST INT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 4
Warp size:32
Array elements: 41943040
Array size: 160 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    7811         0.042957   0.042976  0.042966  7810
simdot    755215       0.000444   0.000460  0.000451  743335
ompmax    3900         0.043020   0.043047  0.043032  3899
simmax    561618       0.000299   0.000309  0.000304  551575
ompmin    3897         0.043057   0.043101  0.043071  3895
simmin    552518       0.000304   0.000314  0.000309  543749

TEST UNSIGNED INT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 4
Warp size:32
Array elements: 41943040
Array size: 160 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    7815         0.042936   0.042965  0.042949  7813
simdot    750056       0.000447   0.000460  0.000453  741305
ompmax    3900         0.043023   0.043048  0.043032  3899
simmax    559142       0.000300   0.000309  0.000305  550425
ompmin    3897         0.043049   0.043074  0.043064  3896
simmin    555710       0.000302   0.000311  0.000307  547178

TEST LONG 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 8
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    15501        0.043294   0.043331  0.043309  15495
simdot    811168       0.000827   0.000850  0.000838  800531
ompmax    7763         0.043222   0.043256  0.043241  7760
simmax    721225       0.000465   0.000482  0.000472  710672
ompmin    7764         0.043217   0.043251  0.043234  7761
simmin    728138       0.000461   0.000471  0.000467  718363

TEST UNSIGNED LONG 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 8
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    15485        0.043339   0.043378  0.043359  15478
simdot    817903       0.000820   0.000835  0.000829  809555
ompmax    8004         0.041921   0.042024  0.041938  8001
simmax    722079       0.000465   0.000479  0.000473  709582
ompmin    7754         0.043272   0.043298  0.043285  7752
simmin    721427       0.000465   0.000483  0.000474  707325

TEST DOUBLE COMPLEX 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: double _Complex
Warp size:32
Array elements: 41943040
Array size: 640 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    30727        0.043680   0.043714  0.043700  30714
simdot    861368       0.001558   0.001579  0.001566  857175

TEST FLOAT COMPLEX 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: float _Complex
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function  Best-MB/sec  Min (sec)  Max       Average   Avg-MB/sec
ompdot    15453        0.043428   0.043452  0.043440  15449
simdot    797626       0.000841   0.000861  0.000849  790188
```