This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
ClosedPublic

Authored by gtbercea on Sep 24 2018, 1:44 PM.

Details

Summary

For the OpenMP NVPTX toolchain choose a default distribute schedule that ensures coalescing on the GPU when in SPMD mode. This significantly increases the performance of offloaded target code and reduces the number of registers used on the GPU side.
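
For illustration, with this patch the SPMD-mode default behaves roughly as if the distribute schedule were written out explicitly, as in the sketch below. The chunk of 128 is an assumption for illustration only; the actual default chunk is the number of threads in the team, chosen at launch time.

void vadd(double *a, const double *b, int n) {
  // Sketch only: roughly equivalent to the new default. Each team receives a
  // chunk of <threads-in-team> iterations, and within that chunk each thread
  // handles one iteration, so consecutive threads touch consecutive elements
  // of a[] and b[] (coalesced accesses).
  #pragma omp target teams distribute parallel for \
      map(tofrom: a[0:n]) map(to: b[0:n]) dist_schedule(static, 128)
  for (int i = 0; i < n; i++)
    a[i] += b[i];
}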

Diff Detail

Repository
rC Clang

Event Timeline

gtbercea created this revision. Sep 24 2018, 1:44 PM
ABataev added inline comments. Sep 25 2018, 10:30 AM
lib/CodeGen/CGStmtOpenMP.cpp
2304

Restore original code here

2315

Why don't you want to have only one function for the default scheduling/chunk? Also, you should change the DistSchedule code, not Schedule.

3333

Restore original

gtbercea updated this revision to Diff 167172. Sep 26 2018, 12:16 PM
gtbercea edited the summary of this revision. (Show Details)

Only change default schedule for distribute directive.

ABataev added inline comments. Sep 26 2018, 12:29 PM
lib/CodeGen/CGOpenMPRuntime.cpp
9199 ↗(On Diff #167172)

Remove return;, it is not required

lib/CodeGen/CGOpenMPRuntimeNVPTX.h
345

Modify it to be a reference rather than a pointer.

gtbercea updated this revision to Diff 167326. Sep 27 2018, 7:55 AM

Fix type of chunk size.

gtbercea marked 3 inline comments as done. Sep 27 2018, 7:56 AM
This revision is now accepted and ready to land. Sep 27 2018, 7:59 AM

Should we also change the default schedule to static, 1? I know that's not really needed for teams distribute parallel for (because the new default dist_schedule only leaves one iteration per thread), but this doesn't happen for target parallel for. Additionally, it would make the intent more explicit, and LLVM wouldn't need to look through the divisions needed to implement static without a chunk. Just thinking aloud, not sure if that's worth it.

That is the intention. I just took that part out of this patch to avoid confusing things, since I wanted it as a separate patch. :)
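
For reference, the follow-up default discussed here would make a directive without teams distribute behave like this hedged sketch with the clause spelled out (hypothetical example, not code from the patch):

void scale(double *a, double s, int n) {
  // Sketch only: schedule(static, 1) hands consecutive iterations to
  // consecutive threads, so neighbouring threads access neighbouring
  // array elements and the loads/stores coalesce on the GPU.
  #pragma omp target parallel for map(tofrom: a[0:n]) schedule(static, 1)
  for (int i = 0; i < n; i++)
    a[i] *= s;
}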

Hahnfeld accepted this revision. Sep 27 2018, 8:23 AM

Ok, perfect. I was probably confused because the title still spoke of "schedules" (plural).

gtbercea retitled this revision from [OpenMP] Make default schedules for NVPTX target regions in SPMD mode achieve coalescing to [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing. Sep 27 2018, 12:23 PM
gtbercea edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.

Just tested this and got very weird results for register usage:

void func(double *a) {
  #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
  for (int i = 0; i < 100; i++) {
    a[i]++;
  }
}

Compiling with current trunk for sm_60 (Pascal): 29 registers
Adding dist_schedule(static) (the previous default): 19 registers
For reference: dist_schedule(static, 128) also uses 29 registers

Any ideas? This significantly slows down STREAM...

Jonas, without an explicit dist_schedule clause the program will now run with dist_schedule(static, <number of threads in block>). That seems to be working as intended, since you get the same register count for the explicit static-chunk variant as for the default case.
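
A minimal host-side sketch of the resulting access pattern, assuming a launch of 2 teams with 128 threads each (the numbers are illustrative, not what the runtime necessarily picks):

#include <stdio.h>

// Sketch only: which iterations a given GPU thread ends up with under
// dist_schedule(static, T) when T equals the threads per team. Thread t of
// team b gets i = b*T + t, b*T + t + num_teams*T, ... so within each round
// the threads of a team touch consecutive array elements (coalesced).
int main(void) {
  const int trip_count = 1024;
  const int num_teams = 2, T = 128; /* assumed launch shape */
  for (int team = 0; team < num_teams; ++team)
    for (int t = 0; t < 3; ++t)     /* print only a few threads per team */
      for (int i = team * T + t; i < trip_count; i += num_teams * T)
        printf("team %d, thread %3d -> a[%d]\n", team, t, i);
  return 0;
}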

The difference you see in register count is (I suspect) driven by the runtime code (fewer registers for non-chunked than for chunked). I am currently investigating this and trying to find ways to reduce that number.

One big problem your code has is that the trip count is incredibly small, especially for STREAM and especially on GPUs. You need a much larger loop size, otherwise the timings will be dominated by OpenMP setup costs.

Sure, I'm not that dumb. The real code has larger loops; this was just for demonstration purposes. I don't expect the register count to change based on loop size - is that too optimistic?

The register count will of course not change with loop size.

I checked the different combinations of schedules, and the current defaults are the fastest. The old defaults are about 10x slower than the current set (dist_schedule(static, <num threads>) and schedule(static, 1)). The register allocation looks strange, but it's just a consequence of using different schedules.

You report a slowdown which I am actually not able to reproduce. Are you using any additional clauses not present in your previous post?

No, only dist_schedule(static), which is faster. Tested on a Tesla P100 with today's trunk version:

#pragma omp target teams distribute parallel for (new defaults):             190 - 250 GB/s
adding clauses for old defaults: schedule(static) dist_schedule(static):      30 -  50 GB/s
same directive with only dist_schedule(static) added (fewer registers):      320 - 400 GB/s
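
For reference, the three rows correspond roughly to the directives in the sketch below applied to a STREAM-style copy kernel (illustrative only; the measurements above use the original STREAM benchmark, not this code):

void copy_new_defaults(double *c, const double *a, long n) {
  // Row 1: rely on the new defaults introduced by this patch.
  #pragma omp target teams distribute parallel for map(to: a[0:n]) map(from: c[0:n])
  for (long i = 0; i < n; i++)
    c[i] = a[i];
}

void copy_old_defaults(double *c, const double *a, long n) {
  // Row 2: the old defaults written out explicitly.
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n]) map(from: c[0:n]) schedule(static) dist_schedule(static)
  for (long i = 0; i < n; i++)
    c[i] = a[i];
}

void copy_dist_static(double *c, const double *a, long n) {
  // Row 3: only dist_schedule(static) added.
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n]) map(from: c[0:n]) dist_schedule(static)
  for (long i = 0; i < n; i++)
    c[i] = a[i];
}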

Which loop size are you using? What runtime does nvprof report for these kernels?

Sorry, forgot to mention: I'm using the original STREAM code with 80,000,000 double elements in each vector.

Output from nvprof:

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   70.05%  676.71ms         9  75.191ms  1.3760us  248.09ms  [CUDA memcpy DtoH]
                   7.67%  74.102ms        10  7.4102ms  7.3948ms  7.4220ms  __omp_offloading_34_b871a7d5_main_l307
                   7.63%  73.679ms        10  7.3679ms  7.3457ms  7.3811ms  __omp_offloading_34_b871a7d5_main_l301
                   6.78%  65.516ms        10  6.5516ms  6.5382ms  6.5763ms  __omp_offloading_34_b871a7d5_main_l295
                   6.77%  65.399ms        10  6.5399ms  6.5319ms  6.5495ms  __omp_offloading_34_b871a7d5_main_l289
                   0.68%  6.6106ms         1  6.6106ms  6.6106ms  6.6106ms  __omp_offloading_34_b871a7d5_main_l264
                   0.41%  3.9659ms         1  3.9659ms  3.9659ms  3.9659ms  __omp_offloading_34_b871a7d5_main_l245
                   0.00%  1.1200us         1  1.1200us  1.1200us  1.1200us  [CUDA memcpy HtoD]
     API calls:   51.12%  678.90ms         9  75.434ms  24.859us  248.70ms  cuMemcpyDtoH
                  22.40%  297.51ms        42  7.0835ms  4.0042ms  7.6802ms  cuCtxSynchronize
                  20.31%  269.72ms         1  269.72ms  269.72ms  269.72ms  cuCtxCreate
                   5.32%  70.631ms         1  70.631ms  70.631ms  70.631ms  cuCtxDestroy
                   0.46%  6.1607ms         1  6.1607ms  6.1607ms  6.1607ms  cuModuleLoadDataEx
                   0.28%  3.7628ms         1  3.7628ms  3.7628ms  3.7628ms  cuModuleUnload
                   0.10%  1.2977ms        42  30.898us  13.930us  60.092us  cuLaunchKernel
                   0.00%  56.142us        42  1.3360us     677ns  2.0930us  cuFuncGetAttribute
                   0.00%  43.957us        46     955ns     454ns  1.7670us  cuCtxSetCurrent
                   0.00%  15.179us         1  15.179us  15.179us  15.179us  cuMemcpyHtoD
                   0.00%  7.2780us        10     727ns     358ns  1.4760us  cuModuleGetGlobal
                   0.00%  6.9910us         2  3.4950us  2.2660us  4.7250us  cuDeviceGetPCIBusId
                   0.00%  5.7500us         6     958ns     333ns  3.5270us  cuModuleGetFunction
                   0.00%  3.7530us         9     417ns     184ns  1.0850us  cuDeviceGetAttribute
                   0.00%  2.6790us         3     893ns     370ns  1.9300us  cuDeviceGetCount
                   0.00%  2.0090us         3     669ns     484ns     767ns  cuDeviceGet

The memcpy comes from a target update to verify the results on the host. It's not included in the measurement itself, so STREAM only evaluates the kernel execution time:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          190819.6     0.006781     0.006708     0.006841
Scale:         189065.7     0.006800     0.006770     0.006831
Add:           253831.7     0.007616     0.007564     0.007646
Triad:         253432.3     0.007668     0.007576     0.007737