For the OpenMP NVPTX toolchain, choose a default distribute schedule that ensures coalesced memory accesses on the GPU when in SPMD mode. This significantly increases the performance of offloaded target code and reduces the number of registers used on the GPU side.
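To illustrate the coalescing argument, here is a minimal sketch (not code from the patch; the function and variable names are hypothetical). With a distribute chunk equal to the thread-block size, the inner parallel for leaves one iteration per thread, so consecutive threads of a warp touch consecutive array elements and the accesses coalesce into few memory transactions.

```c
// Hypothetical kernel. Under the new SPMD defaults this behaves as if
// dist_schedule(static, <threads per block>) were written explicitly:
// each team gets a chunk of <threads per block> iterations, thread t of
// the block handles element chunk_base + t, so adjacent lanes of a warp
// access adjacent addresses (a coalesced pattern).
void add_one(double *a, int n) {
#pragma omp target teams distribute parallel for map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] += 1.0;
}
```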
Details
- Reviewers
ABataev, caomhin, Hahnfeld
- Commits
- rG02650d4c2cd0: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode…
- rL343253: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode…
- rC343253: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode…
Diff Detail
- Repository
- rC Clang
Event Timeline
Should we also change the default schedule to static, 1? I know that's not really needed for teams distribute parallel for (because the new default dist_schedule already leaves only one iteration per thread), but that doesn't happen for target parallel for. Additionally, it would make the intent more explicit, and LLVM wouldn't need to look through the divisions needed to implement static without a chunk size. Just thinking aloud, not sure if that's worth it.
That is the intention. I just took that part out of this patch to avoid confusing things, since I wanted to have it as a separate patch. :)
Ok, perfect. I was probably confused because the title still speaks of "schedules" (plural).
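For context on the exchange above, the two cases might be spelled out as follows (a hypothetical sketch, not code from the patch):

```c
// Case 1: "teams distribute parallel for". The new default dist_schedule
// hands each team a chunk of <threads per block> iterations, so the inner
// "parallel for" has exactly one iteration per thread and its schedule
// barely matters.
void case_teams(double *a, int n) {
#pragma omp target teams distribute parallel for map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= 2.0;
}

// Case 2: "target parallel for". There is no distribute level, so the
// inner schedule decides the mapping; an explicit schedule(static, 1)
// assigns iteration i to thread i % nthreads, keeping neighboring
// threads on neighboring elements (coalesced).
void case_parallel(double *a, int n) {
#pragma omp target parallel for schedule(static, 1) map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= 2.0;
}
```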
Just tested this and got very weird results for register usage:
```c
void func(double *a) {
#pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
  for (int i = 0; i < 100; i++) {
    a[i]++;
  }
}
```
Compiling with current trunk for sm_60 (Pascal): 29 registers
Adding dist_schedule(static) (the previous default): 19 registers
For reference: dist_schedule(static, 128) also uses 29 registers
Any ideas? This significantly slows down STREAM...
Jonas, without an explicit dist_schedule clause the program will run with dist_schedule(static, <number of threads in block>). It looks like that happens fine, since you get the same register count in the explicit static-chunk variant as in the default case.
The difference you see in register count is (I suspect) driven by the runtime code (fewer registers for non-chunked than for chunked). I am currently investigating this and trying to find ways to reduce this number.
One big problem your code has is that the trip count is incredibly small, especially for STREAM and especially on GPUs. You need a much larger loop size, otherwise the timings will be dominated by OpenMP setup costs.
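For reference, a STREAM-style kernel with a realistically large trip count might look like the sketch below (names are hypothetical; the 80,000,000-element size matches the runs reported later in this thread):

```c
#define N 80000000  // large trip count so kernel time dominates OpenMP setup cost

// STREAM-style triad: a[i] = b[i] + scalar * c[i]. At this size the
// per-launch overhead is negligible relative to the memory traffic.
void triad(double *a, const double *b, const double *c, double scalar) {
#pragma omp target teams distribute parallel for \
    map(tofrom: a[0:N]) map(to: b[0:N], c[0:N])
  for (int i = 0; i < N; i++)
    a[i] = b[i] + scalar * c[i];
}
```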
Sure, I'm not that dumb. The real code has larger loops; this was just for demonstration purposes. I don't expect the register count to change based on loop size - is that too optimistic?
I checked the different combinations of schedules, and the current defaults are the fastest: the old defaults are about 10x slower than the current set (dist_schedule(static, <num threads>) and schedule(static, 1)). The register allocation looks strange, but it's just a consequence of using different schedules.
You report a slowdown which I am actually not able to reproduce. Do you use any additional clauses not present in your previous post?
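For clarity, the old and new defaults written out explicitly might look like this (a sketch, not from the patch; the block size of 128 threads is an assumption):

```c
// Old defaults spelled out: one contiguous chunk per team and per thread.
void old_defaults(double *a, int n) {
#pragma omp target teams distribute parallel for map(tofrom: a[0:n]) \
    dist_schedule(static) schedule(static)
  for (int i = 0; i < n; i++)
    a[i]++;
}

// New defaults spelled out: a chunk of <threads per block> (assumed 128
// here) per team, and one iteration per thread in the inner loop.
void new_defaults(double *a, int n) {
#pragma omp target teams distribute parallel for map(tofrom: a[0:n]) \
    dist_schedule(static, 128) schedule(static, 1)
  for (int i = 0; i < n; i++)
    a[i]++;
}
```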
No, only dist_schedule(static), which is faster. Tested on a Tesla P100 with today's trunk version:
| Directive | Bandwidth |
| --- | --- |
| #pragma omp target teams distribute parallel for (new defaults) | 190 - 250 GB/s |
| same, adding clauses for the old defaults: schedule(static) dist_schedule(static) | 30 - 50 GB/s |
| same, with only dist_schedule(static) added (fewer registers) | 320 - 400 GB/s |
Sorry, forgot to mention: I'm using the original STREAM code with 80,000,000 double elements in each vector.
Output from nvprof:
```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   70.05%  676.71ms         9  75.191ms  1.3760us  248.09ms  [CUDA memcpy DtoH]
                    7.67%  74.102ms        10  7.4102ms  7.3948ms  7.4220ms  __omp_offloading_34_b871a7d5_main_l307
                    7.63%  73.679ms        10  7.3679ms  7.3457ms  7.3811ms  __omp_offloading_34_b871a7d5_main_l301
                    6.78%  65.516ms        10  6.5516ms  6.5382ms  6.5763ms  __omp_offloading_34_b871a7d5_main_l295
                    6.77%  65.399ms        10  6.5399ms  6.5319ms  6.5495ms  __omp_offloading_34_b871a7d5_main_l289
                    0.68%  6.6106ms         1  6.6106ms  6.6106ms  6.6106ms  __omp_offloading_34_b871a7d5_main_l264
                    0.41%  3.9659ms         1  3.9659ms  3.9659ms  3.9659ms  __omp_offloading_34_b871a7d5_main_l245
                    0.00%  1.1200us         1  1.1200us  1.1200us  1.1200us  [CUDA memcpy HtoD]
      API calls:   51.12%  678.90ms         9  75.434ms  24.859us  248.70ms  cuMemcpyDtoH
                   22.40%  297.51ms        42  7.0835ms  4.0042ms  7.6802ms  cuCtxSynchronize
                   20.31%  269.72ms         1  269.72ms  269.72ms  269.72ms  cuCtxCreate
                    5.32%  70.631ms         1  70.631ms  70.631ms  70.631ms  cuCtxDestroy
                    0.46%  6.1607ms         1  6.1607ms  6.1607ms  6.1607ms  cuModuleLoadDataEx
                    0.28%  3.7628ms         1  3.7628ms  3.7628ms  3.7628ms  cuModuleUnload
                    0.10%  1.2977ms        42  30.898us  13.930us  60.092us  cuLaunchKernel
                    0.00%  56.142us        42  1.3360us     677ns  2.0930us  cuFuncGetAttribute
                    0.00%  43.957us        46     955ns     454ns  1.7670us  cuCtxSetCurrent
                    0.00%  15.179us         1  15.179us  15.179us  15.179us  cuMemcpyHtoD
                    0.00%  7.2780us        10     727ns     358ns  1.4760us  cuModuleGetGlobal
                    0.00%  6.9910us         2  3.4950us  2.2660us  4.7250us  cuDeviceGetPCIBusId
                    0.00%  5.7500us         6     958ns     333ns  3.5270us  cuModuleGetFunction
                    0.00%  3.7530us         9     417ns     184ns  1.0850us  cuDeviceGetAttribute
                    0.00%  2.6790us         3     893ns     370ns  1.9300us  cuDeviceGetCount
                    0.00%  2.0090us         3     669ns     484ns     767ns  cuDeviceGet
```
The memcpy comes from a target update to verify the results on the host. It's not included in the measurement itself, so STREAM only evaluates the kernel execution time:
```
Function    Best Rate MB/s  Avg time  Min time  Max time
Copy:            190819.6   0.006781  0.006708  0.006841
Scale:           189065.7   0.006800  0.006770  0.006831
Add:             253831.7   0.007616  0.007564  0.007646
Triad:           253432.3   0.007668  0.007576  0.007737
```
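The verification step mentioned above can be sketched roughly as follows (hypothetical; the actual STREAM harness differs):

```c
// Hypothetical host-side check, declared elsewhere.
void check_results(const double *a, const double *b, const double *c, int n);

// Copy the device results back to the host outside the timed region,
// then validate them there; none of this counts toward the kernel time.
void verify(double *a, double *b, double *c, int n) {
#pragma omp target update from(a[0:n], b[0:n], c[0:n])
  check_results(a, b, c, n);
}
```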
Modify it to be a reference rather than a pointer.