This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Optimized default kernel launch parameters in CUDA plugin
ClosedPublic

Authored by grokos on Apr 20 2017, 4:15 PM.

Details

Summary

This patch modifies the default target kernel launch parameters (num_teams and thread_limit). The default thread_limit is set to 128 threads per team. In SPMD mode the kernel is launched with 128 threads; in non-SPMD mode we use 96 threads (+32 for the master warp).

The default number of teams has been optimized as follows. For the constructs below:

#target teams distribute
#teams distribute
#target teams distribute simd
#teams distribute simd

if the associated loop trip count is N, then the kernel is launched with N teams.
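The defaults described in this summary can be sketched as a small selection function. This is only an illustration under stated assumptions, not the code in rtl.cpp: the names LaunchConfig, default_launch_config, fallback_num_teams and max_teams are hypothetical, the cap on the number of teams is an assumption, and the reading that non-SPMD mode launches 96 worker threads plus a 32-thread master warp (128 in total) is an interpretation of the summary.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical result type; these are not identifiers from rtl.cpp.
struct LaunchConfig {
  std::uint64_t num_teams;  // CUDA blocks per grid
  int worker_threads;       // threads available to parallel regions
  int launched_threads;     // CUDA threads per block actually launched
};

constexpr int kDefaultThreadLimit = 128;  // default thread_limit from the summary
constexpr int kMasterWarpSize = 32;       // master warp in non-SPMD mode

// loop_trip_count: trip count N of the associated distribute loop, 0 if unknown.
// fallback_num_teams, max_teams: assumed inputs, not taken from the patch.
inline LaunchConfig default_launch_config(bool is_spmd,
                                          std::uint64_t loop_trip_count,
                                          std::uint64_t fallback_num_teams,
                                          std::uint64_t max_teams) {
  LaunchConfig cfg{};
  // Assumed reading: both modes launch 128 threads per team by default.
  cfg.launched_threads = kDefaultThreadLimit;
  // In non-SPMD mode one warp is reserved as the master warp, leaving 96 workers.
  cfg.worker_threads =
      is_spmd ? kDefaultThreadLimit : kDefaultThreadLimit - kMasterWarpSize;
  // One team per loop iteration when the trip count is known.
  const std::uint64_t teams =
      loop_trip_count > 0 ? loop_trip_count : fallback_num_teams;
  // Assumed safeguard: never exceed the device's block limit.
  cfg.num_teams = std::min(teams, max_teams);
  return cfg;
}
```

For example, a non-SPMD kernel whose distribute loop has 1000 iterations would, under these assumptions, get 1000 teams with 96 worker threads each.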

Diff Detail

Repository
rL LLVM

Event Timeline

grokos created this revision. Apr 20 2017, 4:15 PM
Hahnfeld edited edge metadata. Apr 20 2017, 11:15 PM

Does this change result in a lower runtime? Last time I tested clang-ykt on Pascal GPUs, 1024 threads were really the best thing to do...

libomptarget/plugins/cuda/src/rtl.cpp
594–598 ↗(On Diff #96044)

Just move this code under if (thread_limit > 0)?

622–624 ↗(On Diff #96044)

So each block executes one iteration? What is left for the threads in each block?

arpith-jacob edited edge metadata. Apr 21 2017, 5:12 PM

Hi Jonas,

The numbers are based on my testing of the Rodinia benchmarks on a K40m.

We don't have a working compiler on Pascal yet (many of the omptests fail on Pascal), so I have not benchmarked on that GPU. Our compiler exposes a bug in the CUDA toolkit that is being fixed. It is possible that 1024 threads perform better on Pascal (these are, of course, heuristics), so we should extend the functionality here once Pascal support is added to the compiler/runtime.

libomptarget/plugins/cuda/src/rtl.cpp
622–624 ↗(On Diff #96044)

Correct.

This case is for the 'teams distribute' construct. The assumption is that there is a nested parallel construct in which the threads within the block participate. Example:

#pragma omp target teams distribute
for(...) {
  #pragma omp parallel for reduction(..)
  for(..) {}
}
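A concrete instance of this pattern, as a rough sketch: each team takes one row of a matrix (one iteration of the distribute loop), and the threads of that team share the nested reduction over the row. The function name row_sums, the row-major layout and the map clauses are assumptions made for illustration and do not come from the patch. Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <vector>

// Hypothetical example: sums each row of a row-major rows x cols matrix.
std::vector<double> row_sums(const std::vector<double>& m, int rows, int cols) {
  std::vector<double> out(rows, 0.0);
  const double* in = m.data();  // raw pointers so array sections can be mapped
  double* res = out.data();

  // One team per outer iteration: with trip count N = rows, the new default
  // described in this revision would launch the kernel with N teams.
  #pragma omp target teams distribute map(to: in[0:rows * cols]) map(from: res[0:rows])
  for (int r = 0; r < rows; ++r) {
    double sum = 0.0;
    // The threads within the team (CUDA block) cooperate on the inner loop.
    #pragma omp parallel for reduction(+ : sum)
    for (int c = 0; c < cols; ++c) {
      sum += in[r * cols + c];
    }
    res[r] = sum;
  }
  return out;
}
```

Here the outer loop alone would leave all but one thread per block idle; the nested parallel construct is what gives the remaining threads work.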

grokos updated this revision to Diff 96567. Apr 25 2017, 8:10 AM
grokos marked an inline comment as done.

Added inline comments to clarify the new default launch configuration.

This revision is now accepted and ready to land. Apr 25 2017, 9:41 AM
This revision was automatically updated to reflect the committed changes.