This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Improve default block count selection for low block counts
Closed · Public

Authored by jdoerfert on Jun 2 2023, 10:48 AM.

Details

Summary

If a combined loop has insufficient parallelism (i.e., a low trip count), we
might end up with too few teams/blocks. To counter that, we can reduce
the number of threads per team. This patch implements such a heuristic
and exposes a new environment variable to control the minimum number of
threads to be employed in this case.
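
To make the idea concrete, here is a minimal standalone sketch of such a heuristic. It is not the code from the patch: the names `LaunchConfig`, `pickLaunchConfig`, `powerOf2Ceil`, and `ceilDiv` are hypothetical, and the three-way case split is an assumption pieced together from the review discussion (honor the thread limit, round the "middle" case up to a power of two, fall back to the minimum otherwise). All inputs are assumed to be positive.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical result type for this sketch (not taken from the patch).
struct LaunchConfig {
  uint64_t NumBlocks;
  uint32_t NumThreads;
};

// Smallest power of two >= V (returns 1 for V <= 1). Assumes V <= 2^63.
static uint64_t powerOf2Ceil(uint64_t V) {
  uint64_t P = 1;
  while (P < V)
    P <<= 1;
  return P;
}

// Integer division rounding up. Assumes B > 0.
static uint64_t ceilDiv(uint64_t A, uint64_t B) { return (A + B - 1) / B; }

// ThreadLimit is the upper bound on threads per team (for example from a
// thread_limit clause); MinThreads is the configurable lower bound.
LaunchConfig pickLaunchConfig(uint64_t TripCount, uint64_t DefaultNumBlocks,
                              uint32_t ThreadLimit, uint32_t MinThreads) {
  // Only ever lower the thread count; never exceed the given limit.
  MinThreads = std::min(MinThreads, ThreadLimit);
  uint32_t NumThreads = ThreadLimit;

  if (TripCount >= DefaultNumBlocks * ThreadLimit) {
    // Enough iterations to fill the default number of full-size teams.
  } else if (TripCount >= DefaultNumBlocks * MinThreads) {
    // "Middle" case: spread the iterations over the default number of
    // blocks, then round the per-block thread count up to a power of two.
    uint64_t PerBlock = ceilDiv(TripCount, DefaultNumBlocks);
    NumThreads = static_cast<uint32_t>(
        std::min<uint64_t>(ThreadLimit, powerOf2Ceil(PerBlock)));
  } else {
    // Too few iterations even at the minimum: use the smallest team size.
    NumThreads = MinThreads;
  }

  // One loop iteration per thread, rounded up to whole blocks.
  return {ceilDiv(TripCount, NumThreads), NumThreads};
}
```

With 100 default blocks, a limit of 256 threads, and a minimum of 32, a trip count of 5000 falls into the middle case in this sketch: 50 iterations per block round up to 64 threads, giving 79 blocks instead of the 20 that full 256-thread teams would produce.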

Issue reported by:
Felipe Cabarcas Jaramillo <cabarcas@udel.edu> (@fel-cab).

Event Timeline

jdoerfert created this revision. Jun 2 2023, 10:48 AM
Herald added a project: Restricted Project. Jun 2 2023, 10:48 AM
jdoerfert requested review of this revision. Jun 2 2023, 10:48 AM
jhuber6 added inline comments. Jun 2 2023, 10:56 AM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
800

Shouldn't this correspond to the warp / wavefront size? On NVPTX it's 32 but on AMDGPU it could be 32 or 64. You can check using HSA.

jdoerfert added inline comments. Jun 2 2023, 10:58 AM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
800

Not necessarily. AMD doesn't even have one 64 wide wave anyway, IIRC. We are running some tests on AMD hardware right now, will adjust if 64 comes back better.

This also breaks thread_limit, right?

#pragma omp target teams thread_limit(16)
#pragma omp parallel
jdoerfert added inline comments. Jun 2 2023, 2:46 PM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
339

@tianshilei1992 Yes, I missed a std::min here, will fix that in the final version.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
800

Results are in for Frontier. 8,16,32 are all "the same" for the code, 64 is worse. 32 is the winner (so far).

jdoerfert updated this revision to Diff 528003. Jun 2 2023, 2:58 PM

Ensure thread_limit is honored.

fel-cab commandeered this revision. Jun 2 2023, 3:07 PM
fel-cab added a reviewer: jdoerfert.
fel-cab added a subscriber: fel-cab.

I have tested it on Frontier with SPECacc 552.pep with different values of
LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT:

Env                 Execution_Time(secs)
4                   7
8                   5
16                  5
32                  5
64                  10
128                 17
256                 30
Without the patch   30
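
For readers who want to experiment with the variable, the following is a rough sketch of how a runtime could read such a setting. It is not the plugin's actual code: `parseUInt32Or` and `minThreadsForLowTripCount` are hypothetical helpers, and the fallback of 32 is an assumption based on the measurements in this thread.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Parse a plain decimal number. Returns Default when the text is missing,
// does not start with a digit, has trailing characters, or does not fit
// into 32 bits.
uint32_t parseUInt32Or(const char *Text, uint32_t Default) {
  if (!Text || !std::isdigit(static_cast<unsigned char>(*Text)))
    return Default;
  errno = 0;
  char *End = nullptr;
  unsigned long long V = std::strtoull(Text, &End, 10);
  if (errno != 0 || *End != '\0' ||
      V > std::numeric_limits<uint32_t>::max())
    return Default;
  return static_cast<uint32_t>(V);
}

// Read the minimum thread count from the environment. The default of 32
// is an assumption for this sketch.
uint32_t minThreadsForLowTripCount() {
  return parseUInt32Or(
      std::getenv("LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT"), 32);
}
```

Keeping the parsing separate from the `std::getenv` call makes the fallback behavior easy to check without touching the process environment.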
jdoerfert commandeered this revision. Jun 2 2023, 3:24 PM
jdoerfert updated this revision to Diff 528019.
jdoerfert edited reviewers, added: fel-cab; removed: jdoerfert.

Force a power of two for the "middle" case, ensure thread_limit is honored.

This revision is now accepted and ready to land. Jun 2 2023, 4:07 PM
Herald added a project: Restricted Project. Jun 5 2023, 4:36 PM