This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Add experimental nesting mode
ClosedPublic

Authored by tlwilmar on May 10 2021, 12:59 PM.

Details

Summary

Nesting mode is a new experimental feature in the OpenMP runtime. It allows a user to set up nesting for an application in a way that corresponds to the hardware topology levels on the machine an application is being run on. For example, if a machine has 2 sockets, each with 12 cores, then use of nesting mode could set up an outer level of nesting that uses 2 threads per parallel region, and an inner level of nesting that uses 12 threads per parallel region.

Nesting mode is controlled with the KMP_NESTING_MODE environment variable as follows:

  1. KMP_NESTING_MODE = 0: nesting mode is off (the default); max-active-levels-var is set to 1, so nested parallel regions are serialized.
  2. KMP_NESTING_MODE = 1: nesting mode is on, and a number of threads will be assigned for each level discovered in the machine topology; max-active-levels-var is set to the number of levels discovered.

If the user sets OMP_NUM_THREADS or OMP_MAX_ACTIVE_LEVELS, they will override KMP_NESTING_MODE settings for the associated environment variables. The detected topology may be limited by an affinity mask setting on the initial thread, or if the user sets KMP_HW_SUBSET. See also: KMP_HOT_TEAMS_MAX_LEVEL for controlling use of hot teams for nested parallel regions. Note that this feature only sets numbers of threads used at nesting levels. The user should make use of OMP_PLACES and OMP_PROC_BIND or KMP_AFFINITY for affinitizing those threads, if desired.

Diff Detail

Event Timeline

tlwilmar requested review of this revision.May 10 2021, 12:59 PM
tlwilmar created this revision.
jdoerfert retitled this revision from Add experimental nesting mode to [OpenMP] Add experimental nesting mode.May 11 2021, 2:53 PM

There's a bunch of potentially worthwhile topology descriptions we might want. Some systems do big/little cores. Some x64 chips have numa, e.g. the threadripper cores that don't have direct access to dram. There's also the mapping from offload target to cores, where some cores may have a better/direct connection to some GPUs.

N sockets with ~equal chips and lower bandwidth between them seems a good start point. E.g. a recent epyc probably wants to present as 8 top level nodes, each with 8 cores.

ye-luo added a subscriber: ye-luo.May 11 2021, 4:51 PM

Why "12 threads per parallel region" instead of 24? I thought hyper-threads are used by default. Maybe you are assuming no SMT in your example? It would be helpful to make that clear.

> Why "12 threads per parallel region" instead of 24? I thought hyper-threads are used by default. Maybe you are assuming no SMT in your example? It would be helpful to make that clear.

Yes, it works that way if SMT is available. So, if there are 2 threads available per core, it would allow for a third nesting level, with size 2.

> There's a bunch of potentially worthwhile topology descriptions we might want. Some systems do big/little cores. Some x64 chips have numa, e.g. the threadripper cores that don't have direct access to dram. There's also the mapping from offload target to cores, where some cores may have a better/direct connection to some GPUs.
>
> N sockets with ~equal chips and lower bandwidth between them seems a good start point. E.g. a recent epyc probably wants to present as 8 top level nodes, each with 8 cores.

For now, this is a shortcut to enable a "reasonable" nesting set-up for fairly simple topologies, aimed at users who want to use nesting in a portable fashion without really knowing what hardware the code might be running on. Existing features (KMP_HW_SUBSET, a list value for OMP_NUM_THREADS, KMP_AFFINITY, OMP_PLACES, OMP_PROC_BIND, and the APIs for setting the associated ICVs) can all be used to build a suitable nest on specific, more complex hardware. But I'd welcome improvements to this that use topology info to design a better nest for such hardware.
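As an illustration of that manual route, a two-level nest on a 2-socket, 12-core machine could be set up with the existing controls like this (the values are example assumptions for that topology, not recommendations):

```shell
# Manual two-level nesting via existing controls (illustrative values).
export OMP_MAX_ACTIVE_LEVELS=2     # allow two active nesting levels
export OMP_NUM_THREADS=2,12        # outer level: 2 threads; inner level: 12
export OMP_PLACES=cores            # one place per physical core
export OMP_PROC_BIND=spread,close  # spread the outer team, pack inner teams
echo "$OMP_NUM_THREADS"
```

Nesting mode automates exactly this kind of setup by deriving the per-level counts from the discovered topology instead of requiring the user to know them.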

> Why "12 threads per parallel region" instead of 24? I thought hyper-threads are used by default. Maybe you are assuming no SMT in your example? It would be helpful to make that clear.
>
> Yes, it works that way if SMT is available. So, if there are 2 threads available per core, it would allow for a third nesting level, with size 2.

I should also add, since it uses whatever topology hwloc discovers, if it sees NUMA nodes, tiles, etc., it adds levels for those too -- any level in the topology that is distinct and has more than one item gets a nesting level.

tlwilmar updated this revision to Diff 347136.May 21 2021, 3:18 PM

This update adds a few changes that were missed in the original patch: a condition fix in the init function, and a topology method check to make use of hwloc if available.

Nawrin added a subscriber: Nawrin.Jun 2 2021, 12:25 PM
Nawrin added inline comments.
openmp/runtime/src/kmp_runtime.cpp
8740

Could you please add a note about what KMP_NESTING_MODE > 1 means in the top where you described KMP_NESTING_MODE 0 and 1?

tlwilmar updated this revision to Diff 349864.Jun 4 2021, 7:26 AM

Nawrin -- I added a comment about it. I'm hesitant to document that part of this experimental feature in its current form -- it's needed for experimentation at the moment, but is likely to change in syntax or otherwise. It's for having an app use a limited outer level of parallelism, and then library calls into selected libraries can turn on nesting and avoid oversubscription. So, I've left out that option in the commit message, but added the comment to the code, emphasizing that it is an experimental option on an experimental feature.

Nawrin accepted this revision.Jun 4 2021, 7:32 AM

LGTM

This revision is now accepted and ready to land.Jun 4 2021, 7:32 AM
This revision was landed with ongoing or failed builds.Jun 4 2021, 2:01 PM
This revision was automatically updated to reflect the committed changes.