diff --git a/openmp/docs/design/Runtimes.rst b/openmp/docs/design/Runtimes.rst
--- a/openmp/docs/design/Runtimes.rst
+++ b/openmp/docs/design/Runtimes.rst
@@ -12,6 +12,633 @@
 the LLVM/OpenMP host runtime, aka. `libomp.so`, is available as a `pdf `_.

+.. _libomp_environment_vars:
+
+Environment Variables
+^^^^^^^^^^^^^^^^^^^^^
+
+OMP_CANCELLATION
+""""""""""""""""
+
+Enables cancellation of the innermost enclosing region of the type specified.
+If set to ``true``, the effects of the cancel construct and of cancellation
+points are enabled and cancellation is activated. If set to ``false``,
+cancellation is disabled and the cancel construct and cancellation points are
+effectively ignored.
+
+.. note::
+    Internal barrier code will work differently depending on whether cancellation
+    is enabled. Barrier code should repeatedly check the global flag to figure
+    out if cancellation has been triggered. If a thread observes cancellation, it
+    should leave the barrier prematurely with the return value 1 (and may wake up
+    other threads). Otherwise, it should leave the barrier with the return value 0.
+
+Enables (``true``) or disables (``false``) cancellation of the innermost
+enclosing region of the type specified.
+
+**Default:** ``false``
+
+
+OMP_DISPLAY_ENV
+"""""""""""""""
+
+Enables (``true``) or disables (``false``) the printing to ``stderr`` of
+the OpenMP version number and the values associated with the OpenMP
+environment variables.
+
+Possible values are: ``true``, ``false``, or ``verbose``.
+
+**Default:** ``false``
+
+OMP_DEFAULT_DEVICE
+""""""""""""""""""
+
+Sets the device that will be used in a target region. The OpenMP routine
+``omp_set_default_device`` or a ``device`` clause on a ``target`` construct can
+override this variable. If no device with the specified device number exists,
+the code is executed on the host. If this environment variable is not set,
+device number 0 is used.
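The fallback rules for ``OMP_DEFAULT_DEVICE`` can be sketched as a small model. This is illustrative Python, not the runtime's implementation: ``resolve_default_device`` and the ``HOST`` sentinel are hypothetical names, and the sketch assumes that neither ``omp_set_default_device`` nor a ``device`` clause has overridden the variable.

```python
HOST = "host"  # hypothetical sentinel meaning "execute on the host"


def resolve_default_device(env, num_devices):
    """Model which device a target region would use under the documented rules.

    env is a mapping of environment variables; num_devices is the number of
    available (non-host) devices, numbered 0 .. num_devices - 1.
    """
    # Rule 1: if the variable is not set, device number 0 is used.
    requested = int(env.get("OMP_DEFAULT_DEVICE", "0"))
    # Rule 2: if no device with that number exists, run on the host.
    if 0 <= requested < num_devices:
        return requested
    return HOST
```

For instance, with two devices available, an unset variable selects device 0, while ``OMP_DEFAULT_DEVICE=5`` falls back to the host in this model.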
+
+OMP_DYNAMIC
+"""""""""""
+
+Enables (``true``) or disables (``false``) the dynamic adjustment of the
+number of threads.
+
+| **Default:** ``false``
+
+OMP_MAX_ACTIVE_LEVELS
+"""""""""""""""""""""
+
+The maximum number of levels of parallel nesting for the program.
+
+| **Default:** ``1``
+
+OMP_NESTED
+""""""""""
+
+.. warning::
+    Deprecated. Please use ``OMP_MAX_ACTIVE_LEVELS`` to control nested parallelism.
+
+Enables (``true``) or disables (``false``) nested parallelism.
+
+| **Default:** ``false``
+
+OMP_NUM_THREADS
+"""""""""""""""
+
+Sets the maximum number of threads to use for OpenMP parallel regions if no
+other value is specified in the application.
+
+The value can be a single integer, in which case it specifies the number of threads
+for all parallel regions. The value can also be a comma-separated list of integers,
+in which case each integer specifies the number of threads for a parallel
+region at that particular nesting level.
+
+The first position in the list represents the outer-most parallel nesting level,
+the second position represents the next-inner parallel nesting level, and so on.
+At any level, the integer can be left out of the list. If the first integer in a
+list is left out, it implies the normal default value for threads is used at the
+outer-most level. If the integer is left out of any other level, the number of
+threads for that level is inherited from the previous level.
+
+| **Default:** The number of processors visible to the operating system on which the program is executed.
+| **Syntax:** ``OMP_NUM_THREADS=value[,value]*``
+| **Example:** ``OMP_NUM_THREADS=4,3``
+
+OMP_PLACES
+""""""""""
+
+Specifies an explicit ordered list of places, either as an abstract name
+describing a set of places or as an explicit list of places described by
+non-negative numbers. An exclusion operator, ``!``, can also be used to exclude
+the number or place immediately following the operator.
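To make explicit places concrete, the sketch below expands the interval form (lower bound, length, optional stride) that this section's examples, such as ``{0:4:2}`` and ``{0:4}:4:4``, rely on. It is a minimal model under stated assumptions, not the runtime's parser: ``expand_interval`` and ``expand_place_list`` are hypothetical helper names, inputs are already-parsed integers, and the ``!`` exclusion operator is not handled.

```python
def expand_interval(lower, length, stride=1):
    """Expand lower:length[:stride] into the numbers of one place."""
    # lower, lower + stride, ..., lower + (length - 1) * stride
    return [lower + i * stride for i in range(length)]


def expand_place_list(place, length, stride=1):
    """Expand {place}:length[:stride] into a list of places.

    Each successive place is the original place with every number
    offset by a multiple of the stride.
    """
    return [[n + i * stride for n in place] for i in range(length)]
```

Under this model, ``expand_interval(0, 4, 2)`` gives the same place as ``{0,2,4,6}``, and ``expand_place_list(expand_interval(0, 4), 4, 4)`` gives the four places covering processors 0 through 15 that ``{0:4}:4:4`` denotes.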
+
+For **explicit lists**, an ordered list of places is specified with each place
+represented as a set of non-negative numbers. The non-negative numbers represent
+operating system logical processor numbers and can be thought of as an OS affinity mask.
+
+Individual places can be specified through two methods.
+Both the **examples** below represent the same place.
+
+* An explicit list of comma-separated non-negative numbers. **Example:** ``{0,2,4,6}``
+* An interval with notation ``<lower-bound>:<length>[:<stride>]``. **Example:** ``{0:4:2}``. When ``<stride>`` is omitted, a unit stride is assumed.
+  The interval notation represents this set of numbers:
+
+::
+
+    <lower-bound>, <lower-bound> + <stride>, ..., <lower-bound> + (<length> - 1) * <stride>
+
+
+A place list can also be specified using the same interval
+notation: ``{place}:<length>[:<stride>]``.
+This represents the list of length ``<length>`` places determined by the following:
+
+.. code-block:: c
+
+    {place}, {place} + <stride>, ..., {place} + (<length>-1)*<stride>
+    Where given {place} and integer N, {place} + N = {place with every number offset by N}
+    Example: {0,3,6}:4:1 represents {0,3,6}, {1,4,7}, {2,5,8}, {3,6,9}
+
+**Examples of explicit lists:**
+These all represent the same set of places
+
+::
+
+    OMP_PLACES="{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}"
+    OMP_PLACES="{0:4},{4:4},{8:4},{12:4}"
+    OMP_PLACES="{0:4}:4:4"
+
+.. note::
+    When specifying a place using a set of numbers, if any number cannot be
+    mapped to a processor on the target platform, then that number is
+    ignored within the place, but the rest of the place is kept intact.
+    If all numbers within a place are invalid, then the entire place is removed
+    from the place list, but the rest of the place list is kept intact.
+
+The **abstract names** listed below are understood by the run-time environment:
+
+* ``threads:`` Each place corresponds to a single hardware thread.
+* ``cores:`` Each place corresponds to a single core (having one or more hardware threads).
+* ``sockets:`` Each place corresponds to a single socket (consisting of one or more cores).
+* ``numa_domains:`` Each place corresponds to a single NUMA domain (consisting of one or more cores).
+* ``ll_caches:`` Each place corresponds to a last-level cache (consisting of one or more cores).
+
+The abstract name may be followed by a positive number in parentheses to
+denote the length of the place list to be created, that is ``abstract_name(num-places)``.
+If the optional number isn't specified, then the runtime will use all available
+resources of type ``abstract_name``. When requesting fewer places than available
+on the system, the first available resources as determined by ``abstract_name``
+are used. When requesting more places than available on the system, only the
+available resources are used.
+
+**Examples of abstract names:**
+::
+
+    OMP_PLACES=threads
+    OMP_PLACES=threads(4)
+
+OMP_PROC_BIND (Windows, Linux)
+""""""""""""""""""""""""""""""
+Sets the thread affinity policy to be used for parallel regions at the
+corresponding nested level. Enables (``true``) or disables (``false``)
+the binding of threads to processor contexts. If enabled, this is the
+same as specifying ``KMP_AFFINITY=scatter``. If disabled, this is the
+same as specifying ``KMP_AFFINITY=none``.
+
+**Acceptable values:** ``true``, ``false``, or a comma-separated list, each
+element of which is one of the following values: ``master``, ``close``, ``spread``, or ``primary``.
+
+**Default:** ``false``
+
+.. warning::
+    ``master`` is deprecated. The semantics of ``master`` are the same as ``primary``.
+
+If set to ``false``, the execution environment may move OpenMP threads between
+OpenMP places, thread affinity is disabled, and ``proc_bind`` clauses on
+parallel constructs are ignored. Otherwise, the execution environment should
+not move OpenMP threads between OpenMP places, thread affinity is enabled, and
+the initial thread is bound to the first place in the OpenMP place list.
+
+If set to ``primary``, all threads are bound to the same place as the primary
+thread.
+
+If set to ``close``, threads are bound to successive places, near where the
+primary thread is bound.
+
+If set to ``spread``, the primary thread's partition is subdivided and threads
+are bound to single-place successive sub-partitions.
+
+| **Related environment variables:** ``KMP_AFFINITY`` (overrides ``OMP_PROC_BIND``).
+
+OMP_SCHEDULE
+""""""""""""
+Sets the run-time schedule type and an optional chunk size.
+
+| **Default:** ``static``, no chunk size specified
+| **Syntax:** ``OMP_SCHEDULE="kind[,chunk_size]"``
+
+OMP_STACKSIZE
+"""""""""""""
+
+Sets the number of bytes to allocate for each OpenMP thread to use as the
+private stack for the thread. Recommended size is ``16M``.
+
+Use the optional suffixes ``B`` (bytes), ``K`` (kilobytes), ``M`` (megabytes),
+``G`` (gigabytes), or ``T`` (terabytes) to specify the units.
+If you specify a value without a suffix, the unit
+is assumed to be ``K`` (kilobytes).
+
+This variable does not affect the native operating system threads created by the
+user program, or the thread executing the sequential part of an OpenMP program.
+
+The ``kmp_{set,get}_stacksize_s()`` routines set/retrieve the value.
+The ``kmp_set_stacksize_s()`` routine must be called from the sequential part,
+before the first parallel region is created. Otherwise, calling
+``kmp_set_stacksize_s()`` has no effect.
+
+| **Default:**
+
+* 32-bit architecture: ``2M``
+* 64-bit architecture: ``4M``
+
+| **Related environment variables:** ``KMP_STACKSIZE`` (overrides ``OMP_STACKSIZE``).
+| **Example:** ``OMP_STACKSIZE=8M``
+
+OMP_THREAD_LIMIT
+""""""""""""""""
+
+Limits the number of simultaneously-executing threads in an OpenMP program.
+
+If this limit is reached and another native operating system thread encounters
+OpenMP API calls or constructs, the program can abort with an error message.
+If this limit is reached when an OpenMP parallel region begins, a one-time
+warning message might be generated indicating that the number of threads in
+the team was reduced, but the program will continue.
+
+The ``omp_get_thread_limit()`` routine returns the value of the limit.
+
+| **Default:** No enforced limit
+| **Related environment variable:** ``KMP_ALL_THREADS`` (overrides ``OMP_THREAD_LIMIT``).
+
+OMP_WAIT_POLICY
+"""""""""""""""
+
+Decides whether threads spin (active) or yield (passive) while they are waiting.
+``OMP_WAIT_POLICY=active`` is an alias for ``KMP_LIBRARY=turnaround``, and
+``OMP_WAIT_POLICY=passive`` is an alias for ``KMP_LIBRARY=throughput``.
+
+| **Default:** ``passive``
+
+.. note::
+    Although the default is ``passive``, unless the user has explicitly set
+    ``OMP_WAIT_POLICY``, there is a small period of active spinning determined
+    by ``KMP_BLOCKTIME``.
+
+KMP_AFFINITY (Windows, Linux)
+"""""""""""""""""""""""""""""
+
+Enables the run-time library to bind threads to physical processing units.
+
+You must set this environment variable before the first parallel region, or
+certain API calls including ``omp_get_max_threads()``, ``omp_get_num_procs()``
+and any affinity API calls.
+
+**Syntax:** ``KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]``
+
+``modifiers`` are optional strings consisting of a keyword and possibly a specifier
+
+* ``respect`` (default) and ``norespect`` - determine whether to respect the original process affinity mask.
+* ``verbose`` and ``noverbose`` (default) - determine whether to display affinity information.
+* ``warnings`` (default) and ``nowarnings`` - determine whether to display warnings during affinity detection.
+* ``granularity=<specifier>`` - takes the following specifiers ``thread``, ``core`` (default), ``tile``,
+  ``socket``, ``die``, ``group`` (Windows only).
+  The granularity describes the lowest topology levels that OpenMP threads are allowed to float within a topology map.
+  For example, if ``granularity=core``, then the OpenMP threads will be allowed to move between logical processors within
+  a single core. If ``granularity=thread``, then the OpenMP threads will be restricted to a single logical processor.
+* ``proclist=[<proc_list>]`` - The ``proc_list`` is specified by
+
++--------------------+----------------------------------------+
+| Value              | Description                            |
++====================+========================================+
+| <proc_list> :=     | <proc_id> | { <id_list> }              |
++--------------------+----------------------------------------+
+| <id_list> :=       | <proc_id> | <proc_id>,<id_list>        |
++--------------------+----------------------------------------+
+
+Where each ``proc_id`` represents an operating system logical processor ID.
+For example, ``proclist=[3,0,{1,2},{0,3}]`` with ``OMP_NUM_THREADS=4`` would place thread 0 on
+OS logical processor 3, thread 1 on OS logical processor 0, thread 2 on both OS logical
+processors 1 & 2, and thread 3 on OS logical processors 0 & 3.
+
+``type`` is the thread affinity policy to choose.
+Valid choices are ``none``, ``balanced``, ``compact``, ``scatter``, ``explicit``, ``disabled``.
+
+* type ``none`` (default) - Does not bind OpenMP threads to particular thread contexts;
+  however, if the operating system supports affinity, the compiler still uses the
+  OpenMP thread affinity interface to determine machine topology.
+  Specify ``KMP_AFFINITY=verbose,none`` to list a machine topology map.
+* type ``compact`` - Specifying compact assigns OpenMP thread ``<n>+1`` to a free thread
+  context as close as possible to the thread context where OpenMP thread ``<n>`` was
+  placed. For example, in a topology map, the nearer a node is to the root, the more
+  significance the node has when sorting the threads.
+* type ``scatter`` - Specifying scatter distributes the threads as evenly as
+  possible across the entire system. ``scatter`` is the opposite of ``compact``; so the
+  leaves of the node are most significant when sorting through the machine topology map.
+* type ``balanced`` - Places threads on separate cores until all cores have at least one thread,
+  similar to the ``scatter`` type. However, when the runtime must use multiple hardware thread
+  contexts on the same core, the balanced type ensures that the OpenMP thread numbers are close
+  to each other, which scatter does not do. This affinity type is supported on the CPU only for
+  single socket systems.
+* type ``explicit`` - Specifying explicit assigns OpenMP threads to a list of OS proc IDs that
+  have been explicitly specified by using the ``proclist`` modifier, which is required
+  for this affinity type.
+* type ``disabled`` - Specifying disabled completely disables the thread affinity interfaces.
+  This forces the OpenMP run-time library to behave as if the affinity interface was not
+  supported by the operating system. This includes the low-level API interfaces such
+  as ``kmp_set_affinity`` and ``kmp_get_affinity``, which have no effect and will return
+  a nonzero error code.
+
+For both ``compact`` and ``scatter``, ``permute`` and ``offset`` are allowed;
+however, if you specify only one integer, the runtime interprets the value as
+a permute specifier. **Both permute and offset default to 0.**
+
+The ``permute`` specifier controls which levels are most significant when sorting
+the machine topology map. A value for ``permute`` forces the mappings to make the
+specified number of most significant levels of the sort the least significant,
+and it inverts the order of significance. The root node of the tree is not
+considered a separate level for the sort operations.
+
+The ``offset`` specifier indicates the starting position for thread assignment.
+
+| **Default:** ``noverbose,warnings,respect,granularity=core,none``
+| **Related environment variable:** ``OMP_PROC_BIND`` (``KMP_AFFINITY`` takes precedence)
+
+.. note::
+    On Windows with multiple processor groups, the ``norespect`` affinity modifier
+    is assumed when the process affinity mask equals a single processor group
+    (which is default on Windows). Otherwise, the ``respect`` affinity modifier is used.
+
+.. note::
+    On Windows with multiple processor groups, if the granularity is too coarse, it
+    will be set to ``granularity=group``. For example, if two processor groups exist
+    across one socket, and ``granularity=socket`` is requested, the runtime will shift
+    the granularity down to group since that is the largest granularity allowed by the OS.
+
+KMP_ALL_THREADS
+"""""""""""""""
+
+Limits the number of simultaneously-executing threads in an OpenMP program.
+If this limit is reached and another native operating system thread encounters
+OpenMP API calls or constructs, then the program may abort with an error
+message. If this limit is reached at the time an OpenMP parallel region begins,
+a one-time warning message may be generated indicating that the number of
+threads in the team was reduced, but the program will continue execution.
+
+| **Default:** No enforced limit.
+| **Related environment variable:** ``OMP_THREAD_LIMIT`` (``KMP_ALL_THREADS`` takes precedence)
+
+KMP_BLOCKTIME
+"""""""""""""
+
+Sets the time, in milliseconds, that a thread should wait, after completing
+the execution of a parallel region, before sleeping.
+
+Use the optional character suffixes: ``s`` (seconds), ``m`` (minutes),
+``h`` (hours), or ``d`` (days) to specify the units.
+
+Specify ``infinite`` for an unlimited wait time.
+
+| **Default:** 200 milliseconds
+| **Related Environment Variable:** ``KMP_LIBRARY``
+| **Example:** ``KMP_BLOCKTIME=1s``
+
+KMP_CPUINFO_FILE
+""""""""""""""""
+
+Specifies an alternate file name for a file containing the machine topology
+description. The file must be in the same format as :file:`/proc/cpuinfo`.
+
+**Default:** None
+
+KMP_DETERMINISTIC_REDUCTION
+"""""""""""""""""""""""""""
+
+Enables (``true``) or disables (``false``) the use of a specific ordering of
+the reduction operations for implementing the reduction clause for an OpenMP
+parallel region. This has the effect that, for a given number of threads, in
+a given parallel region, for a given data set and reduction operation, a
+floating point reduction done for an OpenMP reduction clause has a consistent
+floating point result from run to run, since round-off errors are identical.
+
+| **Default:** ``false``
+| **Example:** ``KMP_DETERMINISTIC_REDUCTION=true``
+
+KMP_DYNAMIC_MODE
+""""""""""""""""
+
+Selects the method used to determine the number of threads to use for a parallel
+region when ``OMP_DYNAMIC=true``. Possible values: (``load_balance`` | ``thread_limit``), where,
+
+* ``load_balance``: tries to avoid using more threads than available execution units on the machine;
+* ``thread_limit``: tries to avoid using more threads than total execution units on the machine.
+
+**Default:** ``load_balance`` (on all supported platforms)
+
+KMP_HOT_TEAMS_MAX_LEVEL
+"""""""""""""""""""""""
+Sets the maximum nested level to which teams of threads will be hot.
+
+.. note::
+    A hot team is a team of threads optimized for faster reuse by subsequent
+    parallel regions. In a hot team, threads are kept ready for execution of
+    the next parallel region, in contrast to the cold team, which is freed
+    after each parallel region, with its threads going into a common pool
+    of threads.
+
+For values of 2 and above, nested parallelism should be enabled.
+
+**Default:** 1
+
+KMP_HOT_TEAMS_MODE
+""""""""""""""""""
+
+Specifies the run-time behavior when the number of threads in a hot team is reduced.
+Possible values:
+
+* ``0`` - Extra threads are freed and put into a common pool of threads.
+* ``1`` - Extra threads are kept in the team in reserve, for faster reuse
+  in subsequent parallel regions.
+
+**Default:** 0
+
+KMP_HW_SUBSET
+"""""""""""""
+
+Specifies the subset of available hardware resources for the hardware topology
+hierarchy. The subset is specified in terms of number of units per upper layer
+unit starting from top layer downwards. For example, you can specify the number
+of sockets (top layer units), cores per socket, and threads per core to use with
+an OpenMP application, as an alternative to writing complicated explicit affinity
+settings or a limiting process affinity mask. You can also specify an offset value
+to set which resources to use.
+
+An extended syntax is available when ``KMP_TOPOLOGY_METHOD=hwloc``. Depending on what
+resources are detected, you may be able to specify additional resources, such as
+NUMA domains and groups of hardware resources that share certain cache levels.
+
+**Basic syntax:** ``num_unitsID[@offset] [,num_unitsID[@offset]...]``
+
+Supported unit IDs are not case-sensitive.
+
+| ``S`` - socket
+| ``num_units`` specifies the requested number of sockets.
+
+| ``D`` - die
+| ``num_units`` specifies the requested number of dies per socket.
+
+| ``C`` - core
+| ``num_units`` specifies the requested number of cores per die - if any - otherwise, per socket.
+
+| ``T`` - thread
+| ``num_units`` specifies the requested number of HW threads per core.
+
+``offset`` - (Optional) The number of units to skip.
+
+.. note::
+    The hardware cache can be specified as a unit, e.g. ``L2`` for L2 cache,
+    or ``LL`` for last level cache.
+
+**Extended syntax when KMP_TOPOLOGY_METHOD=hwloc:**
+
+Additional IDs can be specified if detected. For example:
+
+| ``N`` - numa
+| ``num_units`` specifies the requested number of NUMA nodes per upper layer
+  unit, e.g. per socket.
+
+| ``TI`` - tile
+| ``num_units`` specifies the requested number of tiles to use per upper layer
+  unit, e.g. per NUMA node.
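The basic syntax ``num_unitsID[@offset]`` can be illustrated with a short parsing sketch. This is a hedged model written for this document, not the runtime's parser: ``parse_hw_subset`` is a hypothetical name, the regular expression is an assumption about what a well-formed item looks like, and raising ``ValueError`` for a repeated resource merely stands in for the warning-and-ignore behavior the runtime applies to the whole setting.

```python
import re

# One item: a count, a unit ID (letters, optionally followed by digits
# as in "L2"), and an optional "@offset". This pattern is an assumption.
_ITEM = re.compile(r"(\d+)([A-Za-z]+\d*)(?:@(\d+))?")


def parse_hw_subset(value):
    """Parse a KMP_HW_SUBSET-style string into (num_units, ID, offset) tuples."""
    items = []
    seen = set()
    for raw in value.split(","):
        match = _ITEM.fullmatch(raw.strip())
        if match is None:
            raise ValueError(f"malformed item: {raw!r}")
        count, unit, offset = match.groups()
        unit = unit.upper()  # unit IDs are not case-sensitive
        if unit in seen:
            # Modeled as an error here; the runtime warns and ignores the setting.
            raise ValueError(f"resource specified twice: {unit}")
        seen.add(unit)
        # A missing offset means no units are skipped.
        items.append((int(count), unit, int(offset or 0)))
    return items
```

With this sketch, ``parse_hw_subset("2s@2,4c@8,2t")`` yields two sockets after skipping two, four cores after skipping eight, and two threads per core with no offset.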
+
+When any numa or tile units are specified in ``KMP_HW_SUBSET`` and the hwloc
+topology method is available, the ``KMP_TOPOLOGY_METHOD`` will be automatically
+set to ``hwloc``, so there is no need to set it explicitly.
+
+If you don't specify one or more types of resource, such as socket or thread,
+all available resources of that type are used.
+
+The run-time library prints a warning, and the setting of
+``KMP_HW_SUBSET`` is ignored if:
+
+* a resource is specified, but detection of that resource is not supported
+  by the chosen topology detection method and/or
+* a resource is specified twice.
+
+This variable does not work if ``KMP_AFFINITY=disabled``.
+
+**Default:** If omitted, the default value is to use all the
+available hardware resources.
+
+**Examples:**
+
+* ``2s,4c,2t``: Use the first 2 sockets (s0 and s1), the first 4 cores on each
+  socket (c0 - c3), and 2 threads per core.
+* ``2s@2,4c@8,2t``: Skip the first 2 sockets (s0 and s1) and use 2 sockets
+  (s2-s3), skip the first 8 cores (c0-c7) and use 4 cores on each socket
+  (c8-c11), and use 2 threads per core.
+* ``5C@1,3T``: Use all available sockets, skip the first core and use 5 cores,
+  and use 3 threads per core.
+* ``1T``: Use all cores on all sockets, 1 thread per core.
+* ``1s, 1d, 1n, 1c, 1t``: Use 1 socket, 1 die, 1 NUMA node, 1 core, 1 thread,
+  which results in a single HW thread being used.
+* ``1s, 1c, 1t``: Use 1 socket, 1 core, 1 thread. This may result in using a
+  single thread on a 3-layer topology architecture, or multiple threads on a
+  4-layer or 5-layer architecture. The result may even be different on the same
+  architecture, depending on the ``KMP_TOPOLOGY_METHOD`` specified, as hwloc can
+  often detect more topology layers than the default method used by the OpenMP
+  run-time library.
+
+To see the result of the setting, you can specify the ``verbose`` modifier in
+the ``KMP_AFFINITY`` environment variable.
+The OpenMP run-time library will output
+to ``stderr`` the information about the discovered hardware topology before and
+after the ``KMP_HW_SUBSET`` setting was applied.
+
+KMP_INHERIT_FP_CONTROL
+""""""""""""""""""""""
+
+Enables (``true``) or disables (``false``) the copying of the floating-point
+control settings of the primary thread to the floating-point control settings
+of the OpenMP worker threads at the start of each parallel region.
+
+**Default:** ``true``
+
+KMP_LIBRARY
+"""""""""""
+
+Selects the OpenMP run-time library execution mode. The values for this variable
+are ``serial``, ``turnaround``, or ``throughput``.
+
+| **Default:** ``throughput``
+| **Related environment variables:** ``KMP_BLOCKTIME`` and ``OMP_WAIT_POLICY``
+
+KMP_SETTINGS
+""""""""""""
+
+Enables (``true``) or disables (``false``) the printing of OpenMP run-time library
+environment variables during program execution. Two lists of variables are printed:
+user-defined environment variable settings and effective values of variables used
+by the OpenMP run-time library.
+
+**Default:** ``false``
+
+KMP_STACKSIZE
+"""""""""""""
+
+Sets the number of bytes to allocate for each OpenMP thread to use as its private stack.
+
+Recommended size is ``16M``.
+
+Use the optional suffixes ``B`` (bytes), ``K`` (kilobytes), ``M`` (megabytes),
+``G`` (gigabytes), or ``T`` (terabytes) to specify the units.
+If you specify a value without a suffix, the unit is assumed to be ``K`` (kilobytes).
+
+**Related environment variable:** ``KMP_STACKSIZE`` overrides ``GOMP_STACKSIZE``, which
+overrides ``OMP_STACKSIZE``.
+
+**Default:**
+
+* 32-bit architectures: ``2M``
+* 64-bit architectures: ``4M``
+
+KMP_TOPOLOGY_METHOD
+"""""""""""""""""""
+
+Forces OpenMP to use a particular machine topology modeling method.
+
+Possible values are:
+
+* ``all`` - Let OpenMP choose which topology method is most appropriate
+  based on the platform and possibly other environment variable settings.
+* ``cpuid_leaf31`` (x86 only) - Decodes the APIC identifiers as specified by leaf 31 of the
+  cpuid instruction. The runtime will produce an error if the machine does not support leaf 31.
+* ``cpuid_leaf11`` (x86 only) - Decodes the APIC identifiers as specified by leaf 11 of the
+  cpuid instruction. The runtime will produce an error if the machine does not support leaf 11.
+* ``cpuid_leaf4`` (x86 only) - Decodes the APIC identifiers as specified in leaf 4
+  of the cpuid instruction. The runtime will produce an error if the machine does not support leaf 4.
+* ``cpuinfo`` - If ``KMP_CPUINFO_FILE`` is not specified, forces OpenMP to
+  parse :file:`/proc/cpuinfo` to determine the topology (Linux only).
+  If ``KMP_CPUINFO_FILE`` is specified as described above, uses it (Windows or Linux).
+* ``group`` - Models the machine as a 2-level map, with level 0 specifying the
+  different processors in a group, and level 1 specifying the different
+  groups (Windows 64-bit only).
+
+.. note::
+    Support for ``group`` is now deprecated and will be removed in a future release. Use ``all`` instead.
+
+* ``flat`` - Models the machine as a flat (linear) list of processors.
+* ``hwloc`` - Models the machine as the Portable Hardware Locality (hwloc) library does.
+  This model is the most detailed and includes, but is not limited to: numa domains,
+  packages, cores, hardware threads, caches, and Windows processor groups. This method is
+  only available if you have configured libomp to use hwloc during CMake configuration.
+
+**Default:** ``all``
+
+KMP_VERSION
+"""""""""""
+
+Enables (``true``) or disables (``false``) the printing of OpenMP run-time
+library version information during program execution.
+
+**Default:** ``false``
+
+KMP_WARNINGS
+""""""""""""
+
+Enables (``true``) or disables (``false``) displaying warnings from the
+OpenMP run-time library during program execution.
+
+**Default:** ``true``

 LLVM/OpenMP Target Host Runtime (``libomptarget``)
 --------------------------------------------------