diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst --- a/llvm/docs/AMDGPUUsage.rst +++ b/llvm/docs/AMDGPUUsage.rst @@ -3468,7 +3468,7 @@ Reserved, must be 0. GFX10 Controls the behavior of the - waitcnt's vmcnt and vscnt + s_waitcnt's vmcnt and vscnt counters. - If 0 vmcnt reports completion @@ -4140,24 +4140,22 @@ Memory Model ~~~~~~~~~~~~ -This section describes the mapping of LLVM memory model onto AMDGPU machine code -(see :ref:`memmodel`). +This section describes the mapping of the LLVM memory model onto AMDGPU machine +code (see :ref:`memmodel`). The AMDGPU backend supports the memory synchronization scopes specified in :ref:`amdgpu-memory-scopes`. -The code sequences used to implement the memory model are defined in table -:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`. - -The sequences specify the order of instructions that a single thread must -execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect -to other memory instructions executed by the same thread. This allows them to be +The code sequences used to implement the memory model specify the order of +instructions that a single thread must execute. The ``s_waitcnt`` and cache +management instructions such as ``buffer_wbinvl1_vol`` are defined with respect to +other memory instructions executed by the same thread. This allows them to be moved earlier or later which can allow them to be combined with other instances -of the same instruction, or hoisted/sunk out of loops to improve -performance. Only the instructions related to the memory model are given; -additional ``s_waitcnt`` instructions are required to ensure registers are -defined before being used. These may be able to be combined with the memory -model ``s_waitcnt`` instructions as described above. +of the same instruction, or hoisted/sunk out of loops to improve performance. 
+Only the instructions related to the memory model are given; additional +``s_waitcnt`` instructions are required to ensure registers are defined before +being used. These may be able to be combined with the memory model ``s_waitcnt`` +instructions as described above. The AMDGPU backend supports the following memory models: @@ -4183,6 +4181,79 @@ ``buffer/global/flat_load/store/atomic`` instructions to global memory are termed vector memory operations. +Private address space uses ``buffer_load/store`` using the scratch V# +(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread +is accessing the memory, atomic memory orderings are not meaningful, and all +accesses are treated as non-atomic. + +Constant address space uses ``buffer/global_load`` instructions (or equivalent +scalar memory instructions). Since the constant address space contents do not +change during the execution of a kernel dispatch it is not legal to perform +stores, and atomic memory orderings are not meaningful, and all accesses are +treated as non-atomic. + +A memory synchronization scope wider than work-group is not meaningful for the +group (LDS) address space and is treated as work-group. + +The memory model does not support the region address space which is treated as +non-atomic. + +Acquire memory ordering is not meaningful on store atomic instructions and is +treated as non-atomic. + +Release memory ordering is not meaningful on load atomic instructions and is +treated as non-atomic. + +Acquire-release memory ordering is not meaningful on load or store atomic +instructions and is treated as acquire and release respectively. + +The memory order also adds the single thread optimization constraints defined in +table +:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`. + + .. 
table:: AMDHSA Memory Model Single Thread Optimization Constraints + :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table + + ============ ============================================================== + LLVM Memory Optimization Constraints + Ordering + ============ ============================================================== + unordered *none* + monotonic *none* + acquire - If a load atomic/atomicrmw then no following load/load + atomic/store/ store atomic/atomicrmw/fence instruction can + be moved before the acquire. + - If a fence then same as load atomic, plus no preceding + associated fence-paired-atomic can be moved after the fence. + release - If a store atomic/atomicrmw then no preceding load/load + atomic/store/ store atomic/atomicrmw/fence instruction can + be moved after the release. + - If a fence then same as store atomic, plus no following + associated fence-paired-atomic can be moved before the + fence. + acq_rel Same constraints as both acquire and release. + seq_cst - If a load atomic then same constraints as acquire, plus no + preceding sequentially consistent load atomic/store + atomic/atomicrmw/fence instruction can be moved after the + seq_cst. + - If a store atomic then the same constraints as release, plus + no following sequentially consistent load atomic/store + atomic/atomicrmw/fence instruction can be moved before the + seq_cst. + - If an atomicrmw/fence then same constraints as acq_rel. + ============ ============================================================== + +The code sequences used to implement the memory model are defined in the +following sections: + +* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9` +* :ref:`amdgpu-amdhsa-memory-model-gfx10` + +.. _amdgpu-amdhsa-memory-model-gfx6-gfx9: + +Memory Model GFX6-GFX9 +++++++++++++++++++++++ + For GFX6-GFX9: * Each agent has multiple shader arrays (SA). 
@@ -4233,110 +4304,13 @@ * The L2 cache can be kept coherent with other agents on some targets, or ranges of virtual addresses can be set up to bypass it to ensure system coherence. -For GFX10: - -* Each agent has multiple shader arrays (SA). -* Each SA has multiple work-group processors (WGP). -* Each WGP has multiple compute units (CU). -* Each CU has multiple SIMDs that execute wavefronts. -* The wavefronts for a single work-group are executed in the same - WGP. In CU wavefront execution mode the wavefronts may be executed by - different SIMDs in the same CU. In WGP wavefront execution mode the - wavefronts may be executed by different SIMDs in different CUs in the same - WGP. -* Each WGP has a single LDS memory shared by the wavefronts of the work-groups - executing on it. -* All LDS operations of a WGP are performed as wavefront wide operations in a - global order and involve no caching. Completion is reported to a wavefront in - execution order. -* The LDS memory has multiple request queues shared by the SIMDs of a - WGP. Therefore, the LDS operations performed by different wavefronts of a - work-group can be reordered relative to each other, which can result in - reordering the visibility of vector memory operations with respect to LDS - operations of other wavefronts in the same work-group. A ``s_waitcnt - lgkmcnt(0)`` is required to ensure synchronization between LDS operations and - vector memory operations between wavefronts of a work-group, but not between - operations performed by the same wavefront. -* The vector memory operations are performed as wavefront wide operations. - Completion of load/store/sample operations are reported to a wavefront in - execution order of other load/store/sample operations performed by that - wavefront. -* The vector memory operations access a vector L0 cache. There is a single L0 - cache per CU. Each SIMD of a CU accesses the same L0 cache. 
Therefore, no - special action is required for coherence between the lanes of a single - wavefront. However, a ``buffer_gl0_inv`` is required for coherence between - wavefronts executing in the same work-group as they may be executing on SIMDs - of different CUs that access different L0s. A ``buffer_gl0_inv`` is also - required for coherence between wavefronts executing in different work-groups - as they may be executing on different WGPs. -* The scalar memory operations access a scalar L0 cache shared by all wavefronts - on a WGP. The scalar and vector L0 caches are not coherent. However, scalar - operations are used in a restricted way so do not impact the memory model. See - :ref:`amdgpu-amdhsa-memory-spaces`. -* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on - the same SA. Therefore, no special action is required for coherence between - the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is - required for coherence between wavefronts executing in different work-groups - as they may be executing on different SAs that access different L1s. -* The L1 caches have independent quadrants to service disjoint ranges of virtual - addresses. -* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the - vector and scalar memory operations performed by different wavefronts, whether - executing in the same or different work-groups (which may be executing on - different CUs accessing different L0s), can be reordered relative to each - other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure - synchronization between vector memory operations of different wavefronts. It - ensures a previous vector memory operation has completed before executing a - subsequent vector memory or LDS operation and so can be used to meet the - requirements of acquire, release and sequential consistency. -* The L1 caches use an L2 cache shared by all SAs on the same agent. 
-* The L2 cache has independent channels to service disjoint ranges of virtual - addresses. -* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 - quadrant has a separate request queue per L2 channel. Therefore, the vector - and scalar memory operations performed by wavefronts executing in different - work-groups (which may be executing on different SAs) of an agent can be - reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is - required to ensure synchronization between vector memory operations of - different SAs. It ensures a previous vector memory operation has completed - before executing a subsequent vector memory and so can be used to meet the - requirements of acquire, release and sequential consistency. -* The L2 cache can be kept coherent with other agents on some targets, or ranges - of virtual addresses can be set up to bypass it to ensure system coherence. - -Private address space uses ``buffer_load/store`` using the scratch V# -(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread -is accessing the memory, atomic memory orderings are not meaningful, and all -accesses are treated as non-atomic. - -Constant address space uses ``buffer/global_load`` instructions (or equivalent -scalar memory instructions). Since the constant address space contents do not -change during the execution of a kernel dispatch it is not legal to perform -stores, and atomic memory orderings are not meaningful, and all access are -treated as non-atomic. - -A memory synchronization scope wider than work-group is not meaningful for the -group (LDS) address space and is treated as work-group. - -The memory model does not support the region address space which is treated as -non-atomic. - -Acquire memory ordering is not meaningful on store atomic instructions and is -treated as non-atomic. - -Release memory ordering is not meaningful on load atomic instructions and is -treated a non-atomic. 
- -Acquire-release memory ordering is not meaningful on load or store atomic -instructions and is treated as acquire and release respectively. - -AMDGPU backend only uses scalar memory operations to access memory that is -proven to not change during the execution of the kernel dispatch. This includes -constant address space and global address space for program scope const -variables. Therefore, the kernel machine code does not have to maintain the -scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar -and vector L1 caches are invalidated between kernel dispatches by CP since -constant address space data may change between kernel dispatch executions. See +Scalar memory operations are only used to access memory that is proven to not +change during the execution of the kernel dispatch. This includes constant +address space and global address space for program scope const variables. +Therefore, the kernel machine code does not have to maintain the scalar cache to +ensure it is coherent with the vector caches. The scalar and vector caches are +invalidated between kernel dispatches by CP since constant address space data +may change between kernel dispatch executions. See :ref:`amdgpu-amdhsa-memory-spaces`. The one exception is if scalar writes are used to spill SGPR registers. In this @@ -4348,452 +4322,123 @@ creates a frame at the same address, respectively. There is no need for a ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. -For GFX6-GFX9, scratch backing memory (which is used for the private address -space) is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the -private address space is only accessed by a single thread, and is always -write-before-read, there is never a need to invalidate these entries from the L1 -cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the -volatile cache lines. 
- -For GFX10, scratch backing memory (which is used for the private address space) -is accessed with MTYPE NC (non-coherent). Since the private address space is -only accessed by a single thread, and is always write-before-read, there is -never a need to invalidate these entries from the L0 or L1 caches. - -For GFX10, wavefronts are executed in native mode with in-order reporting of -loads and sample instructions. In this mode vmcnt reports completion of load, -atomic with return and sample instructions in order, and the vscnt reports the -completion of store and atomic without return in order. See ``MEM_ORDERED`` -field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. +For kernarg backing memory: -In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode: - -* In WGP wavefront execution mode the wavefronts of a work-group are executed - on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per - CU L0 caches is required for work-group synchronization. Also accesses to L1 - at work-group scope need to be explicitly ordered as the accesses from - different CUs are not ordered. -* In CU wavefront execution mode the wavefronts of a work-group are executed on - the SIMDs of a single CU of the WGP. Therefore, all global memory access by - the work-group access the same L0 which in turn ensures L1 accesses are - ordered and so do not require explicit management of the caches for - work-group synchronization. +* CP invalidates the L1 cache at the start of each kernel dispatch. +* On dGPU the kernarg backing memory is allocated in host memory accessed as + MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also + causes it to be treated as non-volatile and so is not invalidated by + ``*_vol``. +* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) + and so the L2 cache will be coherent with the CPU and other agents. 
-See ``WGP_MODE`` field in -:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and -:ref:`amdgpu-target-features`. +Scratch backing memory (which is used for the private address space) is accessed +with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is +only accessed by a single thread, and is always write-before-read, there is +never a need to invalidate these entries from the L1 cache. Hence all cache +invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. -On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing -to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as -non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC -(cache coherent) and so the L2 cache will be coherent with the CPU and other -agents. +The code sequences used to implement the memory model for GFX6-GFX9 are defined +in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. - .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10 - :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table + .. 
table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 + :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table - ============ ============ ============== ========== ================================ ================================ - LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code AMDGPU Machine Code - Ordering Sync Scope Address GFX6-9 GFX10 + ============ ============ ============== ========== ================================ + LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code + Ordering Sync Scope Address GFX6-9 Space - ============ ============ ============== ========== ================================ ================================ + ============ ============ ============== ========== ================================ **Non-Atomic** - --------------------------------------------------------------------------------------------------------------------- - load *none* *none* - global - !volatile & !nontemporal - !volatile & !nontemporal + ------------------------------------------------------------------------------------ + load *none* *none* - global - !volatile & !nontemporal - generic - - private 1. buffer/global/flat_load 1. buffer/global/flat_load + - private 1. buffer/global/flat_load - constant - - volatile & !nontemporal - volatile & !nontemporal + - volatile & !nontemporal - 1. buffer/global/flat_load 1. buffer/global/flat_load - glc=1 glc=1 dlc=1 + 1. buffer/global/flat_load + glc=1 - - nontemporal - nontemporal + - nontemporal - 1. buffer/global/flat_load 1. buffer/global/flat_load - glc=1 slc=1 slc=1 + 1. buffer/global/flat_load + glc=1 slc=1 - load *none* *none* - local 1. ds_load 1. ds_load - store *none* *none* - global - !nontemporal - !nontemporal + load *none* *none* - local 1. ds_load + store *none* *none* - global - !nontemporal - generic - - private 1. buffer/global/flat_store 1. buffer/global/flat_store + - private 1. buffer/global/flat_store - constant - - nontemporal - nontemporal + - nontemporal - 1. 
buffer/global/flat_store 1. buffer/global/flat_store - glc=1 slc=1 slc=1 + 1. buffer/global/flat_store + glc=1 slc=1 - store *none* *none* - local 1. ds_store 1. ds_store + store *none* *none* - local 1. ds_store **Unordered Atomic** - --------------------------------------------------------------------------------------------------------------------- - load atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. - store atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. - atomicrmw unordered *any* *any* *Same as monotonic *Same as monotonic - atomic*. atomic*. + ------------------------------------------------------------------------------------ + load atomic unordered *any* *any* *Same as non-atomic*. + store atomic unordered *any* *any* *Same as non-atomic*. + atomicrmw unordered *any* *any* *Same as monotonic + atomic*. **Monotonic Atomic** - --------------------------------------------------------------------------------------------------------------------- - load atomic monotonic - singlethread - global 1. buffer/global/flat_load 1. buffer/global/flat_load + ------------------------------------------------------------------------------------ + load atomic monotonic - singlethread - global 1. buffer/global/flat_load - wavefront - generic - load atomic monotonic - workgroup - global 1. buffer/global/flat_load 1. buffer/global/flat_load - - generic glc=1 - - - If CU wavefront execution - mode, omit glc=1. + load atomic monotonic - workgroup - global 1. buffer/global/flat_load + - generic - load atomic monotonic - singlethread - local 1. ds_load 1. ds_load + load atomic monotonic - singlethread - local 1. ds_load - wavefront - workgroup - load atomic monotonic - agent - global 1. buffer/global/flat_load 1. buffer/global/flat_load - - system - generic glc=1 glc=1 dlc=1 - store atomic monotonic - singlethread - global 1. buffer/global/flat_store 1. buffer/global/flat_store + load atomic monotonic - agent - global 1. 
buffer/global/flat_load + - system - generic glc=1 + store atomic monotonic - singlethread - global 1. buffer/global/flat_store - wavefront - generic - workgroup - agent - system - store atomic monotonic - singlethread - local 1. ds_store 1. ds_store + store atomic monotonic - singlethread - local 1. ds_store - wavefront - workgroup - atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 1. buffer/global/flat_atomic + atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic - wavefront - generic - workgroup - agent - system - atomicrmw monotonic - singlethread - local 1. ds_atomic 1. ds_atomic + atomicrmw monotonic - singlethread - local 1. ds_atomic - wavefront - workgroup **Acquire Atomic** - --------------------------------------------------------------------------------------------------------------------- - load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 1. buffer/global/ds/flat_load - - wavefront - local - - generic - load atomic acquire - workgroup - global 1. buffer/global_load 1. buffer/global_load glc=1 - - - If CU wavefront execution - mode, omit glc=1. - - 2. s_waitcnt vmcnt(0) - - - If CU wavefront execution - mode, omit. - - Must happen before - the following buffer_gl0_inv - and before any following - global/generic - load/load - atomic/store/store - atomic/atomicrmw. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - load atomic acquire - workgroup - local 1. ds_load 1. ds_load - 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) - - - If OpenCL, omit. - If OpenCL, omit. - - Must happen before - Must happen before - any following the following buffer_gl0_inv - global/generic and before any following - load/load global/generic load/load - atomic/store/store atomic/store/store - atomic/atomicrmw. atomic/atomicrmw. 
- - Ensures any - Ensures any - following global following global - data read is no data read is no - older than the load older than the load - atomic value being atomic value being - acquired. acquired. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - If OpenCL, omit. - - Ensures that - following - loads will not see - stale data. - - load atomic acquire - workgroup - generic 1. flat_load 1. flat_load glc=1 - - - If CU wavefront execution - mode, omit glc=1. - - 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) & - vmcnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0). - - If OpenCL, omit. - If OpenCL, omit - lgkmcnt(0). - - Must happen before - Must happen before - any following the following - global/generic buffer_gl0_inv and any - load/load following global/generic - atomic/store/store load/load - atomic/atomicrmw. atomic/store/store - atomic/atomicrmw. - - Ensures any - Ensures any - following global following global - data read is no data read is no - older than the load older than the load - atomic value being atomic value being - acquired. acquired. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - load atomic acquire - agent - global 1. buffer/global_load 1. buffer/global_load - - system glc=1 glc=1 dlc=1 - 2. s_waitcnt vmcnt(0) 2. s_waitcnt vmcnt(0) - - - Must happen before - Must happen before - following following - buffer_wbinvl1_vol. buffer_gl*_inv. - - Ensures the load - Ensures the load - has completed has completed - before invalidating before invalidating - the cache. the caches. - - 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that - Ensures that - following following - loads will not see loads will not see - stale global data. 
stale global data. - - load atomic acquire - agent - generic 1. flat_load glc=1 1. flat_load glc=1 dlc=1 - - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vmcnt(0) & - lgkmcnt(0) lgkmcnt(0) - - - If OpenCL omit - If OpenCL omit - lgkmcnt(0). lgkmcnt(0). - - Must happen before - Must happen before - following following - buffer_wbinvl1_vol. buffer_gl*_invl. - - Ensures the flat_load - Ensures the flat_load - has completed has completed - before invalidating before invalidating - the cache. the caches. - - 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that - Ensures that - following loads following loads - will not see stale will not see stale - global data. global data. - - atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic + ------------------------------------------------------------------------------------ + load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load - wavefront - local - generic - atomicrmw acquire - workgroup - global 1. buffer/global_atomic 1. buffer/global_atomic - 2. s_waitcnt vm/vscnt(0) - - - If CU wavefront execution - mode, omit. - - Use vmcnt(0) if atomic with - return and vscnt(0) if - atomic with no-return. - - Must happen before - the following buffer_gl0_inv - and before any following - global/generic - load/load - atomic/store/store - atomic/atomicrmw. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - atomicrmw acquire - workgroup - local 1. ds_atomic 1. ds_atomic - 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) - - - If OpenCL, omit. - If OpenCL, omit. - - Must happen before - Must happen before - any following the following - global/generic buffer_gl0_inv. 
- load/load - atomic/store/store - atomic/atomicrmw. - - Ensures any - Ensures any - following global following global - data read is no data read is no - older than the older than the - atomicrmw value atomicrmw value - being acquired. being acquired. - - 3. buffer_gl0_inv - - - If OpenCL omit. - - Ensures that - following - loads will not see - stale data. - - atomicrmw acquire - workgroup - generic 1. flat_atomic 1. flat_atomic - 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) & - vm/vscnt(0) - - - If CU wavefront execution - mode, omit vm/vscnt(0). - - If OpenCL, omit. - If OpenCL, omit - waitcnt lgkmcnt(0). - - Use vmcnt(0) if atomic with - return and vscnt(0) if - atomic with no-return. - - Must happen before - Must happen before - any following the following - global/generic buffer_gl0_inv. + load atomic acquire - workgroup - global 1. buffer/global_load + + load atomic acquire - workgroup - local 1. ds_load + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen before + any following + global/generic load/load atomic/store/store atomic/atomicrmw. - - Ensures any - Ensures any - following global following global - data read is no data read is no - older than the older than the - atomicrmw value atomicrmw value - being acquired. being acquired. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - atomicrmw acquire - agent - global 1. buffer/global_atomic 1. buffer/global_atomic - - system 2. s_waitcnt vmcnt(0) 2. s_waitcnt vm/vscnt(0) - - - Use vmcnt(0) if atomic with - return and vscnt(0) if - atomic with no-return. - waitcnt lgkmcnt(0). - - Must happen before - Must happen before - following following - buffer_wbinvl1_vol. buffer_gl*_inv. - - Ensures the - Ensures the - atomicrmw has atomicrmw has - completed before completed before - invalidating the invalidating the - cache. caches. - - 3. buffer_wbinvl1_vol 3. 
buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that - Ensures that - following loads following loads - will not see stale will not see stale - global data. global data. - - atomicrmw acquire - agent - generic 1. flat_atomic 1. flat_atomic - - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vm/vscnt(0) & - lgkmcnt(0) lgkmcnt(0) - - - If OpenCL, omit - If OpenCL, omit - lgkmcnt(0). lgkmcnt(0). - - Use vmcnt(0) if atomic with - return and vscnt(0) if - atomic with no-return. - - Must happen before - Must happen before - following following - buffer_wbinvl1_vol. buffer_gl*_inv. - - Ensures the - Ensures the - atomicrmw has atomicrmw has - completed before completed before - invalidating the invalidating the - cache. caches. - - 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that - Ensures that - following loads following loads - will not see stale will not see stale - global data. global data. - - fence acquire - singlethread *none* *none* *none* - - wavefront - fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL and - If OpenCL and - address space is address space is - not generic, omit. not generic, omit - lgkmcnt(0). - - If OpenCL and - address space is - local, omit - vmcnt(0) and vscnt(0). - - However, since LLVM - However, since LLVM - currently has no currently has no - address space on address space on - the fence need to the fence need to - conservatively conservatively - always generate. If always generate. 
If - fence had an fence had an - address space then address space then - set to address set to address - space of OpenCL space of OpenCL - fence flag, or to fence flag, or to - generic if both generic if both - local and global local and global - flags are flags are - specified. specified. - - Must happen after - any preceding - local/generic load - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + load atomic acquire - workgroup - generic 1. flat_load + + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. - Must happen before any following global/generic @@ -4803,1777 +4448,3280 @@ - Ensures any following global data read is no - older than the - value read by the - fence-paired-atomic. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load - atomic/ - atomicrmw-with-return-value - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - atomicrmw-no-return-value - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic load - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - Must happen before - the following - buffer_gl0_inv. - - Ensures that the - fence-paired atomic - has completed - before invalidating - the - cache. 
Therefore - any following - locations read must - be no older than - the value read by - the - fence-paired-atomic. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) vmcnt(0) & vscnt(0) - - - If OpenCL and - If OpenCL and - address space is address space is - not generic, omit not generic, omit - lgkmcnt(0). lgkmcnt(0). - - If OpenCL and - address space is - local, omit - vmcnt(0) and vscnt(0). - - However, since LLVM - However, since LLVM - currently has no currently has no - address space on address space on - the fence need to the fence need to - conservatively conservatively - always generate always generate - (see comment for (see comment for - previous fence). previous fence). - - Could be split into - separate s_waitcnt - vmcnt(0) and - s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic load - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). + older than the load + atomic value being + acquired. + + load atomic acquire - agent - global 1. buffer/global_load + - system glc=1 + 2. s_waitcnt vmcnt(0) + - Must happen before - the following + following buffer_wbinvl1_vol. - - Ensures that the - fence-paired atomic + - Ensures the load has completed before invalidating - the - cache. Therefore - any following - locations read must + the cache. + + 3. 
buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following + loads will not see + stale global data. + + load atomic acquire - agent - generic 1. flat_load glc=1 + - system 2. s_waitcnt vmcnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the flat_load + has completed + before invalidating + the cache. + + 3. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acquire - workgroup - global 1. buffer/global_atomic + + atomicrmw acquire - workgroup - local 1. ds_atomic + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + atomicrmw acquire - workgroup - generic 1. flat_atomic + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + atomicrmw acquire - agent - global 1. buffer/global_atomic + - system 2. s_waitcnt vmcnt(0) + + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 3. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - agent - generic 1. flat_atomic + - system 2.
s_waitcnt vmcnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 3. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acquire - singlethread *none* *none* + - wavefront + fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit. + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate. If + fence had an + address space then + set to address + space of OpenCL + fence flag, or to + generic if both + local and global + flags are + specified. + - Must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the + value read by the + fence-paired-atomic. + + fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). 
+ - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_wbinvl1_vol. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + cache. Therefore + any following + locations read must be no older than the value read by the fence-paired-atomic. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load - atomic/ - atomicrmw-with-return-value - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - atomicrmw-no-return-value - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic load - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - Must happen before - the following - buffer_gl*_inv. - - Ensures that the - fence-paired atomic - has completed - before invalidating - the - caches. Therefore - any following - locations read must - be no older than - the value read by - the - fence-paired-atomic. - - 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before any - Must happen before any - following global/generic following global/generic - load/load load/load - atomic/store/store atomic/store/store - atomic/atomicrmw. atomic/atomicrmw. 
- - Ensures that - Ensures that - following loads following loads - will not see stale will not see stale - global data. global data. + + 2. buffer_wbinvl1_vol + + - Must happen before any + following global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. **Release Atomic** - --------------------------------------------------------------------------------------------------------------------- - store atomic release - singlethread - global 1. buffer/global/ds/flat_store 1. buffer/global/ds/flat_store + ------------------------------------------------------------------------------------ + store atomic release - singlethread - global 1. buffer/global/ds/flat_store - wavefront - local - generic - store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL, omit. - If OpenCL, omit - lgkmcnt(0). + store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store - atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - store. store. 
- - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing the performing the - store that is being store that is being - released. released. - - 2. buffer/global_store 2. buffer/global_store - store atomic release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit. - - If OpenCL, omit. - - Could be split into - separate s_waitcnt - vmcnt(0) and s_waitcnt - vscnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - Must happen before - the following - store. - - Ensures that all - global memory - operations have - completed before - performing the - store that is being - released. - - 1. ds_store 2. ds_store - store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL, omit. - If OpenCL, omit - lgkmcnt(0). + - Must happen before + the following + store. + - Ensures that all + memory operations + to local have + completed before + performing the + store that is being + released. + + 2. buffer/global_store + store atomic release - workgroup - local 1. ds_store + store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. 
- - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store - atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - store. store. - - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing the performing the - store that is being store that is being - released. released. - - 2. flat_store 2. flat_store - store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & - - system - generic vmcnt(0) vmcnt(0) & vscnt(0) - - - If OpenCL, omit - If OpenCL, omit - lgkmcnt(0). lgkmcnt(0). - - Could be split into - Could be split into - separate s_waitcnt separate s_waitcnt - vmcnt(0) and vmcnt(0), s_waitcnt vscnt(0) - s_waitcnt and s_waitcnt - lgkmcnt(0) to allow lgkmcnt(0) to allow - them to be them to be - independently moved independently moved - according to the according to the - following rules. following rules. - - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) - must happen after must happen after - any preceding any preceding - global/generic global/generic - load/store/load load/load - atomic/store atomic/ - atomic/atomicrmw. atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) - must happen after must happen after - any preceding any preceding - local/generic local/generic - load/store/load load/store/load - atomic/store atomic/store - atomic/atomicrmw. atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - store. store. 
- - Ensures that all - Ensures that all - memory operations memory operations - to memory have to memory have - completed before completed before - performing the performing the - store that is being store that is being - released. released. - - 2. buffer/global/flat_store 2. buffer/global/flat_store - atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic + - Must happen before + the following + store. + - Ensures that all + memory operations + to local have + completed before + performing the + store that is being + released. + + 2. flat_store + store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + store. + - Ensures that all + memory operations + to memory have + completed before + performing the + store that is being + released. + + 2. buffer/global/flat_store + atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic - wavefront - local - generic - atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) + atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - If OpenCL, omit. - - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. 
- - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store - atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - atomicrmw. atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing the performing the - atomicrmw that is atomicrmw that is - being released. being released. - - 2. buffer/global_atomic 2. buffer/global_atomic - atomicrmw release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit. - - If OpenCL, omit. - - Could be split into - separate s_waitcnt - vmcnt(0) and s_waitcnt - vscnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - Must happen before - the following - store. - - Ensures that all - global memory - operations have - completed before - performing the - store that is being - released. - - 1. ds_atomic 2. ds_atomic - atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL, omit. - If OpenCL, omit - lgkmcnt(0). 
+ - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + atomicrmw release - workgroup - local 1. ds_atomic + atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store - atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - atomicrmw. atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing the performing the - atomicrmw that is atomicrmw that is - being released. being released. - - 2. flat_atomic 2. flat_atomic - atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lkkmcnt(0) & - - system - generic vmcnt(0) vmcnt(0) & vscnt(0) - - - If OpenCL, omit - If OpenCL, omit - lgkmcnt(0). lgkmcnt(0). - - Could be split into - Could be split into - separate s_waitcnt separate s_waitcnt - vmcnt(0) and vmcnt(0), s_waitcnt - s_waitcnt vscnt(0) and s_waitcnt - lgkmcnt(0) to allow lgkmcnt(0) to allow - them to be them to be - independently moved independently moved - according to the according to the - following rules. following rules. 
- - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) - must happen after must happen after - any preceding any preceding - global/generic global/generic - load/store/load load/load atomic/ - atomic/store atomicrmw-with-return-value. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/atomicrmw. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) - must happen after must happen after - any preceding any preceding - local/generic local/generic - load/store/load load/store/load - atomic/store atomic/store - atomic/atomicrmw. atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - atomicrmw. atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to global and local to global and local - have completed have completed - before performing before performing - the atomicrmw that the atomicrmw that - is being released. is being released. - - 2. buffer/global/flat_atomic 2. buffer/global/flat_atomic - fence release - singlethread *none* *none* *none* + - Must happen before + the following + atomicrmw. 
+ - Ensures that all + memory operations + to global and local + have completed + before performing + the atomicrmw that + is being released. + + 2. buffer/global/flat_atomic + fence release - singlethread *none* *none* - wavefront - fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL and - If OpenCL and - address space is address space is - not generic, omit. not generic, omit - lgkmcnt(0). - - If OpenCL and - address space is - local, omit - vmcnt(0) and vscnt(0). - - However, since LLVM - However, since LLVM - currently has no currently has no - address space on address space on - the fence need to the fence need to - conservatively conservatively - always generate. If always generate. If - fence had an fence had an - address space then address space then - set to address set to address - space of OpenCL space of OpenCL - fence flag, or to fence flag, or to - generic if both generic if both - local and global local and global - flags are flags are - specified. specified. + fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit. + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate. If + fence had an + address space then + set to address + space of OpenCL + fence flag, or to + generic if both + local and global + flags are + specified. - Must happen after any preceding local/generic load/load atomic/store/store atomic/atomicrmw. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/load - atomic/ - atomicrmw-with-return-value. 
- - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store atomic/ - atomicrmw. - - Must happen before - Must happen before - any following store any following store - atomic/atomicrmw atomic/atomicrmw - with an equal or with an equal or - wider sync scope wider sync scope - and memory ordering and memory ordering - stronger than stronger than - unordered (this is unordered (this is - termed the termed the - fence-paired-atomic). fence-paired-atomic). - - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing the performing the - following following - fence-paired-atomic. fence-paired-atomic. - - fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) vmcnt(0) & vscnt(0) - - - If OpenCL and - If OpenCL and - address space is address space is - not generic, omit not generic, omit - lgkmcnt(0). lgkmcnt(0). - - If OpenCL and - If OpenCL and - address space is address space is - local, omit local, omit - vmcnt(0). vmcnt(0) and vscnt(0). - - However, since LLVM - However, since LLVM - currently has no currently has no - address space on address space on - the fence need to the fence need to - conservatively conservatively - always generate. If always generate. If - fence had an fence had an - address space then address space then - set to address set to address - space of OpenCL space of OpenCL - fence flag, or to fence flag, or to - generic if both generic if both - local and global local and global - flags are flags are - specified. specified. 
- - Could be split into - Could be split into - separate s_waitcnt separate s_waitcnt - vmcnt(0) and vmcnt(0), s_waitcnt - s_waitcnt vscnt(0) and s_waitcnt - lgkmcnt(0) to allow lgkmcnt(0) to allow - them to be them to be - independently moved independently moved - according to the according to the - following rules. following rules. - - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) - must happen after must happen after - any preceding any preceding - global/generic global/generic - load/store/load load/load atomic/ - atomic/store atomicrmw-with-return-value. - atomic/atomicrmw. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) - must happen after must happen after - any preceding any preceding - local/generic local/generic - load/store/load load/store/load - atomic/store atomic/store - atomic/atomicrmw. atomic/atomicrmw. - - Must happen before - Must happen before - any following store any following store - atomic/atomicrmw atomic/atomicrmw - with an equal or with an equal or - wider sync scope wider sync scope - and memory ordering and memory ordering - stronger than stronger than - unordered (this is unordered (this is - termed the termed the - fence-paired-atomic). fence-paired-atomic). - - Ensures that all - Ensures that all - memory operations memory operations - have have - completed before completed before - performing the performing the - following following - fence-paired-atomic. fence-paired-atomic. + - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that all + memory operations + to local have + completed before + performing the + following + fence-paired-atomic. 
- **Acquire-Release Atomic** - --------------------------------------------------------------------------------------------------------------------- - atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic - - wavefront - local - - generic - atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL, omit. - If OpenCL, omit - lgkmcnt(0). - - Must happen after - Must happen after - any preceding any preceding - local/generic local/generic - load/store/load load/store/load - atomic/store atomic/store - atomic/atomicrmw. atomic/atomicrmw. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0), and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store - atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - atomicrmw. atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing the performing the - atomicrmw that is atomicrmw that is - being released. being released. - - 2. buffer/global_atomic 2. buffer/global_atomic - 3. s_waitcnt vm/vscnt(0) - - - If CU wavefront execution - mode, omit. - - Use vmcnt(0) if atomic with - return and vscnt(0) if - atomic with no-return. - - Must happen before - the following - buffer_gl0_inv. 
- - Ensures any - following global - data read is no - older than the - atomicrmw value - being acquired. - - 4. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - atomicrmw acq_rel - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit. - - If OpenCL, omit. - - Could be split into - separate s_waitcnt - vmcnt(0) and s_waitcnt - vscnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - Must happen before - the following - store. - - Ensures that all - global memory - operations have - completed before - performing the - store that is being - released. - - 1. ds_atomic 2. ds_atomic - 2. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) - - - If OpenCL, omit. - If OpenCL, omit. - - Must happen before - Must happen before - any following the following - global/generic buffer_gl0_inv. - load/load - atomic/store/store - atomic/atomicrmw. - - Ensures any - Ensures any - following global following global - data read is no data read is no - older than the load older than the load - atomic value being atomic value being - acquired. acquired. - - 4. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - If OpenCL omit. - - Ensures that - following - loads will not see - stale data. - - atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL, omit. - If OpenCL, omit lgkmcnt(0). - - Must happen after + fence release - agent *none* 1. 
s_waitcnt lgkmcnt(0) & + - system vmcnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate. If + fence had an + address space then + set to address + space of OpenCL + fence flag, or to + generic if both + local and global + flags are + specified. + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after any preceding - local/generic + global/generic load/store/load atomic/store atomic/atomicrmw. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store - atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - atomicrmw. atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing the performing the - atomicrmw that is atomicrmw that is - being released. being released. - - 2. flat_atomic 2. flat_atomic - 3. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL, omit. - If OpenCL, omit lgkmcnt(0). 
- - Must happen before - Must happen before - any following the following - global/generic buffer_gl0_inv. - load/load - atomic/store/store - atomic/atomicrmw. - - Ensures any - Ensures any - following global following global - data read is no data read is no - older than the load older than the load - atomic value being atomic value being - acquired. acquired. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) vmcnt(0) & vscnt(0) - - - If OpenCL, omit - If OpenCL, omit - lgkmcnt(0). lgkmcnt(0). - - Could be split into - Could be split into - separate s_waitcnt separate s_waitcnt - vmcnt(0) and vmcnt(0), s_waitcnt - s_waitcnt vscnt(0) and s_waitcnt - lgkmcnt(0) to allow lgkmcnt(0) to allow - them to be them to be - independently moved independently moved - according to the according to the - following rules. following rules. - - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) - must happen after must happen after - any preceding any preceding - global/generic global/generic - load/store/load load/load atomic/ - atomic/store atomicrmw-with-return-value. - atomic/atomicrmw. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) - must happen after must happen after - any preceding any preceding - local/generic local/generic - load/store/load load/store/load - atomic/store atomic/store - atomic/atomicrmw. atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - atomicrmw. atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to global have to global have - completed before completed before - performing the performing the - atomicrmw that is atomicrmw that is - being released. being released. 
- - 2. buffer/global_atomic 2. buffer/global_atomic - 3. s_waitcnt vmcnt(0) 3. s_waitcnt vm/vscnt(0) - - - Use vmcnt(0) if atomic with - return and vscnt(0) if - atomic with no-return. - waitcnt lgkmcnt(0). - - Must happen before - Must happen before - following following - buffer_wbinvl1_vol. buffer_gl*_inv. - - Ensures the - Ensures the - atomicrmw has atomicrmw has - completed before completed before - invalidating the invalidating the - cache. caches. - - 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that - Ensures that - following loads following loads - will not see stale will not see stale - global data. global data. - - atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) vmcnt(0) & vscnt(0) - - - If OpenCL, omit - If OpenCL, omit - lgkmcnt(0). lgkmcnt(0). - - Could be split into - Could be split into - separate s_waitcnt separate s_waitcnt - vmcnt(0) and vmcnt(0), s_waitcnt - s_waitcnt vscnt(0), and s_waitcnt - lgkmcnt(0) to allow lgkmcnt(0) to allow - them to be them to be - independently moved independently moved - according to the according to the - following rules. following rules. - - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) - must happen after must happen after - any preceding any preceding - global/generic global/generic - load/store/load load/load atomic - atomic/store atomicrmw-with-return-value. - atomic/atomicrmw. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) - must happen after must happen after - any preceding any preceding - local/generic local/generic - load/store/load load/store/load - atomic/store atomic/store - atomic/atomicrmw. atomic/atomicrmw. 
- - Must happen before - Must happen before - the following the following - atomicrmw. atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to global have have - completed before completed before - performing the performing the - atomicrmw that is atomicrmw that is - being released. being released. - - 2. flat_atomic 2. flat_atomic - 3. s_waitcnt vmcnt(0) & 3. s_waitcnt vm/vscnt(0) & - lgkmcnt(0) lgkmcnt(0) - - - If OpenCL, omit - If OpenCL, omit - lgkmcnt(0). lgkmcnt(0). - - Use vmcnt(0) if atomic with - return and vscnt(0) if - atomic with no-return. - - Must happen before - Must happen before - following following - buffer_wbinvl1_vol. buffer_gl*_inv. - - Ensures the - Ensures the - atomicrmw has atomicrmw has - completed before completed before - invalidating the invalidating the - cache. caches. - - 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that - Ensures that - following loads following loads - will not see stale will not see stale - global data. global data. - - fence acq_rel - singlethread *none* *none* *none* - - wavefront - fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - If OpenCL and - If OpenCL and - address space is address space is - not generic, omit. not generic, omit - lgkmcnt(0). - - If OpenCL and - address space is - local, omit - vmcnt(0) and vscnt(0). - - However, - However, - since LLVM since LLVM - currently has no currently has no - address space on address space on - the fence need to the fence need to - conservatively conservatively - always generate always generate - (see comment for (see comment for - previous fence). previous fence). 
- - Must happen after + - s_waitcnt lgkmcnt(0) + must happen after any preceding local/generic - load/load - atomic/store/store + load/store/load + atomic/store atomic/atomicrmw. - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0) and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/load - atomic/ - atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store atomic/ - atomicrmw. - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/store/store atomic/store/store - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that all - Ensures that all - memory operations memory operations - to local have have - completed before completed before - performing any performing any - following global following global - memory operations. memory operations. - - Ensures that the - Ensures that the - preceding preceding - local/generic load local/generic load - atomic/atomicrmw atomic/atomicrmw - with an equal or with an equal or - wider sync scope wider sync scope - and memory ordering and memory ordering - stronger than stronger than - unordered (this is unordered (this is - termed the termed the - acquire-fence-paired-atomic acquire-fence-paired-atomic - ) has completed ) has completed - before following before following - global memory global memory - operations. This operations. This - satisfies the satisfies the - requirements of requirements of - acquire. acquire. 
- - Ensures that all - Ensures that all - previous memory previous memory - operations have operations have - completed before a completed before a - following following - local/generic store local/generic store - atomic/atomicrmw atomic/atomicrmw - with an equal or with an equal or - wider sync scope wider sync scope - and memory ordering and memory ordering - stronger than stronger than - unordered (this is unordered (this is - termed the termed the - release-fence-paired-atomic release-fence-paired-atomic - ). This satisfies the ). This satisfies the - requirements of requirements of - release. release. - - Must happen before - the following - buffer_gl0_inv. - - Ensures that the - acquire-fence-paired - atomic has completed - before invalidating - the - cache. Therefore - any following - locations read must - be no older than - the value read by - the - acquire-fence-paired-atomic. - - 3. buffer_gl0_inv - - - If CU wavefront execution - mode, omit. - - Ensures that - following - loads will not see - stale data. - - fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) vmcnt(0) & vscnt(0) - - - If OpenCL and - If OpenCL and - address space is address space is - not generic, omit not generic, omit - lgkmcnt(0). lgkmcnt(0). - - If OpenCL and - address space is - local, omit - vmcnt(0) and vscnt(0). - - However, since LLVM - However, since LLVM - currently has no currently has no - address space on address space on - the fence need to the fence need to - conservatively conservatively - always generate always generate - (see comment for (see comment for - previous fence). previous fence). - - Could be split into - Could be split into - separate s_waitcnt separate s_waitcnt - vmcnt(0) and vmcnt(0), s_waitcnt - s_waitcnt vscnt(0) and s_waitcnt - lgkmcnt(0) to allow lgkmcnt(0) to allow - them to be them to be - independently moved independently moved - according to the according to the - following rules. following rules. 
- - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) - must happen after must happen after - any preceding any preceding - global/generic global/generic - load/store/load load/load - atomic/store atomic/ - atomic/atomicrmw. atomicrmw-with-return-value. - - s_waitcnt vscnt(0) - must happen after - any preceding - global/generic - store/store atomic/ - atomicrmw-no-return-value. - - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) - must happen after must happen after - any preceding any preceding - local/generic local/generic - load/store/load load/store/load - atomic/store atomic/store - atomic/atomicrmw. atomic/atomicrmw. - - Must happen before - Must happen before - the following the following - buffer_wbinvl1_vol. buffer_gl*_inv. - - Ensures that the - Ensures that the - preceding preceding - global/local/generic global/local/generic - load load - atomic/atomicrmw atomic/atomicrmw - with an equal or with an equal or - wider sync scope wider sync scope - and memory ordering and memory ordering - stronger than stronger than - unordered (this is unordered (this is - termed the termed the - acquire-fence-paired-atomic acquire-fence-paired-atomic - ) has completed ) has completed - before invalidating before invalidating - the cache. This the caches. This - satisfies the satisfies the - requirements of requirements of - acquire. acquire. - - Ensures that all - Ensures that all - previous memory previous memory - operations have operations have - completed before a completed before a - following following - global/local/generic global/local/generic - store store - atomic/atomicrmw atomic/atomicrmw - with an equal or with an equal or - wider sync scope wider sync scope - and memory ordering and memory ordering - stronger than stronger than - unordered (this is unordered (this is - termed the termed the - release-fence-paired-atomic release-fence-paired-atomic - ). This satisfies the ). This satisfies the - requirements of requirements of - release. release. - - 2. 
buffer_wbinvl1_vol 2. buffer_gl0_inv; - buffer_gl1_inv - - - Must happen before - Must happen before - any following any following - global/generic global/generic - load/load load/load - atomic/store/store atomic/store/store - atomic/atomicrmw. atomic/atomicrmw. - - Ensures that - Ensures that - following loads following loads - will not see stale will not see stale - global data. This global data. This - satisfies the satisfies the - requirements of requirements of - acquire. acquire. - - **Sequential Consistent Atomic** - --------------------------------------------------------------------------------------------------------------------- - load atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding - - wavefront - local load atomic acquire, load atomic acquire, - - generic except must generated except must generated - all instructions even all instructions even - for OpenCL.* for OpenCL.* - load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & - - generic vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit vmcnt(0) and - vscnt(0). - - Could be split into - separate s_waitcnt - vmcnt(0), s_waitcnt - vscnt(0), and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - Must - waitcnt lgkmcnt(0) must - happen after happen after - preceding preceding - global/generic load local load - atomic/store atomic/store - atomic/atomicrmw atomic/atomicrmw - with memory with memory - ordering of seq_cst ordering of seq_cst - and with equal or and with equal or - wider sync scope. wider sync scope. - (Note that seq_cst (Note that seq_cst - fences have their fences have their - own s_waitcnt own s_waitcnt - lgkmcnt(0) and so do lgkmcnt(0) and so do - not need to be not need to be - considered.) considered.) 
- - waitcnt vmcnt(0) - Must happen after - preceding - global/generic load - atomic/ - atomicrmw-with-return-value - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - vmcnt(0) and so do - not need to be - considered.) - - waitcnt vscnt(0) - Must happen after - preceding - global/generic store - atomic/ - atomicrmw-no-return-value - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - vscnt(0) and so do - not need to be - considered.) - - Ensures any - Ensures any - preceding preceding - sequential sequential - consistent local consistent global/local - memory instructions memory instructions - have completed have completed - before executing before executing - this sequentially this sequentially - consistent consistent - instruction. This instruction. This - prevents reordering prevents reordering - a seq_cst store a seq_cst store - followed by a followed by a - seq_cst load. (Note seq_cst load. (Note - that seq_cst is that seq_cst is - stronger than stronger than - acquire/release as acquire/release as - the reordering of the reordering of - load acquire load acquire - followed by a store followed by a store - release is release is - prevented by the prevented by the - waitcnt of waitcnt of - the release, but the release, but - there is nothing there is nothing - preventing a store preventing a store - release followed by release followed by - load acquire from load acquire from - completing out of completing out of - order. The waitcnt order. The waitcnt - could be placed after could be placed after - seq_store or before seq_store or before - the seq_load. We the seq_load. We - choose the load to choose the load to - make the waitcnt be make the waitcnt be - as late as possible as late as possible - so that the store so that the store - may have already may have already - completed.) completed.) - - 2. 
*Following 2. *Following - instructions same as instructions same as - corresponding load corresponding load - atomic acquire, atomic acquire, - except must generated except must generated - all instructions even all instructions even - for OpenCL.* for OpenCL.* + - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that all + memory operations + have + completed before + performing the + following + fence-paired-atomic. + + **Acquire-Release Atomic** + ------------------------------------------------------------------------------------ + atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + + atomicrmw acq_rel - workgroup - local 1. ds_atomic + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + 3. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. 
+ - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to global have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + 3. s_waitcnt vmcnt(0) + + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 4. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. 
+ - Ensures that all + memory operations + to global have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + 3. s_waitcnt vmcnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 4. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acq_rel - singlethread *none* *none* + - wavefront + fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit. + - However, + since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Must happen after + any preceding + local/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing any + following global + memory operations. + - Ensures that the + preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + acquire-fence-paired-atomic) + has completed + before following + global memory + operations. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + local/generic store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + release-fence-paired-atomic). + This satisfies the + requirements of + release. + + fence acq_rel - agent *none* 1. 
s_waitcnt lgkmcnt(0) & + - system vmcnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + buffer_wbinvl1_vol. + - Ensures that the + preceding + global/local/generic + load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + acquire-fence-paired-atomic) + has completed + before invalidating + the cache. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + global/local/generic + store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + release-fence-paired-atomic). + This satisfies the + requirements of + release. + + 2. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. This + satisfies the + requirements of + acquire. 
+ + **Sequential Consistent Atomic** + ------------------------------------------------------------------------------------ + load atomic seq_cst - singlethread - global *Same as corresponding + - wavefront - local load atomic acquire, + - generic except must generated + all instructions even + for OpenCL.* + load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) + - generic + + - Must + happen after + preceding + local/generic load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + lgkmcnt(0) and so do + not need to be + considered.) + - Ensures any + preceding + sequential + consistent local + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load. (Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + s_waitcnt of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The s_waitcnt + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the s_waitcnt be + as late as possible + so that the store + may have already + completed.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generated + all instructions even + for OpenCL.* load atomic seq_cst - workgroup - local *Same as corresponding load atomic acquire, except must generated all instructions even for OpenCL.* - 1. s_waitcnt vmcnt(0) & vscnt(0) - - - If CU wavefront execution - mode, omit. - - Could be split into - separate s_waitcnt - vmcnt(0) and s_waitcnt - vscnt(0) to allow - them to be - independently moved - according to the - following rules. 
- - waitcnt vmcnt(0) - Must happen after - preceding - global/generic load - atomic/ - atomicrmw-with-return-value - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - vmcnt(0) and so do - not need to be - considered.) - - waitcnt vscnt(0) - Must happen after - preceding - global/generic store - atomic/ - atomicrmw-no-return-value - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - vscnt(0) and so do - not need to be - considered.) - - Ensures any - preceding - sequential - consistent global - memory instructions - have completed - before executing - this sequentially - consistent - instruction. This - prevents reordering - a seq_cst store - followed by a - seq_cst load. (Note - that seq_cst is - stronger than - acquire/release as - the reordering of - load acquire - followed by a store - release is - prevented by the - waitcnt of - the release, but - there is nothing - preventing a store - release followed by - load acquire from - completing out of - order. The waitcnt - could be placed after - seq_store or before - the seq_load. We - choose the load to - make the waitcnt be - as late as possible - so that the store - may have already - completed.) - - 2. *Following - instructions same as - corresponding load - atomic acquire, - except must generated - all instructions even - for OpenCL.* - - load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & - - system - generic vmcnt(0) vmcnt(0) & vscnt(0) - - - Could be split into - Could be split into - separate s_waitcnt separate s_waitcnt - vmcnt(0) vmcnt(0), s_waitcnt - and s_waitcnt vscnt(0) and s_waitcnt - lgkmcnt(0) to allow lgkmcnt(0) to allow - them to be them to be - independently moved independently moved - according to the according to the - following rules. following rules. 
- - waitcnt lgkmcnt(0) - waitcnt lgkmcnt(0) - must happen after must happen after - preceding preceding - global/generic load local load - atomic/store atomic/store - atomic/atomicrmw atomic/atomicrmw - with memory with memory - ordering of seq_cst ordering of seq_cst - and with equal or and with equal or - wider sync scope. wider sync scope. - (Note that seq_cst (Note that seq_cst - fences have their fences have their - own s_waitcnt own s_waitcnt - lgkmcnt(0) and so do lgkmcnt(0) and so do - not need to be not need to be - considered.) considered.) - - waitcnt vmcnt(0) - waitcnt vmcnt(0) - must happen after must happen after - preceding preceding - global/generic load global/generic load - atomic/store atomic/ - atomic/atomicrmw atomicrmw-with-return-value - with memory with memory - ordering of seq_cst ordering of seq_cst - and with equal or and with equal or - wider sync scope. wider sync scope. - (Note that seq_cst (Note that seq_cst - fences have their fences have their - own s_waitcnt own s_waitcnt - vmcnt(0) and so do vmcnt(0) and so do - not need to be not need to be - considered.) considered.) - - waitcnt vscnt(0) - Must happen after - preceding - global/generic store - atomic/ - atomicrmw-no-return-value - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - vscnt(0) and so do - not need to be - considered.) - - Ensures any - Ensures any - preceding preceding - sequential sequential - consistent global consistent global - memory instructions memory instructions - have completed have completed - before executing before executing - this sequentially this sequentially - consistent consistent - instruction. This instruction. This - prevents reordering prevents reordering - a seq_cst store a seq_cst store - followed by a followed by a - seq_cst load. (Note seq_cst load. 
(Note - that seq_cst is that seq_cst is - stronger than stronger than - acquire/release as acquire/release as - the reordering of the reordering of - load acquire load acquire - followed by a store followed by a store - release is release is - prevented by the prevented by the - waitcnt of waitcnt of - the release, but the release, but - there is nothing there is nothing - preventing a store preventing a store - release followed by release followed by - load acquire from load acquire from - completing out of completing out of - order. The waitcnt order. The waitcnt - could be placed after could be placed after - seq_store or before seq_store or before - the seq_load. We the seq_load. We - choose the load to choose the load to - make the waitcnt be make the waitcnt be - as late as possible as late as possible - so that the store so that the store - may have already may have already - completed.) completed.) - - 2. *Following 2. *Following - instructions same as instructions same as - corresponding load corresponding load - atomic acquire, atomic acquire, - except must generated except must generated - all instructions even all instructions even - for OpenCL.* for OpenCL.* - store atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding - - wavefront - local store atomic release, store atomic release, - - workgroup - generic except must generated except must generated - all instructions even all instructions even - for OpenCL.* for OpenCL.* - store atomic seq_cst - agent - global *Same as corresponding *Same as corresponding - - system - generic store atomic release, store atomic release, - except must generated except must generated - all instructions even all instructions even - for OpenCL.* for OpenCL.* - atomicrmw seq_cst - singlethread - global *Same as corresponding *Same as corresponding - - wavefront - local atomicrmw acq_rel, atomicrmw acq_rel, - - workgroup - generic except must generated except must generated - all instructions 
even all instructions even - for OpenCL.* for OpenCL.* - atomicrmw seq_cst - agent - global *Same as corresponding *Same as corresponding - - system - generic atomicrmw acq_rel, atomicrmw acq_rel, - except must generated except must generated - all instructions even all instructions even - for OpenCL.* for OpenCL.* - fence seq_cst - singlethread *none* *Same as corresponding *Same as corresponding - - wavefront fence acq_rel, fence acq_rel, - - workgroup except must generated except must generated - - agent all instructions even all instructions even - - system for OpenCL.* for OpenCL.* - ============ ============ ============== ========== ================================ ================================ - -The memory order also adds the single thread optimization constrains defined in -table -:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`. + load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) - .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10 - :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table + - Could be split into + separate s_waitcnt + vmcnt(0) + and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt lgkmcnt(0) + must happen after + preceding + global/generic load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + lgkmcnt(0) and so do + not need to be + considered.) + - s_waitcnt vmcnt(0) + must happen after + preceding + global/generic load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) 
+ - Ensures any + preceding + sequentially + consistent global + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load. (Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + s_waitcnt of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The s_waitcnt + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the s_waitcnt be + as late as possible + so that the store + may have already + completed.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generate + all instructions even + for OpenCL.* + store atomic seq_cst - singlethread - global *Same as corresponding + - wavefront - local store atomic release, + - workgroup - generic except must generate + all instructions even + for OpenCL.* + store atomic seq_cst - agent - global *Same as corresponding + - system - generic store atomic release, + except must generate + all instructions even + for OpenCL.* + atomicrmw seq_cst - singlethread - global *Same as corresponding + - wavefront - local atomicrmw acq_rel, + - workgroup - generic except must generate + all instructions even + for OpenCL.* + atomicrmw seq_cst - agent - global *Same as corresponding + - system - generic atomicrmw acq_rel, + except must generate + all instructions even + for OpenCL.* + fence seq_cst - singlethread *none* *Same as corresponding + - wavefront fence acq_rel, + - workgroup except must generate + - agent all instructions even + - system for OpenCL.* + ============ ============ ============== ========== ================================ - ============ ============================================================== - LLVM Memory Optimization Constraints -
Ordering - ============ ============================================================== - unordered *none* - monotonic *none* - acquire - If a load atomic/atomicrmw then no following load/load - atomic/store/ store atomic/atomicrmw/fence instruction can - be moved before the acquire. - - If a fence then same as load atomic, plus no preceding - associated fence-paired-atomic can be moved after the fence. - release - If a store atomic/atomicrmw then no preceding load/load - atomic/store/ store atomic/atomicrmw/fence instruction can - be moved after the release. - - If a fence then same as store atomic, plus no following - associated fence-paired-atomic can be moved before the - fence. - acq_rel Same constraints as both acquire and release. - seq_cst - If a load atomic then same constraints as acquire, plus no - preceding sequentially consistent load atomic/store - atomic/atomicrmw/fence instruction can be moved after the - seq_cst. - - If a store atomic then the same constraints as release, plus - no following sequentially consistent load atomic/store - atomic/atomicrmw/fence instruction can be moved before the - seq_cst. - - If an atomicrmw/fence then same constraints as acq_rel. - ============ ============================================================== +.. _amdgpu-amdhsa-memory-model-gfx10: + +Memory Model GFX10 +++++++++++++++++++ + +For GFX10: + +* Each agent has multiple shader arrays (SA). +* Each SA has multiple work-group processors (WGP). +* Each WGP has multiple compute units (CU). +* Each CU has multiple SIMDs that execute wavefronts. +* The wavefronts for a single work-group are executed in the same + WGP. In CU wavefront execution mode the wavefronts may be executed by + different SIMDs in the same CU. In WGP wavefront execution mode the + wavefronts may be executed by different SIMDs in different CUs in the same + WGP. +* Each WGP has a single LDS memory shared by the wavefronts of the work-groups + executing on it. 
+* All LDS operations of a WGP are performed as wavefront wide operations in a + global order and involve no caching. Completion is reported to a wavefront in + execution order. +* The LDS memory has multiple request queues shared by the SIMDs of a + WGP. Therefore, the LDS operations performed by different wavefronts of a + work-group can be reordered relative to each other, which can result in + reordering the visibility of vector memory operations with respect to LDS + operations of other wavefronts in the same work-group. A ``s_waitcnt + lgkmcnt(0)`` is required to ensure synchronization between LDS operations and + vector memory operations between wavefronts of a work-group, but not between + operations performed by the same wavefront. +* The vector memory operations are performed as wavefront wide operations. + Completion of load/store/sample operations is reported to a wavefront in + execution order of other load/store/sample operations performed by that + wavefront. +* The vector memory operations access a vector L0 cache. There is a single L0 + cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no + special action is required for coherence between the lanes of a single + wavefront. However, a ``buffer_gl0_inv`` is required for coherence between + wavefronts executing in the same work-group as they may be executing on SIMDs + of different CUs that access different L0s. A ``buffer_gl0_inv`` is also + required for coherence between wavefronts executing in different work-groups + as they may be executing on different WGPs. +* The scalar memory operations access a scalar L0 cache shared by all wavefronts + on a WGP. The scalar and vector L0 caches are not coherent. However, scalar + operations are used in a restricted way so do not impact the memory model. See + :ref:`amdgpu-amdhsa-memory-spaces`. +* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on + the same SA.
Therefore, no special action is required for coherence between + the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is + required for coherence between wavefronts executing in different work-groups + as they may be executing on different SAs that access different L1s. +* The L1 caches have independent quadrants to service disjoint ranges of virtual + addresses. +* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the + vector and scalar memory operations performed by different wavefronts, whether + executing in the same or different work-groups (which may be executing on + different CUs accessing different L0s), can be reordered relative to each + other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure + synchronization between vector memory operations of different wavefronts. It + ensures a previous vector memory operation has completed before executing a + subsequent vector memory or LDS operation and so can be used to meet the + requirements of acquire, release and sequential consistency. +* The L1 caches use an L2 cache shared by all SAs on the same agent. +* The L2 cache has independent channels to service disjoint ranges of virtual + addresses. +* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 + quadrant has a separate request queue per L2 channel. Therefore, the vector + and scalar memory operations performed by wavefronts executing in different + work-groups (which may be executing on different SAs) of an agent can be + reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is + required to ensure synchronization between vector memory operations of + different SAs. It ensures a previous vector memory operation has completed + before executing a subsequent vector memory operation and so can be used to meet the + requirements of acquire, release and sequential consistency.
+* The L2 cache can be kept coherent with other agents on some targets, or ranges + of virtual addresses can be set up to bypass it to ensure system coherence. + +Scalar memory operations are only used to access memory that is proven to not +change during the execution of the kernel dispatch. This includes constant +address space and global address space for program scope const variables. +Therefore, the kernel machine code does not have to maintain the scalar cache to +ensure it is coherent with the vector caches. The scalar and vector caches are +invalidated between kernel dispatches by CP since constant address space data +may change between kernel dispatch executions. See +:ref:`amdgpu-amdhsa-memory-spaces`. + +The one exception is if scalar writes are used to spill SGPR registers. In this +case the AMDGPU backend ensures the memory location used to spill is never +accessed by vector memory operations at the same time. If scalar writes are used +then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function +return since the locations may be used for vector memory instructions by a +future wavefront that uses the same scratch area, or a function call that +creates a frame at the same address, respectively. There is no need for a +``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. + +For kernarg backing memory: + +* CP invalidates the L0 and L1 caches at the start of each kernel dispatch. +* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid + needing to invalidate the L2 cache. +* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and + so the L2 cache will be coherent with the CPU and other agents. + +Scratch backing memory (which is used for the private address space) is accessed +with MTYPE NC (non-coherent).
Since the private address space is only accessed +by a single thread, and is always write-before-read, there is never a need to +invalidate these entries from the L0 or L1 caches. + +Wavefronts are executed in native mode with in-order reporting of loads and +sample instructions. In this mode vmcnt reports completion of load, atomic with +return and sample instructions in order, and the vscnt reports the completion of +store and atomic without return in order. See ``MEM_ORDERED`` field in +:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + +Wavefronts can be executed in WGP or CU wavefront execution mode: + +* In WGP wavefront execution mode the wavefronts of a work-group are executed + on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per + CU L0 caches is required for work-group synchronization. Also accesses to L1 + at work-group scope need to be explicitly ordered as the accesses from + different CUs are not ordered. +* In CU wavefront execution mode the wavefronts of a work-group are executed on + the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by + the work-group access the same L0 which in turn ensures L1 accesses are + ordered and so do not require explicit management of the caches for + work-group synchronization. + +See ``WGP_MODE`` field in +:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and +:ref:`amdgpu-target-features`. + +The code sequences used to implement the memory model for GFX10 are defined in +table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`. + + ..
table:: AMDHSA Memory Model Code Sequences GFX10 + :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table + + ============ ============ ============== ========== ================================ + LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code + Ordering Sync Scope Address GFX10 + Space + ============ ============ ============== ========== ================================ + **Non-Atomic** + ------------------------------------------------------------------------------------ + load *none* *none* - global - !volatile & !nontemporal + - generic + - private 1. buffer/global/flat_load + - constant + - volatile & !nontemporal + + 1. buffer/global/flat_load + glc=1 dlc=1 + + - nontemporal + + 1. buffer/global/flat_load + slc=1 + + load *none* *none* - local 1. ds_load + store *none* *none* - global - !nontemporal + - generic + - private 1. buffer/global/flat_store + - constant + - nontemporal + + 1. buffer/global/flat_store + slc=1 + + store *none* *none* - local 1. ds_store + **Unordered Atomic** + ------------------------------------------------------------------------------------ + load atomic unordered *any* *any* *Same as non-atomic*. + store atomic unordered *any* *any* *Same as non-atomic*. + atomicrmw unordered *any* *any* *Same as monotonic + atomic*. + **Monotonic Atomic** + ------------------------------------------------------------------------------------ + load atomic monotonic - singlethread - global 1. buffer/global/flat_load + - wavefront - generic + load atomic monotonic - workgroup - global 1. buffer/global/flat_load + - generic glc=1 + + - If CU wavefront execution + mode, omit glc=1. + + load atomic monotonic - singlethread - local 1. ds_load + - wavefront + - workgroup + load atomic monotonic - agent - global 1. buffer/global/flat_load + - system - generic glc=1 dlc=1 + store atomic monotonic - singlethread - global 1. 
buffer/global/flat_store + - wavefront - generic + - workgroup + - agent + - system + store atomic monotonic - singlethread - local 1. ds_store + - wavefront + - workgroup + atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic + - wavefront - generic + - workgroup + - agent + - system + atomicrmw monotonic - singlethread - local 1. ds_atomic + - wavefront + - workgroup + **Acquire Atomic** + ------------------------------------------------------------------------------------ + load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load + - wavefront - local + - generic + load atomic acquire - workgroup - global 1. buffer/global_load glc=1 + + - If CU wavefront execution + mode, omit glc=1. + + 2. s_waitcnt vmcnt(0) + + - If CU wavefront execution + mode, omit. + - Must happen before + the following buffer_gl0_inv + and before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + + 3. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + load atomic acquire - workgroup - local 1. ds_load + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen before + the following buffer_gl0_inv + and before any following + global/generic load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - If OpenCL, omit. + - Ensures that + following + loads will not see + stale data. + + load atomic acquire - workgroup - generic 1. flat_load glc=1 + + - If CU wavefront execution + mode, omit glc=1. + + 2. s_waitcnt lgkmcnt(0) & + vmcnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0). + - If OpenCL, omit + lgkmcnt(0). + - Must happen before + the following + buffer_gl0_inv and any + following global/generic + load/load + atomic/store/store + atomic/atomicrmw. 
+ - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + load atomic acquire - agent - global 1. buffer/global_load + - system glc=1 dlc=1 + 2. s_waitcnt vmcnt(0) + + - Must happen before + following + buffer_gl*_inv. + - Ensures the load + has completed + before invalidating + the caches. + + 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following + loads will not see + stale global data. + + load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1 + - system 2. s_waitcnt vmcnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Must happen before + following + buffer_gl*_inv. + - Ensures the flat_load + has completed + before invalidating + the caches. + + 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acquire - workgroup - global 1. buffer/global_atomic + 2. s_waitcnt vm/vscnt(0) + + - If CU wavefront execution + mode, omit. + - Use vmcnt(0) if atomic with + return and vscnt(0) if + atomic with no-return. + - Must happen before + the following buffer_gl0_inv + and before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + + 3. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - workgroup - local 1. ds_atomic + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen before + the following + buffer_gl0_inv.
+ - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + 3. buffer_gl0_inv + + - If OpenCL omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - workgroup - generic 1. flat_atomic + 2. s_waitcnt lgkmcnt(0) & + vm/vscnt(0) + + - If CU wavefront execution + mode, omit vm/vscnt(0). + - If OpenCL, omit lgkmcnt(0). + - Use vmcnt(0) if atomic with + return and vscnt(0) if + atomic with no-return. + - Must happen before + the following + buffer_gl0_inv. + - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - agent - global 1. buffer/global_atomic + - system 2. s_waitcnt vm/vscnt(0) + + - Use vmcnt(0) if atomic with + return and vscnt(0) if + atomic with no-return. + - Must happen before + following + buffer_gl*_inv. + - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - agent - generic 1. flat_atomic + - system 2. s_waitcnt vm/vscnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Use vmcnt(0) if atomic with + return and vscnt(0) if + atomic with no-return. + - Must happen before + following + buffer_gl*_inv. + - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acquire - singlethread *none* *none* + - wavefront + fence acquire - workgroup *none* 1. 
s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate. If + fence had an + address space then + set to address + space of OpenCL + fence flag, or to + generic if both + local and global + flags are + specified. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_gl0_inv. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + + 2. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + fence acquire - agent *none* 1.
s_waitcnt lgkmcnt(0) & + - system vmcnt(0) & vscnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_gl*_inv. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + caches. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + + 2. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before any + following global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + **Release Atomic** + ------------------------------------------------------------------------------------ + store atomic release - singlethread - global 1. 
buffer/global/ds/flat_store + - wavefront - local + - generic + store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + store. + - Ensures that all + memory operations + have + completed before + performing the + store that is being + released. + + 2. buffer/global_store + store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 2. ds_store + store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL, omit + lgkmcnt(0). 
+ - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + store. + - Ensures that all + memory operations + have + completed before + performing the + store that is being + released. + + 2. flat_store + store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) & vscnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt vscnt(0) + and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + store. + - Ensures that all + memory operations + to memory have + completed before + performing the + store that is being + released. + + 2. buffer/global/flat_store + atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). 
+ - If OpenCL, omit lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 2. ds_atomic + atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. 
+ - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) & vscnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to global and local + have completed + before performing + the atomicrmw that + is being released. + + 2. buffer/global/flat_atomic + fence release - singlethread *none* *none* + - wavefront + fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate.
If + fence had an + address space then + set to address + space of OpenCL + fence flag, or to + generic if both + local and global + flags are + specified. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. + - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that all + memory operations + have + completed before + performing the + following + fence-paired-atomic. + + fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) & vscnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate. If + fence had an + address space then + set to address + space of OpenCL + fence flag, or to + generic if both + local and global + flags are + specified. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load atomic/ + atomicrmw-with-return-value. 
+ - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that all + memory operations + have + completed before + performing the + following + fence-paired-atomic. + + **Acquire-Release Atomic** + ------------------------------------------------------------------------------------ + atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL, omit + lgkmcnt(0). + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0), and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + 3. s_waitcnt vm/vscnt(0) + + - If CU wavefront execution + mode, omit. 
+ - Use vmcnt(0) if atomic with + return and vscnt(0) if + atomic with no-return. + - Must happen before + the following + buffer_gl0_inv. + - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + 4. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + atomicrmw. + - Ensures that all + global memory + operations have + completed before + performing the + atomicrmw that is + being released. + + 2. ds_atomic + 3. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. + - Must happen before + the following + buffer_gl0_inv. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + 4. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - If OpenCL, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL, omit lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value.
+ - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + 3. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL, omit lgkmcnt(0). + - Must happen before + the following + buffer_gl0_inv. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + 4. buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) & vscnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to global have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + 3. s_waitcnt vm/vscnt(0) + + - Use vmcnt(0) if atomic with + return and vscnt(0) if + atomic with no-return. + - Must happen before + following + buffer_gl*_inv.
+ - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 4. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) & vscnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0), and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + 3. s_waitcnt vm/vscnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Use vmcnt(0) if atomic with + return and vscnt(0) if + atomic with no-return. + - Must happen before + following + buffer_gl*_inv. + - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 4. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acq_rel - singlethread *none* *none* + - wavefront + fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0).
+ - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, + since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that all + memory operations + have + completed before + performing any + following global + memory operations. + - Ensures that the + preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + acquire-fence-paired-atomic) + has completed + before following + global memory + operations. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + local/generic store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + release-fence-paired-atomic). + This satisfies the + requirements of + release. + - Must happen before + the following + buffer_gl0_inv. + - Ensures that the + acquire-fence-paired-atomic + has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + acquire-fence-paired-atomic. + + 2.
buffer_gl0_inv + + - If CU wavefront execution + mode, omit. + - Ensures that + following + loads will not see + stale data. + + fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) & vscnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + buffer_gl*_inv. + - Ensures that the + preceding + global/local/generic + load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + acquire-fence-paired-atomic) + has completed + before invalidating + the caches. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + global/local/generic + store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + release-fence-paired-atomic). + This satisfies the + requirements of + release. + + 2. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. 
+ - Ensures that + following loads + will not see stale + global data. This + satisfies the + requirements of + acquire. + + **Sequential Consistent Atomic** + ------------------------------------------------------------------------------------ + load atomic seq_cst - singlethread - global *Same as corresponding + - wavefront - local load atomic acquire, + - generic except must generate + all instructions even + for OpenCL.* + load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) & + - generic vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit vmcnt(0) and + vscnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0), and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt lgkmcnt(0) must + happen after + preceding + local/generic load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + lgkmcnt(0) and so do + not need to be + considered.) + - s_waitcnt vmcnt(0) + must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) + - s_waitcnt vscnt(0) + must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any + preceding + sequentially + consistent global/local + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load.
(Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + s_waitcnt of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The s_waitcnt + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the s_waitcnt be + as late as possible + so that the store + may have already + completed.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generate + all instructions even + for OpenCL.* + load atomic seq_cst - workgroup - local + + 1. s_waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution + mode, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) + - s_waitcnt vscnt(0) + must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any + preceding + sequentially + consistent global + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load.
(Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + s_waitcnt of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The s_waitcnt + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the s_waitcnt be + as late as possible + so that the store + may have already + completed.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generate + all instructions even + for OpenCL.* + + load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) & vscnt(0) + + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt lgkmcnt(0) + must happen after + preceding + local load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + lgkmcnt(0) and so do + not need to be + considered.) + - s_waitcnt vmcnt(0) + must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) + - s_waitcnt vscnt(0) + must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any + preceding + sequentially + consistent global + memory instructions + have completed + before executing + this sequentially + consistent + instruction.
This + prevents reordering + a seq_cst store + followed by a + seq_cst load. (Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + s_waitcnt of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The s_waitcnt + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the s_waitcnt be + as late as possible + so that the store + may have already + completed.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generate + all instructions even + for OpenCL.* + store atomic seq_cst - singlethread - global *Same as corresponding + - wavefront - local store atomic release, + - workgroup - generic except must generate + all instructions even + for OpenCL.* + store atomic seq_cst - agent - global *Same as corresponding + - system - generic store atomic release, + except must generate + all instructions even + for OpenCL.* + atomicrmw seq_cst - singlethread - global *Same as corresponding + - wavefront - local atomicrmw acq_rel, + - workgroup - generic except must generate + all instructions even + for OpenCL.* + atomicrmw seq_cst - agent - global *Same as corresponding + - system - generic atomicrmw acq_rel, + except must generate + all instructions even + for OpenCL.* + fence seq_cst - singlethread *none* *Same as corresponding + - wavefront fence acq_rel, + - workgroup except must generate + - agent all instructions even + - system for OpenCL.* + ============ ============ ============== ========== ================================ Trap Handler ABI ~~~~~~~~~~~~~~~~ @@ -6738,7 +7886,7 @@ after the last local allocation. 9. All other registers are unspecified. -10. Any necessary ``waitcnt`` has been performed to ensure memory is available +10.
Any necessary ``s_waitcnt`` has been performed to ensure memory is available to the function. On exit from a function: @@ -6778,7 +7926,7 @@ 2. The PC is set to the RA provided on entry. 3. MODE register: *TBD*. 4. All other registers are clobbered. -5. Any necessary ``waitcnt`` has been performed to ensure memory accessed by +5. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by function is available to the caller. .. TODO::