Index: docs/AMDGPUUsage.rst =================================================================== --- docs/AMDGPUUsage.rst +++ docs/AMDGPUUsage.rst @@ -503,6 +503,11 @@ target feature is enabled for all code contained in the code object. + If the processor + does not support the + ``xnack`` target + feature then must + be 0. See :ref:`amdgpu-target-features`. ================================= ========== ============================= @@ -1455,7 +1460,7 @@ There are different ways that the wavefront scratch base address is determined by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This memory can be accessed in an interleaved manner using buffer instruction with -the scratch buffer descriptor and per wave scratch offset, by the scratch +the scratch buffer descriptor and per wavefront scratch offset, by the scratch instructions, or by flat instructions. If each lane of a wavefront accesses the same private address, the interleaving results in adjacent dwords being accessed and hence requires fewer cache lines to be fetched. Multi-dword access is not @@ -1796,7 +1801,7 @@ Bits Size Field Name Description ======= ======= =============================== =========================================================================== 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the - _WAVE_OFFSET SGPR wave scratch offset + _WAVEFRONT_OFFSET SGPR wavefront scratch offset system register (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). @@ -1883,7 +1888,7 @@ exceptions exceptions enabled which are generated when a memory violation has - occurred for this wave from + occurred for this wavefront from L1 or LDS (write-to-read-only-memory, mis-aligned atomic, LDS @@ -2007,10 +2012,10 @@ an SGPR number. The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to -all waves of the grid. It is possible to specify more than 16 User SGPRs using +all wavefronts of the grid. 
It is possible to specify more than 16 User SGPRs using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually initialized. These are then immediately followed by the System SGPRs that are -set up by ADC/SPI and can have different values for each wave of the grid +set up by ADC/SPI and can have different values for each wavefront of the grid dispatch. SGPR register initial state is defined in @@ -2025,10 +2030,10 @@ field) SGPRs ========== ========================== ====== ============================== First Private Segment Buffer 4 V# that can be used, together - (enable_sgpr_private with Scratch Wave Offset as an - _segment_buffer) offset, to access the private - memory space using a segment - address. + (enable_sgpr_private with Scratch Wavefront Offset + _segment_buffer) as an offset, to access the + private memory space using a + segment address. CP uses the value provided by the runtime. @@ -2068,7 +2073,7 @@ address is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.) The value - of Scratch Wave Offset must + of Scratch Wavefront Offset must be added to this offset by the kernel machine code, right shifted by 8, and @@ -2078,13 +2083,13 @@ to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where SGPRn is the highest numbered SGPR - allocated to the wave). + allocated to the wavefront). FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` - to calculate the per wave + to calculate the per wavefront FLAT SCRATCH BASE in flat memory instructions that access the scratch @@ -2124,7 +2129,7 @@ divides it if there are multiple Shader Arrays each with its own SPI). The value - of Scratch Wave Offset must + of Scratch Wavefront Offset must be added by the kernel machine code and the result moved to the FLAT_SCRATCH @@ -2193,12 +2198,12 @@ then Work-Group Id Z 1 32 bit work-group id in Z (enable_sgpr_workgroup_id dimension of grid for _Z) wavefront. 
- then Work-Group Info 1 {first_wave, 14'b0000, + then Work-Group Info 1 {first_wavefront, 14'b0000, (enable_sgpr_workgroup ordered_append_term[10:0], - _info) threadgroup_size_in_waves[5:0]} - then Scratch Wave Offset 1 32 bit byte offset from base + _info) threadgroup_size_in_wavefronts[5:0]} + then Scratch Wavefront Offset 1 32 bit byte offset from base (enable_sgpr_private of scratch base of queue - _segment_wave_offset) executing the kernel + _segment_wavefront_offset) executing the kernel dispatch. Must be used as an offset with Private segment address when using @@ -2244,8 +2249,8 @@ registers. 2. Work-group Id registers X, Y, Z are set by ADC which supports any combination including none. -3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value - cannot included with the flat scratch init value which is per queue. +3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why + its value cannot be included with the flat scratch init value which is per queue. 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) or (X, Y, Z). @@ -2293,7 +2298,7 @@ If the kernel may use flat operations to access scratch memory, the prolog code must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which -are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave +are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`): GFX6 @@ -2304,7 +2309,7 @@ ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory being managed by SPI for the queue executing the kernel dispatch. This is the same value used in the Scratch Segment Buffer V# base address. 
The - prolog must add the value of Scratch Wave Offset to get the wave's byte + prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted by 8 before moving into FLAT_SCRATCH_LO. @@ -2318,7 +2323,7 @@ GFX9 The Flat Scratch Init is the 64 bit address of the base of scratch backing memory being managed by SPI for the queue executing the kernel dispatch. The - prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH + prolog must add the value of Scratch Wavefront Offset and move the result to the FLAT_SCRATCH pair for use as the flat scratch base in flat memory instructions. .. _amdgpu-amdhsa-memory-model: @@ -2384,12 +2389,12 @@ global order and involve no caching. Completion is reported to a wavefront in execution order. * The LDS memory has multiple request queues shared by the SIMDs of a - CU. Therefore, the LDS operations performed by different waves of a work-group + CU. Therefore, the LDS operations performed by different wavefronts of a work-group can be reordered relative to each other, which can result in reordering the visibility of vector memory operations with respect to LDS operations of other wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to ensure synchronization between LDS operations and vector memory operations - between waves of a work-group, but not between operations performed by the + between wavefronts of a work-group, but not between operations performed by the same wavefront. * The vector memory operations are performed as wavefront wide operations and completion is reported to a wavefront in execution order. The exception is @@ -2399,7 +2404,7 @@ * The vector memory operations access a single vector L1 cache shared by all SIMDs a CU. 
Therefore, no special action is required for coherence between the lanes of a single wavefront, or for coherence between wavefronts in the same - work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves + work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts executing in different work-groups as they may be executing on different CUs. * The scalar memory operations access a scalar L1 cache shared by all wavefronts on a group of CUs. The scalar and vector L1 caches are not coherent. However, @@ -2410,7 +2415,7 @@ * The L2 cache has independent channels to service disjoint ranges of virtual addresses. * Each CU has a separate request queue per channel. Therefore, the vector and - scalar memory operations performed by waves executing in different work-groups + scalar memory operations performed by wavefronts executing in different work-groups (which may be executing on different CUs) of an agent can be reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between vector memory operations of different CUs. It ensures a @@ -2460,7 +2465,7 @@ accessed by vector memory operations at the same time. If scalar writes are used then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function return since the locations may be used for vector memory instructions by a -future wave that uses the same scratch area, or a function call that creates a +future wavefront that uses the same scratch area, or a function call that creates a frame at the same address, respectively. There is no need for a ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
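Reviewer note: the FLAT_SCRATCH prolog arithmetic that the renamed hunks above describe can be sketched in Python as follows. This is only an illustration of the two stated computations, not the code the backend emits. The function names, the parameter names, and the wrap-around to 32 and 64 bits are assumptions made for the sketch and do not appear in the documentation.

```python
# Hypothetical helper names; only the arithmetic comes from the documented text.
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_units_256(flat_scratch_init_offset: int,
                           scratch_wavefront_offset: int) -> int:
    """32 bit offset form: add the Scratch Wavefront Offset to the queue's
    byte offset, then right shift by 8 because the register is in units of
    256 bytes."""
    # Assumption: the sum wraps like a 32 bit scalar register.
    byte_offset = (flat_scratch_init_offset + scratch_wavefront_offset) & MASK32
    return byte_offset >> 8


def flat_scratch_base_gfx9(flat_scratch_init_address: int,
                           scratch_wavefront_offset: int) -> int:
    """GFX9 form: add the Scratch Wavefront Offset to the 64 bit base address
    of the queue's scratch backing memory."""
    # Assumption: the sum wraps like a 64 bit register pair.
    return (flat_scratch_init_address + scratch_wavefront_offset) & MASK64


if __name__ == "__main__":
    # Example inputs are made up for illustration.
    print(hex(flat_scratch_units_256(0x1000, 0x300)))
    print(hex(flat_scratch_base_gfx9(0x1_0000_0000, 0x400)))
```

The shift in the first helper discards the low 8 bits of the byte offset, so the sketch relies on the sum being a multiple of 256 bytes; the hunks shown here do not say whether that alignment is guaranteed.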