@@ -503,6 +503,11 @@ The AMDGPU backend uses the following ELF header:
target feature is
enabled for all code
contained in the code object.
+ If the processor
+ does not support the
+ ``xnack`` target
+ feature then must
+ be 0.
See
:ref:`amdgpu-target-features`.
================================= ========== =============================
@@ -1455,7 +1460,7 @@ address to physical address is:
There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instruction with
- the scratch buffer descriptor and per wave scratch offset, by the scratch
+ the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
@@ -1796,7 +1801,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Bits Size Field Name Description
======= ======= =============================== ===========================================================================
0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
- _WAVE_OFFSET SGPR wave scratch offset
+ _WAVEFRONT_OFFSET SGPR wavefront scratch offset
system register (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
@@ -1883,7 +1888,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
exceptions exceptions
enabled which are generated
when a memory violation has
- occurred for this wave from
+ occurred for this wavefront from
L1 or LDS
(write-to-read-only-memory,
mis-aligned atomic, LDS
@@ -2007,10 +2012,10 @@ SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.

The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
- all waves of the grid. It is possible to specify more than 16 User SGPRs using
+ all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
initialized. These are then immediately followed by the System SGPRs that are
- set up by ADC/SPI and can have different values for each wave of the grid
+ set up by ADC/SPI and can have different values for each wavefront of the grid
dispatch.

SGPR register initial state is defined in
@@ -2025,10 +2030,10 @@ SGPR register initial state is defined in
field) SGPRs
========== ========================== ====== ==============================
First Private Segment Buffer 4 V# that can be used, together
- (enable_sgpr_private with Scratch Wave Offset as an
- _segment_buffer) offset, to access the private
- memory space using a segment
- address.
+ (enable_sgpr_private with Scratch Wavefront Offset
+ _segment_buffer) as an offset, to access the
+ private memory space using a
+ segment address.

CP uses the value provided by
the runtime.
@@ -2068,7 +2073,7 @@ SGPR register initial state is defined in
address is
``SH_HIDDEN_PRIVATE_BASE_VIMID``
plus this offset.) The value
- of Scratch Wave Offset must
+ of Scratch Wavefront Offset must
be added to this offset by
the kernel machine code,
right shifted by 8, and
@@ -2078,13 +2083,13 @@ SGPR register initial state is defined in
to SGPRn-4 on GFX7, and
SGPRn-6 on GFX8 (where SGPRn
is the highest numbered SGPR
- allocated to the wave).
+ allocated to the wavefront).
FLAT_SCRATCH_HI is
multiplied by 256 (as it is
in units of 256 bytes) and
added to
``SH_HIDDEN_PRIVATE_BASE_VIMID``
- to calculate the per wave
+ to calculate the per wavefront
FLAT SCRATCH BASE in flat
memory instructions that
access the scratch
@@ -2124,7 +2129,7 @@ SGPR register initial state is defined in
divides it if there are
multiple Shader Arrays each
with its own SPI). The value
- of Scratch Wave Offset must
+ of Scratch Wavefront Offset must
be added by the kernel
machine code and the result
moved to the FLAT_SCRATCH
@@ -2193,12 +2198,12 @@ SGPR register initial state is defined in
then Work-Group Id Z 1 32 bit work-group id in Z
(enable_sgpr_workgroup_id dimension of grid for
_Z) wavefront.
- then Work-Group Info 1 {first_wave, 14'b0000,
+ then Work-Group Info 1 {first_wavefront, 14'b0000,
(enable_sgpr_workgroup ordered_append_term[10:0],
- _info) threadgroup_size_in_waves[5:0]}
- then Scratch Wave Offset 1 32 bit byte offset from base
+ _info) threadgroup_size_in_wavefronts[5:0]}
+ then Scratch Wavefront Offset 1 32 bit byte offset from base
(enable_sgpr_private of scratch base of queue
- _segment_wave_offset) executing the kernel
+ _segment_wavefront_offset) executing the kernel
dispatch. Must be used as an
offset with Private
segment address when using
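
Note on the hunk above: the Work-Group Info layout, {first_wavefront, 14'b0000, ordered_append_term[10:0], threadgroup_size_in_wavefronts[5:0]}, adds up to 32 bits (1 + 14 + 11 + 6). Below is a minimal Python sketch of unpacking such a value. It assumes the braces denote a Verilog-style concatenation with the most significant field listed first; the diff itself does not state the bit order. The `WorkGroupInfo` type and `decode_work_group_info` helper are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class WorkGroupInfo(NamedTuple):
    """Hypothetical container for the three named fields of Work-Group Info."""
    first_wavefront: int
    ordered_append_term: int
    threadgroup_size_in_wavefronts: int


def decode_work_group_info(sgpr: int) -> WorkGroupInfo:
    """Unpack a 32-bit Work-Group Info value.

    Assumed layout (MSB first, Verilog-style concatenation):
      bit  31     first_wavefront
      bits 30:17  14 zero bits
      bits 16:6   ordered_append_term[10:0]
      bits 5:0    threadgroup_size_in_wavefronts[5:0]
    """
    sgpr &= 0xFFFFFFFF  # treat the input as an unsigned 32-bit register value
    return WorkGroupInfo(
        first_wavefront=(sgpr >> 31) & 0x1,
        ordered_append_term=(sgpr >> 6) & 0x7FF,
        threadgroup_size_in_wavefronts=sgpr & 0x3F,
    )
```

For example, under that assumed bit order, a value with bit 31 set, an ordered append term of 5, and a size of 4 wavefronts would be built as `(1 << 31) | (5 << 6) | 4`.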
@@ -2244,8 +2249,8 @@ The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
combination including none.
- 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
- cannot included with the flat scratch init value which is per queue.
+ 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
+ its value cannot included with the flat scratch init value which is per queue.
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
or (X, Y, Z).
@@ -2293,7 +2298,7 @@ Flat Scratch
If the kernel may use flat operations to access scratch memory, the prolog code
must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
- are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
+ are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

GFX6
@@ -2304,7 +2309,7 @@ GFX7-GFX8
``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
being managed by SPI for the queue executing the kernel dispatch. This is
the same value used in the Scratch Segment Buffer V# base address. The
- prolog must add the value of Scratch Wave Offset to get the wave's byte
+ prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
by 8 before moving into FLAT_SCRATCH_LO.
@@ -2318,7 +2323,7 @@ GFX7-GFX8
GFX9
The Flat Scratch Init is the 64 bit address of the base of scratch backing
memory being managed by SPI for the queue executing the kernel dispatch. The
- prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
+ prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
pair for use as the flat scratch base in flat memory instructions.

.. _amdgpu-amdhsa-memory-model:
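
Note on the Flat Scratch hunks above: the prolog arithmetic they describe can be sketched in Python as follows. This is an illustration of the arithmetic only, not backend code. The function names, the 32-bit and 64-bit wrap-around masks, and the low/high split of the GFX9 result are assumptions made for this example. The sketch deliberately returns the 256-byte-unit value without naming a destination register half.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_offset_gfx7_gfx8(flat_scratch_init: int,
                                  scratch_wavefront_offset: int) -> int:
    """GFX7-GFX8: add the per-wavefront byte offset to the per-queue byte
    offset, then right shift by 8 because the register is in units of
    256 bytes. 32-bit wrap-around is an assumption of this sketch."""
    byte_offset = (flat_scratch_init + scratch_wavefront_offset) & MASK32
    return byte_offset >> 8


def flat_scratch_base_gfx9(flat_scratch_init: int,
                           scratch_wavefront_offset: int) -> Tuple[int, int]:
    """GFX9: add the per-wavefront byte offset to the 64-bit base address.
    Returns (low 32 bits, high 32 bits); the split is an assumption."""
    base = (flat_scratch_init + scratch_wavefront_offset) & MASK64
    return base & MASK32, base >> 32
```

For instance, a per-queue offset of 0x1000 bytes plus a wavefront offset of 0x300 bytes gives 0x1300 bytes, which is 0x13 in 256-byte units.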
@@ -2384,12 +2389,12 @@ For GFX6-GFX9:
global order and involve no caching. Completion is reported to a wavefront in
execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
- CU. Therefore, the LDS operations performed by different waves of a work-group
+ CU. Therefore, the LDS operations performed by different wavefronts of a work-group
can be reordered relative to each other, which can result in reordering the
visibility of vector memory operations with respect to LDS operations of other
wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
ensure synchronization between LDS operations and vector memory operations
- between waves of a work-group, but not between operations performed by the
+ between wavefronts of a work-group, but not between operations performed by the
same wavefront.
* The vector memory operations are performed as wavefront wide operations and
completion is reported to a wavefront in execution order. The exception is
@@ -2399,7 +2404,7 @@ For GFX6-GFX9:
* The vector memory operations access a single vector L1 cache shared by all
SIMDs a CU. Therefore, no special action is required for coherence between the
lanes of a single wavefront, or for coherence between wavefronts in the same
- work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
+ work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
executing in different work-groups as they may be executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
@@ -2410,7 +2415,7 @@ For GFX6-GFX9:
* The L2 cache has independent channels to service disjoint ranges of virtual
addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
- scalar memory operations performed by waves executing in different work-groups
+ scalar memory operations performed by wavefronts executing in different work-groups
(which may be executing on different CUs) of an agent can be reordered
relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
synchronization between vector memory operations of different CUs. It ensures a
@@ -2460,7 +2465,7 @@ case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
- future wave that uses the same scratch area, or a function call that creates a
+ future wavefront that uses the same scratch area, or a function call that creates a
frame at the same address, respectively. There is no need for a ``s_dcache_inv``
as all scalar writes are write-before-read in the same thread.